AI models capable of comprehending humor hold real-world promise, for example in enhancing engagement in human-machine interactions. To gauge and diagnose the capacity of multimodal large language models (MLLMs) for humor understanding, we introduce v-HUB, a novel visual-centric video humor understanding benchmark. v-HUB comprises a curated collection of minimally verbal short videos, sourced from classic silent films and online resources, that reflect real-world scenarios in which humor can be appreciated purely through visual cues. Each video clip is paired with rich annotations, including captions, descriptions, and explanations, supporting evaluation tasks such as caption matching and humor explanation. To broaden its applicability, we further construct an open-ended video QA task, making v-HUB readily integrable into existing video understanding benchmarks. We evaluate a diverse set of MLLMs, from specialized Video-LLMs to versatile OmniLLMs that can process audio, covering both open-source and proprietary models. The experimental results expose the difficulties MLLMs face in comprehending humor from visual cues alone. For example, all models exhibit a marked performance drop on caption matching when moving from text-based to video-based evaluation (without audio). Our findings also show that incorporating audio aids video humor understanding, highlighting the informativeness of sound and the promise of integrating richer modalities for complex video understanding tasks.
To comprehensively evaluate the capability of MLLMs in humor understanding, we propose three tasks that reflect different aspects of humor reasoning: Caption Matching, Humor Explanation, and Open-ended QA.
In this discriminative task, models must correctly associate videos with their corresponding captions. Unlike ordinary caption matching tasks, our design challenges MLLMs to go beyond surface-level matching and assesses their ability to understand video humor as it is accentuated by creative captions. For each video with a creative caption, we randomly sample four descriptive captions from other videos as distractors, yielding a five-way multiple-choice item; a minimal construction sketch is given below.
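The following Python sketch illustrates one way such five-way items could be assembled. The field names (`video_id`, `creative_caption`, `descriptive_caption`) and the seeded sampler are illustrative assumptions, not the benchmark's actual data schema or construction pipeline.

```python
import random

def build_caption_matching_item(target, pool, num_distractors=4, seed=0):
    """Pair a video's creative caption with distractors from other videos.

    `target`: dict with keys 'video_id' and 'creative_caption'.
    `pool`: list of dicts with keys 'video_id' and 'descriptive_caption'.
    (Field names are hypothetical, for illustration only.)
    """
    rng = random.Random(seed)

    # Candidate distractors: descriptive captions taken from *other* videos.
    candidates = [v["descriptive_caption"] for v in pool
                  if v["video_id"] != target["video_id"]]
    distractors = rng.sample(candidates, num_distractors)

    # Mix the correct creative caption in with the distractors
    # to form a five-way multiple-choice item.
    options = distractors + [target["creative_caption"]]
    rng.shuffle(options)
    answer_index = options.index(target["creative_caption"])

    return {
        "video_id": target["video_id"],
        "options": options,
        "answer": answer_index,
    }
```

Drawing distractors from other videos' descriptive captions keeps the negatives fluent and plausible on the surface, so that selecting the correct creative caption requires grasping the humor of the specific video rather than matching literal visual-textual overlap.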
Our results reveal several shortcomings of MLLMs:
I. Difficulty identifying humorous elements when explicit cues are absent.
II. Inadequate integration of information across modalities for understanding.
III. Limited capacity for inferring subtle humor.
IV. Heavy reliance on linguistic cues for humor understanding.
V. Weakness in deriving the nuanced visual cues needed to understand sophisticated video humor, although incorporating audio helps.