V-HUB : A VISUAL-CENTRIC HUMOR UNDERSTANDING BENCHMARK FOR VIDEO LLMS

Zhengpeng Shi1,4 *, Hengli Li2,4 *, Yanpeng Zhao4 † ✉, Jianqun Zhou3,4, Yuxuan Wang5, Qinrong Cui5, Wei Bi5, Songchun Zhu4, Bo Zhao1 ✉, Zilong Zheng4 ✉
1Shanghai Jiao Tong University; 2Peking University; 3Wuhan University; 4Beijing Institute for General Artificial Intelligence; 5Independent Researcher;

Abstract

AI models capable of comprehending humor hold real-world promise—for example, enhancing engagement in human-machine interactions. To gauge and diagnose the capacity of multimodal large language models (MLLMs) for humor understanding, we introduce v-HUB, a novel visual-centric video humor understanding benchmark. v-HUB comprises a curated collection of minimally verbal short videos, sourced from classic silent films and online resources, reflecting real-world scenarios where humor can be appreciated purely through visual cues. Each video clip is paired with rich annotations, including captions, descriptions, and explanations, supporting evaluation tasks like caption matching and humor explanation. To broaden its applicability, we further construct an open-ended video QA task, making it readily integrable into existing video understanding benchmarks. We evaluate a diverse set of MLLMs, from specialized Video-LLMs to versatile OmniLLMs that can process audio, covering both open-source and proprietary models. The experimental results expose the difficulties MLLMs face in comprehending humor from visual cues alone. For example, all models exhibit a marked performance drop on caption matching when moving from text-based to video-based evaluation (without audio). Our findings also demonstrate that incorporating audio helps with video humor understanding, highlighting the informativeness of sound and the promise of integrating richer modalities for complex video understanding tasks.


Task Definitions

To comprehensively evaluate the capability of MLLMs in humor understanding, we propose three tasks that reflect different aspects of humor reasoning: Caption Matching, Humor Explanation, and Open-ended QA.
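To make the annotation structure concrete, below is a minimal sketch of how one v-HUB record might be represented in Python. The field names (`caption`, `description`, `explanation`, `qa_pairs`) mirror the annotation types listed in the abstract, but the exact schema is an assumption, not the released format.

```python
from dataclasses import dataclass, field

@dataclass
class VHubRecord:
    """One annotated v-HUB clip (illustrative schema, not the released format)."""
    video_id: str
    video_path: str   # minimally verbal short video clip
    caption: str      # creative caption (Caption Matching target)
    description: str  # descriptive caption of the visual content
    explanation: str  # reference explanation of why the clip is funny
    qa_pairs: list[tuple[str, str]] = field(default_factory=list)  # open-ended (question, answer) pairs
```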

Caption Matching. In this discriminative task, models must associate each video with its corresponding caption. Unlike ordinary caption matching tasks, our design challenges MLLMs to go beyond surface-level matching and assesses their ability to understand video humor that is accentuated by a creative caption. For each video with a creative caption, we randomly sample four descriptive captions from other videos as distractors, yielding a five-way multiple-choice question, as sketched below.
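A minimal sketch of this distractor construction, assuming the hypothetical `VHubRecord` fields above: the target video's creative caption is mixed with four descriptive captions sampled from other videos, and the model must pick the correct option.

```python
import random

def build_caption_matching_item(records, target_idx, num_distractors=4, seed=0):
    """Build one five-way multiple-choice item for Caption Matching.

    The correct answer is the target video's creative caption; distractors
    are descriptive captions drawn from *other* videos, per the task design.
    """
    rng = random.Random(seed)
    target = records[target_idx]
    # Distractor pool: descriptive captions of every video except the target.
    pool = [r.description for i, r in enumerate(records) if i != target_idx]
    options = rng.sample(pool, num_distractors) + [target.caption]
    rng.shuffle(options)
    return {
        "video_path": target.video_path,
        "options": options,
        "answer_index": options.index(target.caption),
    }
```

Drawing distractors from descriptive captions of other clips keeps the negatives fluent but visually mismatched, so surface-level text matching alone should not suffice to solve the task.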


Comparison


Data Statistics


Data Curation Pipeline


Experiment Results


Our results reveal several shortcomings of MLLMs:
I. Struggling to identify humorous elements when explicit cues are absent.
II. Inadequate integration of information across modalities for understanding.
III. Limited capacity for inferring subtle humor.
IV. Heavy reliance on linguistic cues for humor understanding.
V. Weakness in extracting the nuanced visual cues needed to understand sophisticated video humor, although incorporating audio helps.

More Examples


License

v-HUB is released for academic research only; commercial use in any form is prohibited. The benchmark contains funny videos collected from two complementary domains (classic silent films and online resources), and the copyright of all videos remains with their original owners. If any video in v-HUB infringes your rights, please email shi_zpeng@sjtu.edu.cn and we will remove it immediately. Without prior approval, you may not distribute, publish, copy, disseminate, or modify v-HUB in whole or in part. You must strictly comply with the above restrictions.

Citation

@article{shi2025v,
  title={V-HUB: A Visual-Centric Humor Understanding Benchmark for Video LLMs},
  author={Shi, Zhengpeng and Li, Hengli and Zhao, Yanpeng and Zhou, Jianqun and Wang, Yuxuan and Cui, Qinrong and Bi, Wei and Zhu, Songchun and Zhao, Bo and Zheng, Zilong},
  journal={arXiv preprint arXiv:2509.25773},
  year={2025}
}