Previous research in human gesture recognition has largely overlooked multi-person interactions, which are crucial for understanding the social context of naturally occurring gestures. This limitation in existing datasets presents a significant challenge in aligning human gestures with other modalities like language and speech. To address this issue, we introduce SocialGesture, the first large-scale dataset specifically designed for multi-person gesture analysis. SocialGesture features a diverse range of natural scenarios and supports multiple gesture analysis tasks, including video-based recognition and temporal localization, providing a valuable resource for advancing the study of gestures in complex social interactions. Furthermore, we propose a novel visual question answering (VQA) task to benchmark vision-language models' (VLMs) performance on social gesture understanding. Our findings highlight several limitations of current gesture recognition models, offering insights into future directions for improvement in this field. SocialGesture is available at https://huggingface.co/datasets/IrohXu/SocialGesture.
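As a quick start, the dataset can be pulled from the Hugging Face Hub with the `datasets` library. The sketch below is a minimal loading example; the split name and field names are assumptions, so please consult the dataset card for the actual layout.

```python
# Minimal sketch of loading SocialGesture from the Hugging Face Hub.
# The split name ("train") and record fields are assumptions; see the
# dataset card at https://huggingface.co/datasets/IrohXu/SocialGesture.
from datasets import load_dataset

ds = load_dataset("IrohXu/SocialGesture", split="train")
print(ds)            # inspect the dataset size and available columns
print(ds[0].keys())  # e.g. video id, gesture intervals, labels (field names assumed)
```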
Example frames and comparisons with other video-based gesture datasets. SocialGesture is the only dataset featuring multi-person interactions and focusing on natural gestures with meaningful social communication.
SocialGesture is composed of diverse multi-person social interactions from YouTube and Ego4D.
Our dataset supports temporal action localization, gesture recognition, and gesture type classification. Temporal action localization requires identifying the temporal intervals of gesture instances and estimating a confidence value for each. Gesture recognition is a binary classification task that detects whether a gesture is present. Gesture type classification categorizes recognized gestures into one of the four deictic social gestures in our dataset (pointing, showing, reaching, giving).
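To make the localization setup concrete, the sketch below shows a hypothetical representation of a gesture instance and the standard temporal IoU used to match predicted intervals against ground truth. The field names and structure are illustrative, not the dataset's actual annotation schema.

```python
# Hedged sketch: a hypothetical gesture-instance record and the standard
# temporal IoU used in temporal action localization. Field names are
# illustrative, not the dataset's actual schema.
from dataclasses import dataclass

GESTURE_TYPES = ("pointing", "showing", "reaching", "giving")

@dataclass
class GestureInstance:
    start: float             # gesture onset in seconds
    end: float               # gesture offset in seconds
    gesture_type: str        # one of GESTURE_TYPES
    confidence: float = 1.0  # predicted confidence; 1.0 for ground truth

def temporal_iou(a: GestureInstance, b: GestureInstance) -> float:
    """Intersection-over-union of two temporal intervals."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0

# Example: a predicted pointing gesture vs. an annotated one.
pred = GestureInstance(2.1, 3.4, "pointing", confidence=0.87)
gt = GestureInstance(2.0, 3.5, "pointing")
print(f"tIoU = {temporal_iou(pred, gt):.2f}")  # ~0.87
```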
Demonstration of our novel social gesture visual question answering task (SocialVQA). SocialVQA consists of three subtasks: Global Perception, Gesture Understanding, and Gesture Localization. Global Perception is intended to benchmark models' basic comprehension capabilities, such as counting the number of people in each scene or providing a scene description. Gesture Understanding consists of both gesture recognition and type classification (see above). Gesture Localization uses gesture initiator and target bounding box annotations to test models' ability to localize initiators and targets, as well as determine whether the target is human. See below for evaluation metrics on these tasks.
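For intuition, the sketch below scores a Gesture Localization prediction by comparing a predicted initiator or target box against the annotated box with spatial IoU. The `[x1, y1, x2, y2]` box format and the 0.5 threshold are assumptions made for illustration, not the paper's exact evaluation protocol.

```python
# Hedged sketch of scoring Gesture Localization: a predicted initiator or
# target box counts as correct if its IoU with the annotated box exceeds a
# threshold. Box format [x1, y1, x2, y2] and the 0.5 threshold are assumptions.
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def localization_correct(pred_box, gt_box, thresh=0.5):
    return box_iou(pred_box, gt_box) >= thresh

# Example: a predicted target box vs. the annotated one.
print(localization_correct([100, 80, 220, 300], [110, 90, 230, 310]))  # True
```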
We report the performance of both open-source and closed-source SOTA VLMs on the SocialVQA task. The Gesture Localization subtasks, which involve target localization and classification, are among the most challenging. Interestingly, none of the VLMs we test exceed 70% accuracy on the simplest task of counting humans. Additionally, we fine-tune Qwen2-VL-7B both fully and with LoRA, achieving improvements over the base model on key metrics.
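The sketch below illustrates one way to attach LoRA adapters to Qwen2-VL-7B with Hugging Face `transformers` and `peft`. The checkpoint name, rank, alpha, dropout, and target modules are illustrative defaults, not the hyperparameters used in our experiments.

```python
# Hedged sketch of a LoRA fine-tuning setup for Qwen2-VL-7B using
# transformers + peft. All hyperparameters below are assumptions, not the
# settings reported in the paper.
from transformers import Qwen2VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",  # checkpoint name assumed
    torch_dtype="auto",
)

lora_config = LoraConfig(
    r=16,                        # low-rank adapter dimension (assumed)
    lora_alpha=32,               # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```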
@inproceedings{cao2025socialgesture,
title={SocialGesture: Delving into Multi-person Gesture Understanding},
author={Cao, Xu and Virupaksha, Pranav and Jia, Wenqi and Lai, Bolin and Ryan, Fiona and Lee, Sangmin and Rehg, James M},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2025}
}