Humans have a remarkable ability to use co-speech deictic gestures, such as pointing and showing, to enrich verbal communication and support social interaction. These gestures are so fundamental that infants begin to use them even before they acquire spoken language, which highlights their central role in human communication. Understanding the intended targets of another individual’s deictic gestures enables inference of their intentions, comprehension of their current actions, and prediction of upcoming behaviors. Despite its significance, gesture target estimation remains an underexplored task within the computer vision community. In this paper, we introduce GestureTarget, a novel task designed specifically for comprehensive evaluation of social deictic gesture semantic target estimation. To address this task, we propose TransGesture, a set of Transformer-based gesture target prediction models. Given an input image and the spatial location of a person, our models predict the intended target of their gesture within the scene. Critically, our gaze-aware joint cross-attention fusion model demonstrates how incorporating gaze-following cues significantly improves gesture target mask prediction IoU by 6% and gesture existence prediction accuracy by 10%. Our results underscore the complexity and importance of integrating gaze cues into deictic gesture intention understanding, advocating for increased research attention to this emerging area.
TransGesture Architecture Overview. We use frozen DINOv2 as a visual encoder and combine the resulting visual tokens with body and head patch encodings for the gesture and gaze decoders, respectively. We fuse the resulting gaze and gesture tokens via joint cross-attention and predict the target mask. The gesture decoder also predicts gesture existence.
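As a rough illustration of this fusion step, the PyTorch sketch below shows one way the gesture and gaze token streams could be combined via joint cross-attention before predicting the target mask and gesture existence. The module names, dimensions, and head counts are our own assumptions for exposition, not the released TransGesture implementation.

# Minimal sketch of gaze-gesture joint cross-attention fusion.
# Module names, dimensions, and heads are illustrative assumptions,
# not the released TransGesture implementation.
import torch
import torch.nn as nn

class JointCrossAttentionFusion(nn.Module):
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        # Gesture tokens attend to gaze tokens, and vice versa.
        self.gesture_to_gaze = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gaze_to_gesture = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_gesture = nn.LayerNorm(dim)
        self.norm_gaze = nn.LayerNorm(dim)
        self.mask_head = nn.Linear(dim, 1)    # per-token target-mask logit
        self.exist_head = nn.Linear(dim, 1)   # scene-level gesture-existence logit

    def forward(self, gesture_tokens, gaze_tokens):
        # gesture_tokens, gaze_tokens: (B, N, dim) outputs of the two decoders.
        g2z, _ = self.gesture_to_gaze(gesture_tokens, gaze_tokens, gaze_tokens)
        z2g, _ = self.gaze_to_gesture(gaze_tokens, gesture_tokens, gesture_tokens)
        fused = self.norm_gesture(gesture_tokens + g2z) + self.norm_gaze(gaze_tokens + z2g)
        mask_logits = self.mask_head(fused).squeeze(-1)    # (B, N)
        exist_logit = self.exist_head(fused.mean(dim=1))   # (B, 1)
        return mask_logits, exist_logit

In practice, the per-token mask logits would be reshaped to the visual encoder's patch grid and upsampled to image resolution to form the final target mask.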
Qualitative examples of gesture target estimation under different fusion strategies. Green bounding boxes indicate the gesture initiator, and red masks show the predicted target person.
We conduct comprehensive evaluations of gesture existence prediction accuracy and target mask prediction IoU, comparing different fusion strategies and visual encoders, and include human baselines for context.
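For reference, the sketch below shows one way these two metrics can be computed; the thresholds and tensor shapes are illustrative assumptions rather than the exact benchmark protocol.

# Illustrative metric computation for gesture target estimation
# (thresholds and shapes are assumptions, not the benchmark's exact protocol).
import torch

def mask_iou(pred_mask, gt_mask, thresh=0.5):
    # IoU between a predicted probability mask and a binary ground-truth mask.
    pred = (pred_mask > thresh).float()
    gt = (gt_mask > 0.5).float()
    inter = (pred * gt).sum()
    union = pred.sum() + gt.sum() - inter
    return inter / union.clamp(min=1.0)

def existence_accuracy(pred_logits, gt_labels):
    # Binary accuracy for the gesture-existence prediction.
    pred = (pred_logits.sigmoid() > 0.5).long().squeeze(-1)
    return (pred == gt_labels.long()).float().mean()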
Influence of CLIP/SigLIP-based visual encoders and DINOv2 on gesture target estimation under different gaze and gesture feature fusion strategies. With all four visual encoders, our proposed Gesture-Gaze Joint Cross-Attention strategy performs best, underscoring the importance of gaze cues. For the human baselines, we report both average and maximum human performance.
We qualitatively compare token-affinity map visualizations for CLIP, SigLIP, SigLIP2, and DINOv2.
DINOv2 achieves the strongest performance on our task: it preserves notably stronger intra-instance token affinity, yielding more coherent and spatially structured representations. In contrast, CLIP- and SigLIP-based models exhibit weaker spatial coherence, indicating limitations in modeling spatial relationships in complex social scenes involving multiple humans.
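Affinity maps of this kind can be produced directly from a backbone's patch tokens. The sketch below computes a cosine-similarity affinity map for a single query patch; the DINOv2 torch.hub usage in the comments is an assumed example that requires downloading the model.

# Sketch: cosine-similarity token-affinity map for one query patch,
# given patch tokens from any ViT backbone (e.g., DINOv2, CLIP, SigLIP).
import torch
import torch.nn.functional as F

def token_affinity_map(patch_tokens, query_idx, grid_hw):
    # patch_tokens: (N, D) patch embeddings; grid_hw: (H, W) with H * W == N.
    tokens = F.normalize(patch_tokens, dim=-1)
    affinity = tokens @ tokens[query_idx]   # cosine similarity to the query patch
    h, w = grid_hw
    return affinity.reshape(h, w)

# Assumed DINOv2 usage via torch.hub (commented out; needs network access):
# model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')
# feats = model.forward_features(img)['x_norm_patchtokens'][0]   # (N, D)
# amap = token_affinity_map(feats, query_idx=100, grid_hw=(16, 16))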
We additionally run ablations on the scale of the visual encoders and on the impact of freezing different parts of our architecture.
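A minimal example of the freezing setup used in such ablations; the attribute name visual_encoder and the optimizer settings are placeholders for illustration, not our exact training configuration.

# Illustrative: freeze the visual encoder while training the decoders and
# fusion module (attribute names and hyperparameters are assumptions).
import torch

def freeze_module(module):
    for param in module.parameters():
        param.requires_grad = False
    module.eval()  # also fix normalization / dropout behavior

# freeze_module(model.visual_encoder)   # e.g., frozen DINOv2 backbone
# trainable = [p for p in model.parameters() if p.requires_grad]
# optimizer = torch.optim.AdamW(trainable, lr=1e-4)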
@inproceedings{cao2025toward,
title={Toward Human Deictic Gesture Target Estimation},
author={Cao, Xu and Virupaksha, Pranav and Lee, Sangmin and Lai, Bolin and Jia, Wenqi and Chen, Jintai and Rehg, James Matthew},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
year={2025}
}