Toward Human Deictic Gesture Target Estimation

1University of Illinois Urbana-Champaign, 2Georgia Institute of Technology
3Korea University, 4The Hong Kong University of Science and Technology (Guangzhou)
NeurIPS 2025
TL;DR: We investigate the novel task of human deictic gesture target estimation, contributing a large-scale, domain-specific dataset and an accompanying Transformer-based architecture featuring joint cross-attention between gesture and gaze cues.
Task Overview

Abstract

Humans have a remarkable ability to use co-speech deictic gestures, such as pointing and showing, to enrich verbal communication and support social interaction. These gestures are so fundamental that infants begin to use them even before they acquire spoken language, which highlights their central role in human communication. Understanding the intended targets of another individual’s deictic gestures enables inference of their intentions, comprehension of their current actions, and prediction of upcoming behaviors. Despite its significance, gesture target estimation remains an underexplored task within the computer vision community. In this paper, we introduce GestureTarget, a novel task designed specifically for comprehensive evaluation of social deictic gesture semantic target estimation. To address this task, we propose TransGesture, a set of Transformer-based gesture target prediction models. Given an input image and the spatial location of a person, our models predict the intended target of their gesture within the scene. Critically, our gaze-aware joint cross attention fusion model demonstrates how incorporating gaze-following cues significantly improves gesture target mask prediction IoU by 6% and gesture existence prediction accuracy by 10%. Our results underscore the complexity and importance of integrating gaze cues into deictic gesture intention understanding, advocating for increased research attention to this emerging area.

✅ Contributions

  • TransGesture. A set of Transformer models that integrate human gesture and gaze social cues through a large-scale frozen visual encoder and apply a joint cross-attention fusion mechanism to accurately infer gesture targets in complex visual scenes.

  • GestureTarget. A new task and dataset for deictic gesture target estimation, containing about 20K annotated instances of pointing, reaching, showing, and giving gestures, each with a corresponding target mask annotation (an illustrative instance schema is sketched after this list).

  • Gaze Integration. Incorporating gaze target estimation as an auxiliary modality can significantly improve understanding of deictic gesture targets, highlighting gaze as a critical cue for interpreting nonverbal human communication.
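
As a concrete reference, below is a minimal Python sketch of how a single GestureTarget instance could be organized. The field names, types, and the assumed gesture label set are illustrative assumptions for exposition and do not reflect the released annotation format.

from dataclasses import dataclass
from typing import Optional
import numpy as np

# Assumed gesture label set, mirroring the four categories listed above.
GESTURE_TYPES = ("pointing", "reaching", "showing", "giving")

@dataclass
class GestureTargetInstance:
    image_path: str      # path to the scene image
    person_bbox: tuple   # (x1, y1, x2, y2) of the gesture initiator
    gesture_exists: bool # whether this person performs a deictic gesture
    gesture_type: Optional[str] = None        # one of GESTURE_TYPES when a gesture exists
    target_mask: Optional[np.ndarray] = None  # HxW binary mask of the intended target

    def __post_init__(self):
        if self.gesture_exists:
            assert self.gesture_type in GESTURE_TYPES
            assert self.target_mask is not None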

Framework Overview

TransGesture Framework

TransGesture Architecture Overview. We use frozen DINOv2 as a visual encoder and combine the resulting visual tokens with body and head patch encodings for the gesture and gaze decoders, respectively. We fuse the resulting gaze and gesture tokens via joint cross-attention and predict the target mask. The gesture decoder also predicts gesture existence.
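
To make the fusion step concrete, here is a minimal PyTorch sketch of a gesture-gaze joint cross-attention block, assuming the gesture and gaze decoders each emit a sequence of tokens of the same dimension. Module names, token counts, and dimensions are illustrative assumptions rather than the released implementation, and the target-mask and gesture-existence heads are omitted.

import torch
import torch.nn as nn

class JointCrossAttentionFusion(nn.Module):
    """Illustrative gesture-gaze joint cross-attention fusion block."""
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        # Gesture tokens attend to gaze tokens and vice versa.
        self.gesture_to_gaze = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gaze_to_gesture = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_gesture = nn.LayerNorm(dim)
        self.norm_gaze = nn.LayerNorm(dim)

    def forward(self, gesture_tokens, gaze_tokens):
        # Cross-attend in both directions, then residually update each stream.
        g_attn, _ = self.gesture_to_gaze(gesture_tokens, gaze_tokens, gaze_tokens)
        z_attn, _ = self.gaze_to_gesture(gaze_tokens, gesture_tokens, gesture_tokens)
        gesture_tokens = self.norm_gesture(gesture_tokens + g_attn)
        gaze_tokens = self.norm_gaze(gaze_tokens + z_attn)
        # The fused tokens would feed a target-mask head; a pooled gesture
        # token would feed a gesture-existence head (both omitted here).
        return gesture_tokens, gaze_tokens

# Dummy usage: batch of 2, 64 gesture tokens and 32 gaze tokens, 768-dim features.
fusion = JointCrossAttentionFusion()
gesture, gaze = fusion(torch.randn(2, 64, 768), torch.randn(2, 32, 768))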

Qualitative Results

Qualitative Evaluation of Different Fusion Strategies

Qualitative examples of gesture target estimation under different fusion strategies. Green bounding boxes indicate the gesture initiator, and red masks show the predicted target person.

Quantitative Results

We conduct comprehensive evaluations of gesture existence prediction accuracy and target mask prediction IoU. We compare different fusion strategies and visual encoders, and include human baselines for context.
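
For reference, the two quantities can be computed as sketched below: IoU between predicted and ground-truth binary target masks, and accuracy on the binary gesture-existence label. The exact thresholding and averaging used in the paper's tables are not specified in this sketch.

import numpy as np

def mask_iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """IoU between two binary HxW target masks."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return float(np.logical_and(pred, gt).sum() / union)

def existence_accuracy(pred_exists: np.ndarray, gt_exists: np.ndarray) -> float:
    """Fraction of samples with a correct gesture-existence prediction."""
    return float((pred_exists.astype(bool) == gt_exists.astype(bool)).mean())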

Comparison Results

Exploring the influence of CLIP- and SigLIP-based visual encoders versus DINOv2 on gesture target estimation under different gaze and gesture feature fusion strategies. With all four visual encoders, our proposed Gesture-Gaze Joint Cross-Attention strategy performs best, underscoring the importance of gaze cues. For the human baselines, we report both average and maximum human performance.

Additional Ablations & Results

We qualitatively compare token-affinity map visualizations for CLIP, SigLIP, SigLIP2, and DINOv2.

Token-Wise Affinity Map Comparison

DINOv2 achieves the strongest performance on our task: it preserves notably stronger intra-instance token affinity, resulting in more coherent and spatially structured representations. In contrast, CLIP- and SigLIP-based models exhibit weaker spatial coherence, indicating limitations in modeling spatial relationships in complex social scenes involving multiple humans.
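
One simple way to produce such a token-affinity map, sketched below under our own assumptions, is to take the patch tokens of a frozen encoder and compute the cosine similarity between a chosen query token and every other token; how the features are extracted and which query token is picked are visualization choices, not something fixed by the paper.

import torch
import torch.nn.functional as F

def token_affinity_map(patch_tokens: torch.Tensor, query_idx: int, grid_hw: tuple) -> torch.Tensor:
    """patch_tokens: (N, D) patch features from a frozen encoder; returns an (H, W) map."""
    tokens = F.normalize(patch_tokens, dim=-1)
    affinity = tokens @ tokens[query_idx]  # cosine similarity of every token to the query token
    return affinity.reshape(grid_hw)

# Dummy example: a 16x16 patch grid with 768-dimensional features.
dummy_tokens = torch.randn(16 * 16, 768)
affinity = token_affinity_map(dummy_tokens, query_idx=120, grid_hw=(16, 16))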

Visual Encoder Scale Ablation

Impact of Freezing Different Modules

We further ablate the scale of the visual encoder and the impact of freezing different parts of our architecture.
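
As a small illustration of the freezing ablation, the helper below toggles gradients (and normalization statistics) for a chosen module; the module names in the usage comments are hypothetical stand-ins for the encoder, decoders, and fusion block of the architecture above.

import torch.nn as nn

def set_frozen(module: nn.Module, frozen: bool) -> None:
    """Freeze or unfreeze all parameters of a module."""
    for p in module.parameters():
        p.requires_grad = not frozen
    # Keep batch/layer-norm statistics fixed while frozen.
    module.eval() if frozen else module.train()

# Hypothetical usage:
#   set_frozen(model.visual_encoder, True)    # frozen-encoder setting
#   set_frozen(model.gesture_decoder, False)  # decoder remains trainable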

BibTeX

@inproceedings{cao2025toward,
  title={Toward Human Deictic Gesture Target Estimation},
  author={Cao, Xu and Virupaksha, Pranav and Lee, Sangmin and Lai, Bolin and Jia, Wenqi and Chen, Jintai and Rehg, James Matthew},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025}
}