ACL 2026 Findings

Do MLLMs Understand Pointing? Benchmarking and Enhancing Referential Reasoning in Egocentric Vision

Chentao Li1, Zirui Gao1, Mingze Gao2, Yinglian Ren1, Jianjiang Feng1*, Jie Zhou1

1Department of Automation, Tsinghua University

2Academy of Art & Design, Tsinghua University

*Corresponding author

11,729 QA samples

Built from both large-scale simulation and real-world egocentric capture.

Basic Perception to Adversarial Resilience

Coverage spans perception, function and state, spatial context, OCR, and robustness under misleading references.

L1 to L3 deixis

From explicit pointing expressions to visually grounded but fully implicit pronouns.

Sim-to-Real benchmark design

Structured to support both evaluation and model improvement for grounded referential reasoning.

Overview + Motivation

Why egocentric pointing remains hard for current MLLMs

EgoPoint-Bench evaluates multimodal pointing reasoning in egocentric vision. It targets a core failure mode we call referential hallucination: models answer about nearby or visually salient objects instead of the true referent indicated by a first-person pointing gesture.

This failure is especially common in egocentric scenes, where hand proximity, clutter, and partial visibility make naive visual grounding unreliable. EgoPoint-Bench is built to expose that gap directly, rather than treating pointing as generic visual QA.

The benchmark combines a scalable simulation pipeline with real-world capture, covering basic perception, function and state, spatial context, OCR, and adversarial resilience.

Teaser showing referential hallucination in egocentric pointing scenarios.
In egocentric pointing scenes, current MLLMs often predict plausible but incorrect referents, focusing on nearby or salient distractors instead of the object truly indicated by the pointing gesture. This referential hallucination phenomenon motivates EgoPoint-Bench.

Pipeline

From Point-Sim supervision to real-world evaluation

Overview of the EgoPoint-Bench data construction pipeline.
EgoPoint-Bench integrates simulation, real capture, and capability-oriented annotation for grounded egocentric pointing QA.

Highlights

What the benchmark covers

Capability Taxonomy

Five evaluation dimensions: Basic Perception, Function & State, Spatial Context, OCR, and Adversarial Resilience.

Hierarchical Deixis

Three deixis levels capture the ambiguity spectrum from explicit reference to fully implicit pronouns.

Mixed Question Format

Multiple-choice, true/false, and open-ended questions balance objective benchmarking with realistic user intent.

Sim-to-Real Focus

The dataset is constructed to support training and transfer, not just static leaderboard evaluation.
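The four highlights above suggest what a single benchmark record needs to carry. The following is a minimal sketch of such a record, not the released data format: the class names, field names, and the example path and question are all hypothetical, and the meaning of the intermediate deixis level is not described on this page.

```python
from dataclasses import dataclass
from enum import Enum


class Capability(Enum):
    # The five evaluation dimensions named in the capability taxonomy.
    BASIC_PERCEPTION = "basic_perception"
    FUNCTION_STATE = "function_state"
    SPATIAL_CONTEXT = "spatial_context"
    OCR = "ocr"
    ADVERSARIAL_RESILIENCE = "adversarial_resilience"


class DeixisLevel(Enum):
    L1 = 1  # explicit pointing expression
    L2 = 2  # intermediate level (not detailed on this page)
    L3 = 3  # visually grounded but fully implicit pronoun


class QuestionFormat(Enum):
    MULTIPLE_CHOICE = "multiple_choice"
    TRUE_FALSE = "true_false"
    OPEN_ENDED = "open_ended"


class Source(Enum):
    SIMULATION = "simulation"
    REAL_WORLD = "real_world"


@dataclass(frozen=True)
class PointingQASample:
    # Hypothetical schema: field names are assumptions for illustration.
    image_path: str
    question: str
    answer: str
    capability: Capability
    deixis: DeixisLevel
    question_format: QuestionFormat
    source: Source


# Made-up example record; it does not come from the dataset.
sample = PointingQASample(
    image_path="frames/example_0001.jpg",
    question="What is this used for?",
    answer="Boiling water",
    capability=Capability.FUNCTION_STATE,
    deixis=DeixisLevel.L3,
    question_format=QuestionFormat.OPEN_ENDED,
    source=Source.SIMULATION,
)
print(sample.capability.value, sample.deixis.name, sample.source.value)
```

Keeping the capability, deixis level, question format, and source as separate fields makes it straightforward to slice results along any one of them.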

Examples

Mini dataset preview

Real-world subset

Simulation subset

Failure Mode

Referential hallucination is visually plausible but wrong

Beyond the teaser examples, the error pattern is consistent: strong multimodal models often confuse the intended referent with objects that are close to the hand, sit on cluttered shelves, or occupy visually dominant regions.

EgoPoint-Bench is designed to measure this failure directly and to support methods that improve grounded gesture reasoning under real egocentric ambiguity.
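One way to make the failure mode concrete is to count how often a model names a distractor instead of the pointed-at object. The sketch below is an illustrative assumption, not the benchmark's official metric: the `Prediction` fields, the function name, and the toy scene objects are hypothetical, and it presumes each answer can be mapped to an object label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    predicted: str          # object the model answered about
    referent: str           # object actually indicated by the pointing gesture
    distractors: frozenset  # other objects present in the scene


def referential_hallucination_rate(preds):
    """Fraction of predictions that name a distractor instead of the referent."""
    if not preds:
        return 0.0
    hallucinated = sum(
        1
        for p in preds
        if p.predicted != p.referent and p.predicted in p.distractors
    )
    return hallucinated / len(preds)


# Toy scene: the user points at a mug; a kettle and a bowl are nearby.
scene = frozenset({"kettle", "bowl"})
preds = [
    Prediction("mug", "mug", scene),      # correct referent
    Prediction("kettle", "mug", scene),   # distractor chosen
    Prediction("bowl", "mug", scene),     # distractor chosen
    Prediction("unknown", "mug", scene),  # wrong, but not a scene distractor
]
rate = referential_hallucination_rate(preds)
print(f"referential hallucination rate: {rate:.2f}")
```

Separating distractor errors from other mistakes distinguishes "plausible but wrong referent" from answers that are simply off-topic.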

Examples showing referential hallucination errors.
Illustrative failure cases where base models choose plausible distractors while fine-tuned models recover the intended referent.

Main Results

Simulation tuning improves egocentric deictic reasoning and transfers to real-world scenes

Table 2 results on simulation and real-world test sets for proprietary and open-source models.
Table 2 compares direct inference and LoRA tuning across proprietary and open-source models on both simulation and real-world test sets.

Off-the-shelf Limits

Directly prompted MLLMs remain weak on fine-grained egocentric deictic understanding, especially under ambiguity and distractors.

Large Gains from Simulation

Simulation-based LoRA tuning delivers consistent and often substantial gains over direct inference across most open-source backbones.

Effective Sim-to-Real Transfer

Performance improvements learned in simulation largely carry over to real-world data, indicating strong practical generalization.
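A comparison of this kind, direct inference versus LoRA tuning on simulation and real-world test sets, can be tallied with a small grouping helper. This is a minimal sketch under assumed inputs: the `(method, split, correct)` tuple layout and the function name are hypothetical, and the rows below are toy values, not numbers from the paper.

```python
from collections import defaultdict


def accuracy_by_group(results):
    """Map (method, split) -> accuracy from (method, split, correct) tuples."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for method, split, correct in results:
        totals[(method, split)] += 1
        hits[(method, split)] += int(correct)
    return {key: hits[key] / totals[key] for key in totals}


# Toy per-question outcomes, invented purely to exercise the helper.
results = [
    ("direct", "simulation", True),
    ("direct", "simulation", False),
    ("direct", "real", False),
    ("direct", "real", True),
    ("lora", "simulation", True),
    ("lora", "simulation", True),
    ("lora", "real", True),
    ("lora", "real", False),
]

acc = accuracy_by_group(results)
for (method, split), value in sorted(acc.items()):
    print(f"{method:>6} / {split:<10} accuracy = {value:.2f}")
```

Grouping by both method and split keeps the sim-to-real comparison explicit: the gap between a method's simulation and real-world scores is read directly from two entries of the same dictionary.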

Citation

BibTeX

@misc{li2026mllmsunderstandpointingbenchmarking,
  title={Do MLLMs Understand Pointing? Benchmarking and Enhancing Referential Reasoning in Egocentric Vision},
  author={Chentao Li and Zirui Gao and Mingze Gao and Yinglian Ren and Jianjiang Feng and Jie Zhou},
  year={2026},
  eprint={2604.21461},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2604.21461},
}