ACL 2026 Findings

Do MLLMs Understand Pointing? Benchmarking and Enhancing Referential Reasoning in Egocentric Vision

Chentao Li1, Zirui Gao1, Mingze Gao2, Yinglian Ren1, Jianjiang Feng1*, Jie Zhou1

1Department of Automation, Tsinghua University

2Academy of Art & Design, Tsinghua University

*Corresponding author

11,729 QA samples: built from both large-scale simulation and real-world egocentric capture.
Basic Perception to Adversarial Resilience: coverage spans perception, function and state, spatial context, OCR, and robustness under misleading references.
L1 to L3 deixis: from explicit pointing expressions to visually grounded but fully implicit pronouns.
Sim-to-Real benchmark design: structured to support both evaluation and model improvement for grounded referential reasoning.

Overview + Motivation

Why egocentric pointing remains hard for current MLLMs

EgoPoint-Bench evaluates multimodal pointing reasoning in egocentric vision. It targets a core failure mode we call referential hallucination: models answer questions about nearby or visually salient objects instead of the true referent indicated by a first-person pointing gesture.

This failure is especially common in egocentric scenes, where hand proximity, clutter, and partial visibility make naive visual grounding unreliable. EgoPoint-Bench is built to expose that gap directly, rather than treating pointing as generic visual QA.

The benchmark combines a scalable simulation pipeline with real-world capture, covering basic perception, function and state, spatial context, OCR, and adversarial resilience.
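
For concreteness, a single benchmark item might be represented as in the sketch below. The field names and values (source, capability, deixis_level, question_type, and the file path) are illustrative assumptions about the schema, not the released format.

# Illustrative sketch of one EgoPoint-Bench QA item (Python).
# All field names, values, and paths are hypothetical; the
# released schema may differ.
sample = {
    "image": "frames/kitchen_0421.jpg",   # egocentric frame (hypothetical path)
    "source": "real",                     # "real" capture or "sim" (Point-Sim)
    "capability": "Function & State",     # one of the five dimensions
    "deixis_level": 2,                    # 1 = explicit pointing ... 3 = fully implicit
    "question_type": "multiple_choice",   # or "true_false", "open_ended"
    "question": "What is the object I am pointing at used for?",
    "options": ["Grinding pepper", "Storing tea", "Pouring oil", "Timing eggs"],
    "answer": "Grinding pepper",
}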

Teaser showing referential hallucination in egocentric pointing scenarios.
In egocentric pointing scenes, current MLLMs often predict plausible but incorrect referents, focusing on nearby or salient distractors instead of the object truly indicated by the pointing gesture. This referential hallucination phenomenon motivates EgoPoint-Bench.

Pipeline

From Point-Sim supervision to real-world evaluation

Overview of the EgoPoint-Bench data construction pipeline.
EgoPoint-Bench integrates simulation, real capture, and capability-oriented annotation for grounded egocentric pointing QA.

Highlights

What the benchmark covers

Capability Taxonomy

Five evaluation dimensions: Basic Perception, Function & State, Spatial Context, OCR, and Adversarial Resilience.

Hierarchical Deixis

Three deixis levels capture the ambiguity spectrum from explicit reference to fully implicit pronouns (illustrative phrasings follow after this list).

Mixed Question Format

Multiple-choice, true/false, and open-ended questions balance objective benchmarking with realistic user intent.

Sim-to-Real Focus

The dataset is constructed to support training and transfer, not just static leaderboard evaluation.
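
To make the deixis hierarchy concrete, the snippet below phrases one hypothetical question at each level. The wording is illustrative only; the benchmark's actual phrasing conventions may differ.

# Hypothetical phrasings of the same referential question at the
# three deixis levels; wording is illustrative, not from the dataset.
deixis_examples = {
    1: "What is the red mug I am pointing at with my index finger?",  # explicit pointing expression
    2: "What is this object near my hand?",                           # deictic phrase, partially grounded
    3: "What is this used for?",  # fully implicit pronoun, resolvable only from the gesture
}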

Examples

Mini dataset preview

Real-world pointing example from EgoPoint-Bench.
Real-world sample
Simulation pointing example from EgoPoint-Bench.
Simulation sample
Another real-world pointing example from EgoPoint-Bench.
Hard visual grounding case

Failure Mode

Referential hallucination is visually plausible but wrong

Beyond the teaser examples, the error pattern is consistent: strong multimodal models often confuse the intended referent with objects close to the hand, on cluttered shelves, or in visually dominant regions.

EgoPoint-Bench is designed to measure this failure directly and to support methods that improve grounded gesture reasoning under real egocentric ambiguity.
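
A minimal scoring loop in this spirit might look like the following sketch. It assumes multiple-choice items whose records mark one option as the salient distractor; that "distractor" field is a hypothetical annotation, not a confirmed part of the benchmark, but it lets the loop report how often errors land on the plausible-but-wrong referent.

# Minimal sketch of measuring referential hallucination on
# multiple-choice items. The "distractor" field (the visually
# salient but wrong option) is a hypothetical annotation.
def score(items, predictions):
    correct = hallucinated = 0
    for item, pred in zip(items, predictions):
        if pred == item["answer"]:
            correct += 1
        elif pred == item["distractor"]:
            hallucinated += 1  # error that picks the salient distractor
    n = len(items)
    return {
        "accuracy": correct / n,
        "hallucination_rate": hallucinated / n,
    }

items = [{"answer": "Grinding pepper", "distractor": "Storing tea"}]
print(score(items, ["Storing tea"]))  # {'accuracy': 0.0, 'hallucination_rate': 1.0}

Reporting the hallucination rate separately from plain accuracy distinguishes models that miss the referent at random from models that systematically lock onto the salient distractor.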

Examples showing referential hallucination errors.
Illustrative failure cases where base models choose plausible distractors while fine-tuned models recover the intended referent.