ChartGaze: Enhancing Chart Understanding in LVLMs
with Eye-Tracking Guided Attention Refinement

Ali Salamatian¹, Amirhossein Abaskohi¹, Wan-Cyuan Fan¹˒², Mir Rayat Imtiaz Hossain¹˒², Leonid Sigal¹˒²˒³, Giuseppe Carenini¹
¹University of British Columbia  •  ²Vector Institute for AI  •  ³CIFAR AI Chair
EMNLP 2025
[Figure: ChartGaze teaser]

Large Vision-Language Models (LVLMs) often attend to irrelevant chart regions when answering questions. ChartGaze collects human eye-tracking data during chart reasoning and uses it to align model attention with human gaze patterns, improving both accuracy and interpretability.

Abstract

Charts are a crucial visual medium for communicating and representing information. While Large Vision-Language Models (LVLMs) have made progress on chart question answering (CQA), the task remains challenging, particularly when models attend to irrelevant regions of the chart. In this work, we present ChartGaze, a new eye-tracking dataset that captures human gaze patterns during chart reasoning tasks. Through a systematic comparison of human and model attention, we find that LVLMs often diverge from human gaze, leading to reduced interpretability and accuracy. To address this, we propose a gaze-guided attention refinement that aligns image-text attention with human fixations. Our approach improves both answer accuracy and attention alignment, yielding gains of up to 2.56 percentage points across multiple models. These results demonstrate the promise of incorporating human gaze to enhance both the reasoning quality and interpretability of chart-focused LVLMs.


ChartGaze Dataset

We introduce ChartGaze, a new eye-tracking dataset pairing chart images with human gaze recordings collected during chart question answering. Participants viewed charts on a calibrated eye-tracker setup while answering reasoning questions, yielding fine-grained fixation maps that capture where humans look when interpreting visual data.

The dataset spans multiple chart types and question categories (trend analysis, value extraction, comparison, etc.) and includes quality-controlled gaze maps paired with verified question-answer pairs.

[Figure: Eye-tracking UI setup]

Eye-tracking user interface used for gaze data collection. Participants answered questions about charts while a calibrated tracker recorded fixation patterns.

[Figure: Dataset creation pipeline]

Overview of the dataset creation pipeline: chart images are paired with generated QA pairs, and human gaze is recorded and post-processed into fixation maps.


Gaze-Guided Attention Refinement

We propose a training-time attention refinement that minimizes the divergence between a model's image-text cross-attention maps and the corresponding human gaze fixation maps. The key insight is that human gaze serves as a proxy for where relevant information lives in a chart; if a model's attention aligns with where humans look, it is more likely to focus on task-relevant content.

Key components:

  • Attention extraction: We extract cross-attention maps from early transformer layers, which carry the most spatial information.
  • Gaze supervision: Fixation maps are post-processed (Gaussian smoothed with parameter σ) to match the spatial resolution of attention maps.
  • Combined loss: The standard language modeling loss is augmented with an attention alignment loss, guiding the model without sacrificing language generation quality.

[Figure: Training pipeline overview]

Overview of the gaze-guided attention refinement training. Attention maps are extracted from the LVLM and compared against human fixation maps via a KL-divergence-based loss term.
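
The combined objective described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' training code: it assumes a KL-divergence alignment term as named in the caption, and the helper names (`to_distribution`, `kl_alignment_loss`, `combined_loss`) and the weighting hyperparameter `lam` are hypothetical. In practice the same computation would be written in an autodiff framework so that gradients flow back into the attention maps.

```python
import numpy as np

def to_distribution(m, eps=1e-8):
    """Flatten a non-negative 2-D map and normalize it to sum to 1."""
    m = np.asarray(m, dtype=np.float64).ravel()
    m = np.clip(m, 0.0, None) + eps
    return m / m.sum()

def kl_alignment_loss(attention, gaze, eps=1e-8):
    """KL(gaze || attention): large when attention misses fixated regions."""
    p = to_distribution(gaze, eps)
    q = to_distribution(attention, eps)
    return float(np.sum(p * np.log(p / q)))

def combined_loss(lm_loss, attention, gaze, lam=1.0):
    """Language modeling loss plus a weighted attention alignment term.

    `lam` is a hypothetical weighting hyperparameter (not from the source).
    """
    return lm_loss + lam * kl_alignment_loss(attention, gaze)

# Toy 4x4 example: one fixated cell, one aligned and one misaligned attention map.
gaze = np.zeros((4, 4))
gaze[1, 1] = 1.0
aligned = gaze.copy()
misaligned = np.zeros((4, 4))
misaligned[3, 3] = 1.0

# The misaligned map is penalized; the aligned map adds no extra loss.
print(combined_loss(2.0, aligned, gaze) < combined_loss(2.0, misaligned, gaze))
```

With `lam = 0` the objective reduces to the language-loss-only baseline, which makes the two fine-tuning settings compared on this page easy to switch between in such a sketch.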

[Figure: Attention map comparison]

Comparison of attention maps from models trained with (right) and without (left) the gaze-guided refinement loss. Models trained with ChartGaze data attend more closely to chart-relevant regions.


Results

Our gaze-guided refinement yields consistent improvements across four LVLMs on the ChartQA benchmark. Models fine-tuned with the gaze-guided attention loss outperform those trained with the language loss alone, with gains of up to 2.56 pp in QA accuracy (InternVL2-8B: 65.36% to 67.92%). Gaze supervision also substantially improves the attention alignment metrics: correlation coefficient (CC), KL divergence (KL), and similarity (SIM). For InternVL2-8B, for example, CC rises from -0.017 to 0.417, indicating that the model learns to attend to the chart regions that matter.

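| Training | Model | Test Acc. (%) | CC ↑ | KL ↓ | SIM ↑ |
|---|---|---|---|---|---|
| Zero-shot | TinyLLaVA-450M | 46.64 | -0.078 | 1.810 | 0.267 |
| Zero-shot | InternVL2-4B | 49.86 | -0.060 | 1.722 | 0.282 |
| Zero-shot | InternVL2-8B | 50.93 | -0.054 | 1.681 | 0.296 |
| Zero-shot | ChartGemma-3B | 52.39 | 0.100 | 1.559 | 0.323 |
| Fine-tuned (language loss only) | TinyLLaVA-450M | 62.58 ± 0.27 | -0.048 ± 0.005 | 1.705 ± 0.031 | 0.288 ± 0.004 |
| Fine-tuned (language loss only) | InternVL2-4B | 63.91 ± 0.20 | -0.028 ± 0.004 | 1.532 ± 0.010 | 0.301 ± 0.004 |
| Fine-tuned (language loss only) | InternVL2-8B | 65.36 ± 0.22 | -0.017 ± 0.003 | 1.487 ± 0.009 | 0.312 ± 0.004 |
| Fine-tuned (language loss only) | ChartGemma-3B | 72.49 ± 1.69 | 0.092 ± 0.004 | 1.594 ± 0.026 | 0.316 ± 0.003 |
| Fine-tuned (gaze supervision + language loss) | TinyLLaVA-450M | 63.77 ± 0.54 | 0.391 ± 0.007 | 1.132 ± 0.015 | 0.439 ± 0.002 |
| Fine-tuned (gaze supervision + language loss) | InternVL2-4B | 65.45 ± 0.23 | 0.402 ± 0.006 | 1.072 ± 0.008 | 0.451 ± 0.004 |
| Fine-tuned (gaze supervision + language loss) | InternVL2-8B | 67.92 ± 0.15 | 0.417 ± 0.006 | 1.036 ± 0.007 | 0.468 ± 0.005 |
| Fine-tuned (gaze supervision + language loss) | ChartGemma-3B | 72.67 ± 1.24 | 0.436 ± 0.011 | 1.033 ± 0.014 | 0.452 ± 0.005 |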

Performance of models trained with and without gaze supervision. ↑ / ↓ indicates higher / lower is better. Underline = best within baseline group; bold = best overall.
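
For readers unfamiliar with the alignment metrics, the sketch below shows how CC, KL, and SIM are commonly computed between a model attention map and a human fixation map. It assumes the standard saliency-evaluation definitions (Pearson correlation, KL divergence of the gaze distribution from the attention distribution, and histogram intersection); the source does not spell out its exact formulas, so the function names and the `eps` smoothing constant are assumptions.

```python
import numpy as np

def _normalize(m, eps=1e-8):
    """Clip a map to non-negative values and scale it to sum to (almost) 1."""
    m = np.clip(np.asarray(m, dtype=np.float64), 0.0, None)
    return m / (m.sum() + eps)

def cc(attention, gaze, eps=1e-8):
    """Pearson correlation between the two maps (higher is better)."""
    a = np.asarray(attention, dtype=np.float64).ravel()
    g = np.asarray(gaze, dtype=np.float64).ravel()
    a = (a - a.mean()) / (a.std() + eps)
    g = (g - g.mean()) / (g.std() + eps)
    return float(np.mean(a * g))

def kl(attention, gaze, eps=1e-8):
    """KL divergence of the gaze map from the attention map (lower is better)."""
    p = _normalize(gaze)
    q = _normalize(attention)
    return float(np.sum(p * np.log(eps + p / (q + eps))))

def sim(attention, gaze):
    """Histogram intersection of the normalized maps (higher is better, max 1)."""
    return float(np.sum(np.minimum(_normalize(attention), _normalize(gaze))))

# Toy check: a map compared with itself, and with its inverse.
toy_gaze = np.arange(16, dtype=np.float64).reshape(4, 4)
inverse = toy_gaze.max() - toy_gaze
print(cc(toy_gaze, toy_gaze), cc(inverse, toy_gaze))
print(kl(toy_gaze, toy_gaze), sim(toy_gaze, toy_gaze))
```

Under these definitions a negative CC, as in the zero-shot rows, means the model's attention is anti-correlated with where people looked.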


Ablation Studies

We conduct four ablation studies on InternVL2-8B (unless otherwise noted) to validate design choices in ChartGaze.

Masked Inference: Does the model rely on human-attended regions?

We test whether gaze-supervised models truly depend on human-attended chart regions by blurring or masking those areas at inference. A model that merely mimics gaze patterns would not suffer; a model that genuinely uses those regions for reasoning would see large accuracy drops.

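| Training | Condition | Acc. ↑ | CC ↑ | KL ↓ | SIM ↑ |
|---|---|---|---|---|---|
| Language loss only (baseline) | Unperturbed | 65.36 | -0.017 | 1.487 | 0.312 |
| Language loss only (baseline) | Blur human gaze areas | 61.02 | -0.139 | 1.681 | 0.236 |
| Language loss only (baseline) | Mask human gaze areas | 60.14 | -0.124 | 1.713 | 0.221 |
| Language loss only (baseline) | Blur non-gaze areas | 64.10 | 0.112 | 1.392 | 0.298 |
| Language loss only (baseline) | Mask non-gaze areas | 62.85 | 0.064 | 1.456 | 0.274 |
| Gaze supervision + language loss (ours) | Unperturbed | 67.92 | 0.417 | 1.036 | 0.468 |
| Gaze supervision + language loss (ours) | Blur human gaze areas | 60.84 | -0.174 | 1.794 | 0.201 |
| Gaze supervision + language loss (ours) | Mask human gaze areas | 59.92 | -0.152 | 1.752 | 0.188 |
| Gaze supervision + language loss (ours) | Blur non-gaze areas | 66.82 | 0.284 | 1.218 | 0.395 |
| Gaze supervision + language loss (ours) | Mask non-gaze areas | 63.72 | 0.203 | 1.314 | 0.356 |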

When human-attended regions are perturbed, gaze-supervised models suffer a much larger accuracy drop (−7.08 pp with blur, −8.00 pp with mask) than language-only models (−4.34 pp, −5.22 pp). This indicates that gaze supervision induces genuine semantic reliance on human-attended regions, not just visual mimicry.
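
The perturbations used in this test can be sketched as below. This is an illustrative NumPy version for a single-channel image, not the authors' evaluation code: the threshold that turns a gaze map into a region, the box blur (standing in for whatever blur the authors used), and all function names are assumptions.

```python
import numpy as np

def gaze_region(gaze, threshold=0.5):
    """Boolean mask of pixels whose peak-normalized gaze density is >= threshold."""
    g = np.asarray(gaze, dtype=np.float64)
    peak = g.max()
    if peak > 0:
        g = g / peak
    return g >= threshold

def mask_region(image, region, fill=0.0):
    """Replace pixels inside `region` with a constant fill value."""
    out = np.array(image, dtype=np.float64, copy=True)
    out[region] = fill
    return out

def box_blur(image, radius=1):
    """Box blur of a 2-D image with edge padding (a simple stand-in blur)."""
    img = np.asarray(image, dtype=np.float64)
    k = 2 * radius + 1
    padded = np.pad(img, radius, mode="edge")
    out = np.zeros_like(img)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def blur_region(image, region, radius=1):
    """Blur only the pixels inside `region`, leaving the rest untouched."""
    out = np.array(image, dtype=np.float64, copy=True)
    out[region] = box_blur(image, radius)[region]
    return out

# Toy example: 4x4 "chart" with a single fixated pixel.
chart = np.arange(16, dtype=np.float64).reshape(4, 4)
fix = np.zeros((4, 4))
fix[1, 1] = 1.0
region = gaze_region(fix)

masked_gaze = mask_region(chart, region)       # "Mask human gaze areas"
masked_rest = mask_region(chart, ~region)      # "Mask non-gaze areas"
blurred_gaze = blur_region(chart, region)      # "Blur human gaze areas"
```

Passing `~region` instead of `region` gives the non-gaze conditions, so all four perturbed rows of the table come from the same two functions.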

Loss Function Comparison

We compare four loss functions for the attention alignment objective on TinyLLaVA-450M: weighted MSE (W-MSE), KL Divergence, Focal Loss, and Dice + BCE.

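| Loss Function | Test Acc. ↑ | CC ↑ | KL ↓ | SIM ↑ |
|---|---|---|---|---|
| W-MSE (ours) | 64.53 | 0.386 | 1.145 | 0.438 |
| KL Divergence | 62.36 | 0.306 | 1.209 | 0.380 |
| Focal Loss | 61.06 | 0.339 | 1.188 | 0.388 |
| Dice + BCE | 60.41 | 0.194 | 4.174 | 0.183 |
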
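
To make the weighted MSE idea concrete, here is one possible formulation. The page does not state how the weights are defined, so the scheme below (per-pixel weight `1 + alpha * gaze`, which up-weights errors on fixated pixels) and the parameter `alpha` are purely illustrative assumptions, not the authors' definition.

```python
import numpy as np

def weighted_mse(attention, gaze, alpha=4.0):
    """Weighted mean squared error between attention and gaze maps.

    Assumed weighting: errors on fixated pixels count (1 + alpha * gaze)
    times as much as errors on unfixated pixels.
    """
    a = np.asarray(attention, dtype=np.float64)
    g = np.asarray(gaze, dtype=np.float64)
    weights = 1.0 + alpha * g
    return float(np.sum(weights * (a - g) ** 2) / np.sum(weights))

# Toy 4x4 example with one fixated pixel.
target = np.zeros((4, 4))
target[0, 0] = 1.0
missed = np.zeros((4, 4))          # attention ignores the fixated pixel
spurious = target.copy()
spurious[1, 1] = 1.0               # attention adds one unfixated pixel

print(weighted_mse(missed, target), weighted_mse(spurious, target))
```

Under this assumed weighting, missing a fixated pixel costs more than attending to one extra unfixated pixel, which is the asymmetry a weighted loss is typically meant to encode.
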
Effect of Dataset Size

We evaluate gaze supervision in low-data settings, training InternVL2-8B on 25%, 50%, and 100% of ChartGaze with three seeds per condition. The accuracy gain from gaze supervision grows as the training set shrinks, demonstrating the value of attention supervision in low-resource settings.

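| Training Setup | Data | Test Acc. ↑ | CC ↑ | KL ↓ | SIM ↑ |
|---|---|---|---|---|---|
| Without Attention Supervision | 25% | 60.21 ± 0.73 | -0.045 ± 0.005 | 1.602 ± 0.012 | 0.274 ± 0.006 |
| Without Attention Supervision | 50% | 63.58 ± 0.24 | -0.028 ± 0.004 | 1.530 ± 0.010 | 0.295 ± 0.005 |
| Without Attention Supervision | 100% | 65.36 ± 0.22 | -0.017 ± 0.003 | 1.487 ± 0.009 | 0.312 ± 0.004 |
| With Attention Supervision | 25% | 64.07 ± 0.26 | 0.297 ± 0.008 | 1.174 ± 0.011 | 0.402 ± 0.006 |
| With Attention Supervision | 50% | 66.51 ± 0.20 | 0.396 ± 0.007 | 1.065 ± 0.009 | 0.454 ± 0.005 |
| With Attention Supervision | 100% | 67.92 ± 0.15 | 0.417 ± 0.006 | 1.036 ± 0.007 | 0.468 ± 0.005 |
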
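
The size of the gain at each data fraction follows directly from the mean accuracies in the table. The short calculation below uses only those reported means; the variable names are illustrative.

```python
# Mean test accuracy (%) by fraction of ChartGaze used for training.
acc_without = {25: 60.21, 50: 63.58, 100: 65.36}
acc_with = {25: 64.07, 50: 66.51, 100: 67.92}

# Gain from attention supervision, in percentage points.
gains = {pct: round(acc_with[pct] - acc_without[pct], 2) for pct in acc_without}
print(gains)  # {25: 3.86, 50: 2.93, 100: 2.56}
```

The gain shrinks from 3.86 pp at 25% of the data to 2.56 pp at 100%.
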
Gaze Map Post-Processing: σ Sensitivity

We apply a Gaussian filter with standard deviation σ to gaze fixation maps before computing the attention loss, and train TinyLLaVA-450M with three σ values (20, 40, and 80). σ = 40 achieves the best accuracy. σ = 80 yields the highest alignment scores but the lowest accuracy: the broader target inflates the alignment metrics by spreading attention too widely.

| Fixation σ | Test Acc. ↑ | CC ↑ | KL ↓ | SIM ↑ |
|---|---|---|---|---|
| 20 | 62.89 | 0.277 | 1.875 | 0.272 |
| 40 (ours) | 63.77 | 0.391 | 1.132 | 0.439 |
| 80 | 61.49 | 0.490 | 0.588 | 0.612 |
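
The effect of σ is easier to see with a small sketch of the post-processing step. The code below rasterizes fixation points, blurs them with a separable Gaussian, and average-pools the result to an attention-map grid. It is an assumption-laden illustration: σ is taken to be in pixels, the kernel is truncated at 3σ, the pooling step is one plausible way to match attention resolution, and all function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1-D Gaussian kernel truncated at 3 sigma and normalized to sum to 1."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def smooth_fixations(points, height, width, sigma):
    """Rasterize (x, y) fixations and blur them with a separable Gaussian."""
    fmap = np.zeros((height, width), dtype=np.float64)
    for x, y in points:
        fmap[int(y), int(x)] += 1.0
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    # Zero-pad so that a 'valid' convolution returns the original size.
    padded = np.pad(fmap, r, mode="constant")
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def downsample(fmap, grid):
    """Average-pool to a grid x grid map (height and width must divide evenly)."""
    h, w = fmap.shape
    return fmap.reshape(grid, h // grid, grid, w // grid).mean(axis=(1, 3))

# One fixation at the centre of a 16x16 image, smoothed at two scales.
narrow = smooth_fixations([(8, 8)], 16, 16, sigma=2.0)
wide = smooth_fixations([(8, 8)], 16, 16, sigma=4.0)
target = downsample(narrow, 4)   # e.g. a 4x4 attention grid
print(narrow.max() > wide.max(), target.shape)
```

A larger σ lowers the peak and spreads mass over more of the image, so a diffuse attention map can score well against it, which is consistent with the σ = 80 row above.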

Citation

@inproceedings{salamatian-etal-2025-chartgaze,
    title = "{C}hart{G}aze: Enhancing Chart Understanding in {LVLM}s with Eye-Tracking Guided Attention Refinement",
    author = "Salamatian, Ali and Abaskohi, Amirhossein and Fan, Wan-Cyuan and Hossain, Mir Rayat Imtiaz and Sigal, Leonid and Carenini, Giuseppe",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    pages = "12093--12113",
    doi = "10.18653/v1/2025.emnlp-main.607"
}