Box-QAymo: Box-Referring VQA Dataset for Autonomous Driving

UQMM Lab, University of Queensland, Brisbane, Australia

Abstract

Interpretable communication is essential for safe and trustworthy autonomous driving, yet current vision-language models (VLMs) often operate under idealized assumptions and struggle to capture user intent in real-world scenarios. Existing driving-oriented VQA datasets are limited to full-scene descriptions or waypoint prediction, preventing the assessment of whether VLMs can respond to localized user-driven queries.

We introduce Box-QAymo, a box-referring dataset and benchmark designed to both evaluate and fine-tune VLMs on spatial and temporal reasoning over user-specified objects. Users express intent by drawing bounding boxes, offering a fast and intuitive interface for focused queries in complex scenes.

Specifically, we propose a hierarchical evaluation protocol that begins with binary sanity-check questions to assess basic model capacities, and progresses to (1) attribute prediction for box-referred objects, (2) motion understanding of target instances, and (3) spatiotemporal motion reasoning over inter-object dynamics across frames.

To support this, we crowd-source fine-grained object classes and visual attributes that reflect the complexity drivers encounter, and extract object trajectories to construct temporally grounded QA pairs. Rigorous quality control through negative sampling, temporal consistency checks, and difficulty-aware balancing ensures dataset robustness and diversity.

Our comprehensive evaluation reveals significant limitations in current VLMs when answering localized perception questions, highlighting the gap to reliable real-world performance. This work provides a foundation for developing more robust and interpretable autonomous driving systems that can communicate effectively with users under real-world conditions.

Dataset Statistics

  • Total Questions: 20,779
  • Binary Questions: 1,662
  • Attribute Questions: 5,403
  • Motion Questions: 13,714
  • Scenes: 202
  • Objects Annotated: 50%

Box-QAymo Dataset Pipeline

Overview of the Box-QAymo dataset pipeline for evaluating vision-language models (VLMs). Step 1 extracts 3D metadata from the Waymo dataset and enriches it with human-annotated semantics. Step 2 introduces box-referenced visual question answering (VQA) tasks spanning instance recognition, motion interpretation, and temporal trajectory reasoning. Step 3 implements rigorous quality control through negative sampling, temporal consistency filtering, and difficulty-aware balancing to ensure a robust and challenging dataset. Step 4 benchmarks general and domain-specific VLMs in zero-shot and fine-tuned settings.
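As a rough sketch of Step 3, the quality-control stage might look like the following; the field names, thresholds, and balancing heuristic are illustrative assumptions rather than the released implementation.

import random
from collections import defaultdict

# Illustrative sketch of Step 3 quality control; field names, thresholds,
# and the balancing heuristic are assumptions, not the released pipeline.
def quality_control(candidates, max_per_answer=500, min_track_frames=3, seed=0):
    rng = random.Random(seed)

    # Temporal consistency: keep questions whose referenced object is tracked
    # long enough to support a temporally grounded answer.
    consistent = [q for q in candidates if q["track_length"] >= min_track_frames]

    # Difficulty-aware balancing: cap the number of questions per answer label
    # so trivial majority answers do not dominate the split.
    by_answer = defaultdict(list)
    for q in consistent:
        by_answer[q["answer"]].append(q)
    balanced = []
    for group in by_answer.values():
        rng.shuffle(group)
        balanced.extend(group[:max_per_answer])

    # Negative sampling: pair each question with plausible distractors drawn
    # from answers observed for the same question type.
    answers_by_type = defaultdict(set)
    for q in balanced:
        answers_by_type[q["question_type"]].add(q["answer"])
    for q in balanced:
        distractors = list(answers_by_type[q["question_type"]] - {q["answer"]})
        q["options"] = [q["answer"]] + rng.sample(distractors, k=min(3, len(distractors)))
        rng.shuffle(q["options"])
    return balanced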

Question Categories

Binary Questions

Movement Status: "Are there any stationary vehicles?"

Orientation: "Are there any vehicles moving towards the camera?"

Attribute Questions

Fine-grained Classification: "What type of object is in the red box?"

Color Recognition: "What color is the object highlighted in red?"

Facing Direction: "What direction is the object in the red box facing?"

Motion Questions

Speed Assessment: "How fast is the blue sedan moving?"

Movement Direction: "What direction is the object in the red box moving?"

Relative Motion Analysis: "Is the green pickup truck traveling faster than the ego vehicle?"

Traffic Element Recognition: "Is the ego vehicle approaching a stop sign?"

Trajectory Analysis: "Are the ego vehicle and the truck on a collision course?"

Relative Motion Direction: "What is the relative motion direction of the hatchback compared to ego?"

Path Conflict Detection: "Is there a vehicle in the ego vehicle's future path?"
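To make the categories above concrete, a single box-referred QA record could be serialized roughly as follows; the field names and layout are assumptions for illustration, not the published dataset schema.

# Hypothetical layout of one box-referred QA pair; field names are
# illustrative and may differ from the released schema.
qa_pair = {
    "scene_id": "segment-XXXX",                 # Waymo segment the frame comes from
    "timestamp_micros": 1543251275738460,       # placeholder timestamp
    "camera": "FRONT",
    "box_2d": [512.0, 210.0, 640.0, 330.0],     # referring box (x1, y1, x2, y2) in pixels
    "question_type": "motion_direction",
    "question": "What direction is the object in the red box moving?",
    "options": ["towards the camera", "away from the camera",
                "left to right", "right to left"],
    "answer": "left to right",
}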

Code & Framework

Box-QAymo provides a comprehensive framework for generating, processing, and evaluating visual question answering (VQA) tasks on the Waymo dataset. The framework supports diverse question types, multiple evaluation metrics, and various answer formats.

Core Components

  • Data Processing: Waymo dataset extraction and preprocessing
  • Question Generation: Hierarchical prompt generators for different question types
  • Model Evaluation: Support for multiple VLMs and evaluation metrics
  • Answer Processing: Handles multiple choice, text, and bounding box answers
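As a sketch of how the question-generation and answer-processing components might fit together, the snippet below builds attribute questions for one annotated object; the class and field names are assumptions for illustration, not the repository's actual API.

from dataclasses import dataclass

# Illustrative only: class and field names are assumptions, not the actual
# Box-QAymo API.
@dataclass
class ObjectRecord:
    label: str          # crowd-sourced fine-grained class, e.g. "pickup truck"
    color: str          # crowd-sourced color attribute
    box_2d: tuple       # (x1, y1, x2, y2) in image coordinates

class AttributePromptGenerator:
    """Builds attribute questions for a single box-referred object."""

    def generate(self, obj: ObjectRecord):
        return [
            {"question": "What type of object is in the red box?",
             "answer": obj.label, "box_2d": obj.box_2d},
            {"question": "What color is the object highlighted in red?",
             "answer": obj.color, "box_2d": obj.box_2d},
        ]

# Usage
gen = AttributePromptGenerator()
questions = gen.generate(ObjectRecord("pickup truck", "green", (512, 210, 640, 330)))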

Supported Models

  • VLMs: LLaVA, Qwen-VL, SENNA
  • Evaluation Metrics: F1, Precision, Recall
  • Question Types: Binary, attribute, and motion reasoning
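Since results are reported as F1, precision, and recall over multiple-choice answers, a typical scoring step might look like the sketch below (using scikit-learn); it assumes model outputs have already been normalized to the ground-truth label strings.

from sklearn.metrics import precision_recall_fscore_support

# Sketch of metric computation; assumes predictions are already normalized
# to the same label strings as the ground-truth answers.
def score(ground_truth, predictions):
    precision, recall, f1, _ = precision_recall_fscore_support(
        ground_truth, predictions, average="macro", zero_division=0
    )
    return {"precision": precision, "recall": recall, "f1": f1}

print(score(["moving", "stationary", "moving"], ["moving", "moving", "moving"]))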

Quick Start

For detailed setup instructions, including Waymo dataset preprocessing, crowd-sourced metadata download, and model evaluation scripts, please visit our GitHub repository. The repository includes:

  • Complete installation and setup guide
  • Waymo dataset extraction scripts
  • VQA dataset generation pipeline
  • Model evaluation and comparison tools
  • Pre-trained model integration

Results

Key Findings

  • Hierarchical Complexity: Performance drops from 66.1% on binary questions to 18.3% on attribute and 37.6% on motion questions, validating our complexity assumptions.
  • Box Grounding: Red bounding boxes consistently improve performance, with Qwen-VL showing +1.39% F1 improvement on average.
  • Temporal Reasoning: Counter-intuitively, two-frame inputs degrade performance compared to single frames, suggesting current VLMs struggle with short-term temporal integration.
  • Domain Specificity: Senna's poor performance, despite being driving-specific, reveals the brittleness of narrow task training.

BibTeX

@article{etchegaray2025boxqaymo,
  title={Box-QAymo: Box-Referring VQA Dataset for Autonomous Driving},
  author={Etchegaray, Djamahl and Fu, Yuxia and Huang, Zi and Luo, Yadan},
  journal={arXiv preprint},
  year={2025}
}