Box-QAymo: Box-Referring VQA Dataset for Autonomous Driving

UQMM Lab, University of Queensland, Brisbane, Australia

Abstract

Interpretable communication is essential for safe and trustworthy autonomous driving, yet current vision-language models (VLMs) often operate under idealized assumptions and struggle to capture user intent in real-world scenarios. Existing driving-oriented VQA datasets are limited to full-scene descriptions or waypoint prediction, preventing the assessment of whether VLMs can respond to localized user-driven queries.

We introduce Box-QAymo, a box-referring dataset and benchmark designed to both evaluate and fine-tune VLMs on spatial and temporal reasoning over user-specified objects. Users express intent by drawing bounding boxes, offering a fast and intuitive interface for focused queries in complex scenes.

Specifically, we propose a hierarchical evaluation protocol that begins with binary sanity-check questions to assess basic model capacities, and progresses to (1) attribute prediction for box-referred objects, (2) motion understanding of target instances, and (3) spatiotemporal motion reasoning over inter-object dynamics across frames.

To support this, we crowd-source fine-grained object classes and visual attributes that reflect the complexity drivers encounter, and extract object trajectories to construct temporally grounded QA pairs. Rigorous quality control through negative sampling, temporal consistency checks, and difficulty-aware balancing ensures dataset robustness and diversity.

Our comprehensive evaluation reveals significant limitations in current VLMs when answering localized perception questions, highlighting the gap to reliable real-world performance. This work provides a foundation for developing more robust and interpretable autonomous driving systems that can communicate effectively with users under real-world conditions.

Dataset Statistics

  • Total Questions: 20,779
  • Binary Questions: 1,662
  • Attribute Questions: 5,403
  • Motion Questions: 13,714
  • Scenes: 202
  • Objects Annotated: 50%

Box-QAymo Dataset Pipeline

Overview of the Box-QAymo dataset pipeline for evaluating vision-language models (VLMs). Step 1 extracts 3D metadata from the Waymo dataset and enriches it with human-annotated semantics. Step 2 introduces box-referenced visual question answering (VQA) tasks spanning instance recognition, motion interpretation, and temporal trajectory reasoning. Step 3 implements rigorous quality control through negative sampling, temporal consistency filtering, and difficulty-aware balancing to ensure a robust and challenging dataset. Step 4 benchmarks general and domain-specific VLMs in zero-shot and fine-tuned settings.
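As a rough sketch of Step 3, the quality-control stage might look like the following; the field names, thresholds, and balancing heuristic are illustrative assumptions rather than the released implementation.

import random
from collections import defaultdict

# Illustrative sketch of Step 3 quality control; field names, thresholds,
# and the balancing heuristic are assumptions, not the released pipeline.
def quality_control(candidates, max_per_answer=500, min_track_frames=3, seed=0):
    rng = random.Random(seed)

    # Temporal consistency: keep questions whose referenced object is tracked
    # long enough to support a temporally grounded answer.
    consistent = [q for q in candidates if q["track_length"] >= min_track_frames]

    # Difficulty-aware balancing: cap the number of questions per answer label
    # so trivial majority answers do not dominate the split.
    by_answer = defaultdict(list)
    for q in consistent:
        by_answer[q["answer"]].append(q)
    balanced = []
    for group in by_answer.values():
        rng.shuffle(group)
        balanced.extend(group[:max_per_answer])

    # Negative sampling: pair each question with plausible distractors drawn
    # from answers observed for the same question type.
    answers_by_type = defaultdict(set)
    for q in balanced:
        answers_by_type[q["question_type"]].add(q["answer"])
    for q in balanced:
        distractors = list(answers_by_type[q["question_type"]] - {q["answer"]})
        q["options"] = [q["answer"]] + rng.sample(distractors, k=min(3, len(distractors)))
        rng.shuffle(q["options"])
    return balanced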

Question Categories

Binary Questions

Movement Status: "Are there any stationary vehicles?"

Orientation: "Are there any vehicles moving towards the camera?"

Attribute Questions

Fine-grained Classification: "What type of object is in the red box?"

Color Recognition: "What color is the object highlighted in red?"

Facing Direction: "What direction is the object in the red box facing?"

Motion Questions

Speed Assessment: "How fast is the blue sedan moving?"

Movement Direction: "What direction is the object in the red box moving?"

Relative Motion Analysis: "Is the green pickup truck traveling faster than the ego vehicle?"

Traffic Element Recognition: "Is the ego vehicle approaching a stop sign?"

Trajectory Analysis: "Are the ego vehicle and the truck on a collision course?"

Relative Motion Direction: "What is the relative motion direction of the hatchback compared to ego?"

Path Conflict Detection: "Is there a vehicle in the ego vehicle's future path?"
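To make the categories above concrete, a single box-referred QA record could be serialized roughly as follows; the field names and layout are assumptions for illustration, not the published dataset schema.

# Hypothetical layout of one box-referred QA pair; field names are
# illustrative and may differ from the released schema.
qa_pair = {
    "scene_id": "segment-XXXX",                 # Waymo segment the frame comes from
    "timestamp_micros": 1543251275738460,       # placeholder timestamp
    "camera": "FRONT",
    "box_2d": [512.0, 210.0, 640.0, 330.0],     # referring box (x1, y1, x2, y2) in pixels
    "question_type": "motion_direction",
    "question": "What direction is the object in the red box moving?",
    "options": ["towards the camera", "away from the camera",
                "left to right", "right to left"],
    "answer": "left to right",
}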

Code & Framework

Box-QAymo provides a comprehensive framework for generating, processing, and evaluating visual question answering (VQA) tasks on the Waymo dataset. The framework supports diverse question types, multiple evaluation metrics, and various answer formats.

Core Components

  • Data Processing: Waymo dataset extraction and preprocessing
  • Question Generation: Hierarchical prompt generators for different question types
  • Model Evaluation: Support for multiple VLMs and evaluation metrics
  • Answer Processing: Handles multiple choice, text, and bounding box answers
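As a sketch of how the question-generation and answer-processing components might fit together, the snippet below builds attribute questions for one annotated object; the class and field names are assumptions for illustration, not the repository's actual API.

from dataclasses import dataclass

# Illustrative only: class and field names are assumptions, not the actual
# Box-QAymo API.
@dataclass
class ObjectRecord:
    label: str          # crowd-sourced fine-grained class, e.g. "pickup truck"
    color: str          # crowd-sourced color attribute
    box_2d: tuple       # (x1, y1, x2, y2) in image coordinates

class AttributePromptGenerator:
    """Builds attribute questions for a single box-referred object."""

    def generate(self, obj: ObjectRecord):
        return [
            {"question": "What type of object is in the red box?",
             "answer": obj.label, "box_2d": obj.box_2d},
            {"question": "What color is the object highlighted in red?",
             "answer": obj.color, "box_2d": obj.box_2d},
        ]

# Usage
gen = AttributePromptGenerator()
questions = gen.generate(ObjectRecord("pickup truck", "green", (512, 210, 640, 330)))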

Supported Models

  • VLMs: LLaVA, Qwen-VL, SENNA
  • Evaluation Metrics: F1, Precision, Recall
  • Question Types: Binary, attribute, and motion reasoning
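Since results are reported as F1, precision, and recall over multiple-choice answers, a typical scoring step might look like the sketch below (using scikit-learn); it assumes model outputs have already been normalized to the ground-truth label strings.

from sklearn.metrics import precision_recall_fscore_support

# Sketch of metric computation; assumes predictions are already normalized
# to the same label strings as the ground-truth answers.
def score(ground_truth, predictions):
    precision, recall, f1, _ = precision_recall_fscore_support(
        ground_truth, predictions, average="macro", zero_division=0
    )
    return {"precision": precision, "recall": recall, "f1": f1}

print(score(["moving", "stationary", "moving"], ["moving", "moving", "moving"]))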

Quick Start

For detailed setup instructions, including Waymo dataset preprocessing, crowd-sourced metadata download, and model evaluation scripts, please visit our GitHub repository. The repository includes:

  • Complete installation and setup guide
  • Waymo dataset extraction scripts
  • VQA dataset generation pipeline
  • Model evaluation and comparison tools
  • Pre-trained model integration

Results

Key Findings

  • Hierarchical Complexity: Performance drops from 66.1% on binary questions to 18.3% on attribute and 37.6% on motion questions, validating our complexity assumptions.
  • Box Grounding: Red bounding boxes consistently improve performance, with Qwen-VL showing +1.39% F1 improvement on average.
  • Temporal Reasoning: Counter-intuitively, two-frame inputs degrade performance compared to single frames, suggesting current VLMs struggle with short-term temporal integration.
  • Domain Specificity: Senna's poor performance, despite being driving-specific, reveals the brittleness of narrow task training.

BibTeX

@article{etchegaray2025boxqaymo,
  title={Box-QAymo: Box-Referring VQA Dataset for Autonomous Driving},
  author={Etchegaray, Djamahl and Fu, Yuxia and Huang, Zi and Luo, Yadan},
  journal={arXiv preprint},
  year={2025}
}