MAPLM: A Real-World Large-Scale Vision-Language Dataset for Map and Traffic Scene Understanding
Xu Cao*, Tong Zhou*, Yunsheng Ma*, Wenqian Ye, Can Cui, Kun Tang, Zhipeng Cao, Kaizhao Liang, Ziran Wang, James M. Rehg, and Chao Zheng
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
Vision-language generative AI has shown remarkable promise for cross-modal scene understanding in autonomous driving and high-definition (HD) map systems. However, current benchmark datasets lack paired multi-modal point cloud, image, and language data. Recent approaches utilize visual instruction tuning and cross-modal prompt engineering to expand vision-language models into this domain. In this paper, we propose MAPLM, a new vision-language benchmark that can be used to finetune traffic- and HD-map-specific foundation models. Specifically, we annotate and leverage large-scale, broad-coverage traffic and map data extracted from massive HD map annotations, and use CLIP and LLaMA-2 / Vicuna to finetune a baseline model with instruction-following data. Our experimental results across various algorithms reveal that while visual instruction tuning can help large language models (LLMs) learn meaningful representations from MAPLM-QA, there remains significant room for further improvement.
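The baseline described above follows the common visual instruction-tuning recipe: features from a frozen CLIP vision encoder are projected into the token embedding space of LLaMA-2 / Vicuna, and the model is then finetuned on instruction-following data. The sketch below illustrates this general pattern only; the model checkpoints, single-linear projector, and label masking are illustrative assumptions, not the paper's exact configuration.

    import torch
    import torch.nn as nn
    from transformers import CLIPVisionModel, LlamaForCausalLM

    class VisualInstructionBaseline(nn.Module):
        """Minimal sketch: CLIP patch features -> linear projector -> LLaMA-2 embeddings."""
        def __init__(self,
                     vision_name="openai/clip-vit-large-patch14",   # assumed encoder checkpoint
                     llm_name="meta-llama/Llama-2-7b-hf"):          # assumed LLM checkpoint
            super().__init__()
            self.vision = CLIPVisionModel.from_pretrained(vision_name)
            self.llm = LlamaForCausalLM.from_pretrained(llm_name)
            # Linear layer mapping vision hidden size to the LLM embedding size.
            self.projector = nn.Linear(self.vision.config.hidden_size,
                                       self.llm.config.hidden_size)
            self.vision.requires_grad_(False)  # keep the image encoder frozen during tuning

        def forward(self, pixel_values, input_ids, labels=None):
            # Encode the image; drop the [CLS] token and keep per-patch features.
            patches = self.vision(pixel_values).last_hidden_state[:, 1:, :]
            visual_tokens = self.projector(patches)
            # Prepend visual tokens to the embedded instruction text.
            text_embeds = self.llm.get_input_embeddings()(input_ids)
            inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
            if labels is not None:
                # Ignore loss on the visual prefix; supervise only the response tokens.
                prefix = torch.full(visual_tokens.shape[:2], -100,
                                    dtype=labels.dtype, device=labels.device)
                labels = torch.cat([prefix, labels], dim=1)
            return self.llm(inputs_embeds=inputs_embeds, labels=labels)

In a typical setup of this kind, only the projector (and optionally the LLM via full finetuning or LoRA) is updated on the instruction-following pairs, while CLIP stays frozen.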