From Transformers to Multimodal LLMs: Enhancing Perception and Action in ADAS and Autonomous Driving
Written by Praveen Kumar Vemula, Principal Architect | Aarsha Mithra V, Software Engineer | 05 May 2025
Transformers, a ground-breaking neural network architecture born from natural language processing (NLP), have since transformed various domains, including the automotive sector. Their evolution into vision transformers has revolutionized computer vision tasks – significantly boosting perception and scene understanding in Advanced Driver Assistance Systems (ADAS) and Autonomous Driving (AD).
Recent advancements in Large Language Models (LLMs) have pushed the boundaries of reasoning capabilities. For instance, OpenAI’s o3 recently achieved a breakthrough score on the ARC-AGI benchmark, demonstrating robust chain-of-thought reasoning and problem-solving. Innovations such as Large Concept Models and Titan models are further extending these capabilities, enhancing not only natural language understanding but also enabling complex decision-making and planning. This makes them highly relevant for next-gen ADAS and AD applications.
What Are Multimodal Large Language Models?
Traditional LLMs vs. Multimodal LLMs
Traditional LLMs are designed primarily for text-based tasks – understanding and generating human language. In contrast, Multimodal Large Language Models (MMLLMs) can process and integrate inputs from multiple sensor modalities, such as camera images, audio, LiDAR point clouds, and RADAR returns. By fusing this sensor data, MMLLMs can generate meaningful outputs such as predictions, actions, or commands, resulting in a more holistic understanding of the environment.
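To make the fusion idea concrete, the minimal sketch below (in PyTorch) projects features from several modalities into a shared token space and lets a small transformer reason over the combined sequence. The module names, feature dimensions, and three-way output head are illustrative assumptions, not a production architecture.

```python
# Minimal sketch of the MMLLM fusion idea: each modality is encoded into
# tokens in a shared embedding space, then a single transformer reasons over
# the fused sequence. All names and dimensions are illustrative.
import torch
import torch.nn as nn

EMBED_DIM = 256  # shared embedding size (assumption)

class SimpleMultimodalFusion(nn.Module):
    def __init__(self):
        super().__init__()
        # Per-modality projections into the shared token space
        self.camera_proj = nn.Linear(512, EMBED_DIM)  # e.g. CNN/ViT features
        self.lidar_proj = nn.Linear(128, EMBED_DIM)   # e.g. point-cloud features
        self.text_proj = nn.Linear(300, EMBED_DIM)    # e.g. instruction embeddings
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=EMBED_DIM, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.head = nn.Linear(EMBED_DIM, 3)  # e.g. {keep lane, brake, alert driver}

    def forward(self, camera_feats, lidar_feats, text_feats):
        # Fuse by concatenating modality tokens into one sequence
        tokens = torch.cat([
            self.camera_proj(camera_feats),
            self.lidar_proj(lidar_feats),
            self.text_proj(text_feats),
        ], dim=1)
        fused = self.fusion(tokens)
        # Pool over tokens and map to a coarse prediction/action distribution
        return self.head(fused.mean(dim=1))

# Example: one sample with a handful of tokens per modality
model = SimpleMultimodalFusion()
logits = model(torch.randn(1, 16, 512), torch.randn(1, 8, 128), torch.randn(1, 4, 300))
print(logits.shape)  # torch.Size([1, 3])
```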
Vision-Language Models (VLMs)
A prominent subset of MMLLMs, Vision-Language Models (VLMs) combine visual data with textual information. Originating from early image captioning systems around 2015, VLMs have matured alongside LLMs to tackle nuanced challenges in ADAS and AD – particularly long-tail scenarios, where rare but critical events may occur.
Key Applications of Multimodal LLMs in ADAS and Autonomous Driving
MMLLMs are currently being employed across a spectrum of tasks, extending well beyond offline simulation into real-time applications.
- Dataset Generation: Automating annotation and generating synthetic data for edge cases (e.g., a pedestrian unexpectedly falling onto the road).
- Anomaly Detection: Real-time detection of unusual or dangerous events—crucial for safety.
- Scene Understanding: Improved object detection, classification, and semantic segmentation to support intelligent decision-making.
- Actionable Decision-Making: Vision-Language-Action (VLA) systems can not only understand their surroundings but also recommend or execute actions based on sensor fusion, such as braking, lane changes, or driver alerts (a simplified sketch follows this list).
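As a simplified illustration of the VLA pattern, the sketch below summarizes fused perception as text, asks a language model for an action, and constrains the reply to a fixed action vocabulary before it reaches the vehicle stack. The query_language_model function, the action set, and the scene summary are hypothetical placeholders.

```python
# Illustrative Vision-Language-Action (VLA) interface: fused perception is
# summarized as text, a language model proposes an action, and the proposal
# is validated against a known action set before being acted on.
# `query_language_model` is a hypothetical placeholder, not a real API.
from enum import Enum

class Action(Enum):
    KEEP_LANE = "keep_lane"
    BRAKE = "brake"
    CHANGE_LANE_LEFT = "change_lane_left"
    ALERT_DRIVER = "alert_driver"

def query_language_model(prompt: str) -> str:
    """Placeholder for an MMLLM call; returns a canned answer here."""
    return "brake"

def decide_action(scene_summary: str) -> Action:
    prompt = (
        "You are a driving assistant. Given the scene below, reply with one of: "
        + ", ".join(a.value for a in Action) + ".\n\nScene: " + scene_summary
    )
    raw = query_language_model(prompt).strip().lower()
    # Constrain free-form model output to a safe, known action vocabulary
    try:
        return Action(raw)
    except ValueError:
        return Action.ALERT_DRIVER  # conservative fallback

print(decide_action("Pedestrian entering the road 12 m ahead, ego speed 40 km/h."))
```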
Industry Adoption
While many MMLLMs remain in the research or offline stages, Cyient uses MMLLMs for auto annotation, dataset generation, and test scenario generation for robust training and faster time to market. Several players are pushing toward real-time integration, including:
- NVIDIA: Demonstrated a two-stage anomaly detection system using MMLLMs on its DRIVE platform. A faster module detects anomalies and queries a slower LLM for context-based control decisions, blending speed with depth of reasoning (a simplified sketch follows this list).
- Nuro: Employs MMLLMs within its LAMBDA system for enhanced scene understanding, rider interaction and explainable AI.
- Waymo: Introduced EMMA, a multimodal, end-to-end Vision-Language-Action model that consolidates perception, localization, planning and control within a single neural network.
- Li Auto: In collaboration with Tsinghua University, Li Auto developed DriveVLM, a vision-language model integrated into its AD Max platform and deployed via over-the-air (OTA) updates. The system supports real-time decision-making based on multimodal inputs.
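The two-stage pattern described in the NVIDIA example can be sketched as follows: a lightweight detector scores every frame, and only frames above a threshold are escalated to a slower language model for a context-aware decision. The threshold, function names, and canned response are illustrative assumptions, not NVIDIA's implementation.

```python
# Hedged sketch of a two-stage anomaly-handling loop: a fast detector flags
# frames, and only flagged frames pay the latency cost of the slower MMLLM.
# Thresholds, function names, and the canned reply are assumptions.
import random
import time

ANOMALY_THRESHOLD = 0.8  # assumed score above which a frame is escalated

def fast_anomaly_score(frame_id: int) -> float:
    """Stand-in for a lightweight per-frame anomaly detector."""
    return random.random()

def slow_llm_decision(frame_id: int, score: float) -> str:
    """Stand-in for the slower MMLLM that reasons about the flagged frame."""
    time.sleep(0.05)  # simulate the higher latency of the reasoning path
    return f"frame {frame_id} (score {score:.2f}): reduce speed and increase following distance"

def process_stream(num_frames: int = 20) -> None:
    for frame_id in range(num_frames):
        score = fast_anomaly_score(frame_id)
        if score > ANOMALY_THRESHOLD:
            # Only anomalous frames are escalated to the slow reasoning path
            print(slow_llm_decision(frame_id, score))

process_stream()
```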
Current Challenges in Real-Time Deployment
Despite their promise, MMLLMs face hurdles in live automotive environments:
- High Computational Requirements: These models demand extensive computing power, impacting inference speed. However, dedicated accelerators such as NVIDIA’s Jetson Thor are helping bridge this gap.
- Handling Long-Term Sequences: MMLLMs still struggle with processing long temporal sequences, which are key for navigating complex, evolving environments. New paradigms like Multimodal Visualization-of-Thought (MVoT) are emerging to improve long-term reasoning capabilities.
Conclusion
From their roots in NLP to advanced multimodal architectures, transformers have significantly enhanced the perception and scene understanding capabilities in ADAS and autonomous driving. With cutting-edge reasoning capabilities and sensor fusion, multimodal LLMs are poised to play a critical role in the next wave of ADAS and autonomous driving systems. While challenges remain in scaling these models for real-time use, ongoing advancements signal a future where perception and action converge seamlessly through intelligent, unified models.
About the Authors

Praveen Kumar Vemula
Principal Architect
With over 20 years of experience in interdisciplinary technology solutioning and complex engineering collaboration, Praveen brings deep expertise in product management, product development, and design thinking—focused on Software-Defined Everything (SDx) and digital transformation. A collaborative leader, he offers strategic guidance to business stakeholders through thought leadership, market research, and go-to-market planning for new offerings. Praveen is also a core member of Cyient’s Intelligent Product Platform (IPP) initiative, driving innovation at the intersection of software and smart engineering.

Aarsha Mithra V
Software Engineer
An aspiring software engineer, Aarsha is passionate about shaping the future of mobility through advancements in ADAS and autonomous driving technologies. With a strong commitment to continuous learning, she is always eager to take on the challenges of this dynamic and rapidly evolving field.