Inside an AI-First Autonomy Stack: LLM, VLM and VLA for Self-Driving Cars
Letting the AI Stack Drive: A Concept Walkthrough of LLM + VLM + VLA for Autonomous Cars
If you’ve seen my short video where I sit in the back of an autonomous car and let it drive through a futuristic city, this post is the “director’s commentary” behind it.
No, there isn’t a production robotaxi picking me up at my door (yet!).
But the stack you see in the video — LLM, VLM, VLA — is very real. It’s the direction a lot of cutting-edge research is moving toward, and it’s the space I’m currently obsessed with.
In this post I’ll walk through the idea behind that concept project: what those three acronyms actually do, how they fit together, and why I think this “AI 2.0” approach to autonomy is so exciting.
From hand-written rules to AI agents
For years, self-driving systems were mostly built from huge piles of hand-crafted logic:
- If the object is this shape and this distance → treat it as a car.
- If the light is red and the car is under X meters from the stop line → brake.
- If a pedestrian is detected in this area → yield.
That approach works up to a point, but it doesn’t scale well to the chaos of the real world. Cities don’t follow your neat decision tree.
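In code, that style of system ends up looking something like the sketch below (every threshold, field name and label here is invented for illustration):

```python
# A caricature of the rule-based approach. All thresholds and field names
# are made up; the shape is what matters.
def plan_action(scene):
    if scene["light"] == "red" and scene["dist_to_stop_line_m"] < 30:
        return "brake"
    for obj in scene["objects"]:
        if obj["type"] == "pedestrian" and obj["in_crossing_zone"]:
            return "yield"
    return "keep_lane"
```

Every new edge case, from cyclists to double-parked vans to a traffic officer waving you through, means another branch, and the tree never stops growing.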
The concept I’m exploring in my work flips this around: instead of hard-coding every rule, you train AI agents that can reason, see and plan in a more unified way.
That’s where LLM + VLM + VLA come in.
LLM: the “reasoning brain”
In the stack from my video, the LLM (Large Language Model) is the high-level thinker.
It gets a structured description of what’s happening around the car — objects, lanes, signals, predictions of what other agents might do — and turns that into intent:
- "We're on a dense urban street with parked cars on both sides."
- "The green light is about to turn yellow."
- "A pedestrian looks like they might cross."
The LLM doesn’t press the brake itself. Instead, it reasons about goals and constraints:
“We want to stay safe, obey traffic rules, and keep the ride smooth. Given this situation, slowing down and preparing to stop is a good idea.”
Think of it as the strategic brain: it talks in concepts, trade-offs and “why”.
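To make that concrete, here is a tiny sketch of what handing a structured scene to the LLM could look like. Everything in it is invented: the scene fields are placeholders, and `llm_complete` is a stub standing in for whatever model API you actually use.

```python
import json

def llm_complete(prompt: str) -> str:
    """Placeholder for whatever LLM API you actually call. It returns a
    canned answer here so the sketch runs on its own."""
    return ("Ease off and prepare to stop: the light is about to change "
            "and ped_17 may step into the road.")

# Invented example of the structured summary that perception and prediction
# might hand to the LLM.
scene = {
    "ego": {"speed_mps": 8.3, "road_type": "dense_urban", "parked_cars": "both_sides"},
    "signals": [{"type": "traffic_light", "state": "green", "est_time_to_yellow_s": 2.0}],
    "agents": [
        {"id": "ped_17", "type": "pedestrian", "position": "right_curb",
         "predicted_intent": "may_cross", "confidence": 0.6},
    ],
}

prompt = (
    "You are the high-level planner of an autonomous car.\n"
    "Goals: stay safe, obey traffic rules, keep the ride smooth.\n"
    "Given this scene, state the driving intent for the next few seconds and why.\n\n"
    f"Scene:\n{json.dumps(scene, indent=2)}"
)

print(llm_complete(prompt))
```

The important part is the shape of the output: an intent plus a reason, not a steering command.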
VLM: the eyes that actually understand the scene
If the LLM is the brain, the VLM (Vision-Language Model) is the pair of eyes that actually understand what the sensors see.
Traditional perception models classify objects and draw boxes.
A VLM goes further: it can connect language and vision in a richer way.
In an autonomy context, a VLM could help answer questions like:
- "Which of these pedestrians looks like they're about to step off the curb?"
- "Is that object ahead a plastic bag or a rock?"
- "Where exactly is the free drivable space in this messy intersection?"
Instead of manually designing hundreds of visual rules, you train the VLM to jointly learn from images, video and text. That’s what allows the rest of the stack to reason about the scene in a more human way.
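Roughly, I picture the VLM as a function from (frame, question) to a grounded, natural-language answer. The sketch below is just that shape; `vlm_answer` is a stub, not a real model API:

```python
def vlm_answer(frame, question: str) -> str:
    """Placeholder for a real vision-language model; in practice this would
    run inference on the frame. Here it just echoes the question."""
    return f"(model's grounded answer to: {question})"

def describe_scene(frame):
    questions = [
        "Which pedestrians, if any, look like they are about to step off the curb?",
        "Is the object ahead in our lane a plastic bag or something solid?",
        "Describe the free drivable space through this intersection.",
    ]
    # Each natural-language answer becomes part of the structured description
    # that gets handed up to the LLM.
    return {q: vlm_answer(frame, q) for q in questions}
```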
In the video, those glowing panels labeled VLM represent this “smart vision” layer: constantly interpreting every frame and sending a rich description to the LLM.
VLA: turning intent into motion
Finally, there’s the VLA (Vision-Language-Action model) — the part that turns all that understanding into actual motion.
If the LLM decides, “We should change lanes and prepare to turn right,” the VLA is responsible for:
- generating a safe, smooth trajectory,
- respecting the physical limits of the car,
- reacting quickly if something unexpected happens.
You can think of the VLA as the hands and reflexes of the system. It consumes both visual context and the LLM’s “plan in words” and outputs a sequence of concrete low-level actions: steering angles, speeds, accelerations.
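At the interface level, you can picture it as something like the sketch below. Every name and number is invented; the point is only the shape of the inputs and outputs:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ControlCommand:
    steering_angle_rad: float
    target_speed_mps: float
    acceleration_mps2: float

def vla_step(camera_frames, plan_in_words: str) -> List[ControlCommand]:
    """Hypothetical VLA call: visual context plus the LLM's plan in words go
    in, and a short horizon of low-level commands comes out (say, 2 s at 10 Hz).
    A real system would run a learned policy; this stub just coasts gently."""
    return [ControlCommand(steering_angle_rad=0.0,
                           target_speed_mps=5.0,
                           acceleration_mps2=-0.5) for _ in range(20)]
```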
In my concept project, this is where reinforcement-learning ideas come in: you can train agents in simulation to handle millions of scenarios that are too rare or too dangerous to repeatedly test in the real world.
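A heavily simplified view of that simulation loop might look like this. The policy, simulator and reward weights are all made up; it is only meant to show where the safety-versus-comfort trade-off lives:

```python
def train_in_simulation(policy, simulator, episodes: int = 1_000_000):
    """Sketch of a simulation training loop; 'policy' and 'simulator' are
    hypothetical objects, and the reward weights are invented."""
    for _ in range(episodes):
        obs = simulator.reset()            # sample a rare or risky scenario
        done = False
        while not done:
            action = policy.act(obs)
            obs, info, done = simulator.step(action)
            # Safety dominates the reward, but jerky motion is penalized too
            # so the ride stays comfortable.
            reward = (-100.0 * info["collision"]
                      - 1.0 * info["jerk"]
                      + 0.1 * info["progress"])
            policy.observe(action, reward, obs, done)
```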
Why I made the video
So why the cinematic clip of me sitting in the back seat while my own models drive?
Because I think it captures something important:
We’re entering an era where the interesting part of autonomous driving is less about sensors and more about cognition:
- How well can the system understand the scene?
- Can it explain its decisions in human-like terms?
- Can we train it not just to react, but to reason?
The video is a visual metaphor for where this research is heading:
a human trusting a stack of AI agents — LLM, VLM, VLA — to get her home safely.
Today, most of this work still lives in simulation, closed-track experiments and internal prototypes. But even at that stage, working with these models forces you to think differently about autonomy: less like programming a robot, more like coaching a team of specialists.
What I’m personally focusing on
My own contribution in this space is on the model side rather than the hardware:
- experimenting with ways to feed richer, more structured world models into LLMs,
- playing with how VLMs can answer “why” questions about scenes,
- and exploring how planning-style agents (VLAs) can be trained to balance safety and comfort.
I’m not claiming to have solved self-driving (I wish!).
But I’m very interested in this direction where large models, multimodal data and autonomy intersect — and that’s exactly what the video is meant to showcase.
If you came here from TikTok or Instagram and you’re working on similar problems — in autonomous driving, robotics, simulation or large-scale ML — I’d love to connect.
Until then, I’ll be in the back seat, letting the AI stack drive… at least in my experiments for now.