On Lenny’s recent podcast, Fei-Fei Li called LLMs “wordsmiths in the dark”: eloquent but ungrounded in physical reality. The phrase resonated because it captures exactly what language models can’t do: understand space, navigate environments, predict physics, or reason about the 3D world we inhabit.

I’ve been following world models with growing curiosity. The contrast with LLMs is stark. Where language models learn statistical patterns from text, world models learn by watching: absorbing spatial relationships, temporal dynamics, and cause and effect from video and sensory data. They’re designed to answer the question LLMs fundamentally can’t: what happens next in physical space?
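To make that concrete, here is a minimal sketch of the interface most world models share: encode an observation, apply an action, predict what comes next. Everything in it (the class name, the dimensions, the linear layers) is illustrative only, not taken from any of the systems discussed below.

```python
# A minimal sketch of the core world-model loop: encode what the agent sees,
# apply an action, and predict the next observation. The names, sizes, and
# architecture here are illustrative, not any specific published system.
import torch
import torch.nn as nn

class TinyWorldModel(nn.Module):
    def __init__(self, obs_dim=64, action_dim=4, hidden_dim=128):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden_dim)      # observation -> latent state
        self.dynamics = nn.Sequential(                     # (latent state, action) -> next latent state
            nn.Linear(hidden_dim + action_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.decoder = nn.Linear(hidden_dim, obs_dim)      # next latent state -> predicted observation

    def forward(self, obs, action):
        latent = self.encoder(obs)
        next_latent = self.dynamics(torch.cat([latent, action], dim=-1))
        return self.decoder(next_latent)                   # "what happens next"

model = TinyWorldModel()
obs = torch.randn(1, 64)      # stand-in for an encoded video frame
action = torch.randn(1, 4)    # stand-in for an agent's action
predicted_next_obs = model(obs, action)
```

The hard part in real systems is learning the encoder and dynamics from raw video at scale, which is exactly what the projects below are racing to do.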

What’s happening now

There is a clear acceleration in 2024-2025. Google’s Genie 2 generates playable 3D worlds from a single image. NVIDIA’s Cosmos was trained on 20 million hours of real-world footage and produces physics-aware simulations that companies like Uber and XPENG are deploying.

Meta’s V-JEPA 2 learns 5-6x more efficiently by predicting abstract representations rather than raw pixels.
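The distinction is easier to see in code. Below is a deliberately simplified contrast between the two training signals: reconstructing the raw pixels of a future frame versus predicting only its abstract embedding, the idea the JEPA line of work is built around. None of this is Meta’s actual architecture or loss; flattened frames and linear layers stand in for real video encoders.

```python
# Simplified contrast between two training objectives for video prediction.
# Linear layers and flattened frames are placeholders, not Meta's actual model.
import torch
import torch.nn as nn
import torch.nn.functional as F

frame_dim, latent_dim = 3 * 32 * 32, 256
encoder = nn.Linear(frame_dim, latent_dim)        # maps a frame to an abstract representation
predictor = nn.Linear(latent_dim, latent_dim)     # predicts the *embedding* of the next frame
pixel_decoder = nn.Linear(latent_dim, frame_dim)  # alternative head: reconstruct raw pixels

frame_t = torch.randn(8, frame_dim)    # batch of current frames (flattened)
frame_t1 = torch.randn(8, frame_dim)   # batch of the frames that follow

# Pixel-space objective: the model must account for every texture and lighting detail.
pixel_loss = F.mse_loss(pixel_decoder(encoder(frame_t)), frame_t1)

# Representation-space objective (the JEPA idea): match only the abstract embedding
# of the future, ignoring pixel detail that is irrelevant to "what happens next".
with torch.no_grad():
    target = encoder(frame_t1)         # target embedding, held fixed for this step
latent_loss = F.mse_loss(predictor(encoder(frame_t)), target)
```

Predicting in representation space lets the model skip detail that doesn’t matter for the outcome, which is where the efficiency claim comes from.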

Fei-Fei Li’s World Labs just launched Marble, the first commercial world model product. The capability she’s building toward is spatial intelligence: AI that understands the physical world the way humans do.

In a recent WSJ profile, Yann LeCun, Meta’s Chief AI Scientist, says he is telling PhD candidates to focus on world models instead of LLMs. His prediction: world models could replace the LLM paradigm within 3-5 years.

What this could unlock

Autonomous vehicles are the obvious application, but I’m watching a broader pattern. Robotics companies use world models as virtual simulators, training robots in generated scenarios before deploying them in the real world. Industrial automation benefits from synthetic data generation for rare edge cases.
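The “virtual simulator” pattern is worth spelling out. In sketch form, assuming you already have a trained world model and a policy network (both toy stand-ins here), it amounts to rolling the policy forward entirely inside the model to generate experience that never touches a real robot:

```python
# Hedged sketch of training-in-simulation: roll a policy forward inside a learned
# world model to generate synthetic experience before touching real hardware.
# Both the world model and the policy below are toy stand-ins, not real systems.
import torch

def imagined_rollout(world_model, policy, start_obs, horizon=16):
    """Generate a trajectory entirely inside the world model."""
    obs, trajectory = start_obs, []
    for _ in range(horizon):
        action = policy(obs)              # the policy acts on the simulated observation
        obs = world_model(obs, action)    # the world model predicts what the robot would see next
        trajectory.append((obs, action))
    return trajectory                     # synthetic experience for training or evaluation

# Toy usage: linear stand-ins for the real components.
world_model = lambda obs, action: torch.tanh(obs + action.sum(dim=-1, keepdim=True))
policy = torch.nn.Linear(64, 4)
rollout = imagined_rollout(world_model, policy, torch.randn(1, 64), horizon=8)
```

The same loop is what makes rare edge cases tractable: instead of waiting for them to occur on the road or the factory floor, you generate them.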

The shift runs deeper. LLMs process language; world models process reality. One understands how to describe gravity; the other understands falling.

Where this seems to be headed

This feels like 2018-era LLMs: early, expensive, limited to well-funded teams. Genie 2 generates 10-60 seconds of stable video. Cosmos requires massive GPU clusters for training. And the sim-to-real gap remains a hard problem: small discrepancies between simulation and reality can cause real-world failures in safety-critical systems.

But the trajectory is visible. Google formed a new team for world simulation models. NVIDIA is open-sourcing Cosmos to accelerate work across the robotics community.

For most companies, there’s no tangible bet to make yet. This technology isn’t accessible enough for broad experimentation. But it’s worth following closely.

World models feel like they’re approaching their ChatGPT moment. GPT-3 existed for years before ChatGPT made it accessible enough to spark the LLM application wave. When world models hit that inflection point, the teams that have been tracking the space will know where to tinker first.

LLMs taught AI to speak. World models are teaching it to see.