AI Pioneer Fei-Fei Li Unveils Real-Time Generative 'World Model' Capable of ...

2025-10-17 11:09:10 Source: TMTPost App

Fei-Fei Li, Co-founder and CEO of World Labs (Image source: Bloomberg)

TMTPOST -- Fei-Fei Li, the Stanford University computer science professor often hailed as the “Godmother of AI,” has introduced a breakthrough generative model that could redefine how artificial intelligence understands and recreates the physical world.

Li’s startup, World Labs, announced on Oct. 17 the launch of its Real-Time Frame Model (RTFM), a highly efficient autoregressive diffusion Transformer trained end-to-end on massive video datasets. The model’s key innovation is its ability to generate realistic 2D images from new viewpoints using only one or a few input images, without relying on traditional 3D representations.

Within the industry, RTFM is being described as “AI that has learned to render.” The system can simulate physical phenomena such as 3D geometry, reflections, and shadows, and can even reconstruct real-world environments from limited photo data.

According to Li, RTFM can generate persistent, 3D-consistent scenes in real time using a single NVIDIA H100 GPU, paving the way for interactive experiences in both real and imagined virtual spaces.

“Elegant, scalable approaches will ultimately prevail in AI,” Li’s team wrote in an accompanying article. “Generative world models are ideally positioned to benefit from the exponential decline in computing costs that has driven technological progress for decades.”

In response, former Google senior engineer Rui Diao said that RTFM effectively resolves the long-standing scalability challenges that have hindered world models.

Spatial intelligence refers to the ability of humans or machines to perceive, understand, and interact with three-dimensional space. The concept was first introduced by American psychologist Howard Gardner in his theory of multiple intelligences, where it describes the brain’s capacity to form a mental model of the external spatial world and manipulate it.

Spatial intelligence enables individuals to think in three dimensions, perceive both external and internal imagery, and recreate, transform, or modify these images. This allows people to navigate environments with ease, manipulate objects at will, and generate or interpret graphical information.

Broadly, spatial intelligence encompasses not only spatial orientation but also visual discrimination and visual reasoning. For machines, it refers to the ability to process visual data in three-dimensional space, make accurate predictions, and act upon them. This allows AI systems to operate and make decisions in complex 3D environments, overcoming the limitations of traditional 2D perception.

Fei-Fei Li has noted that visual capability sparked the Cambrian explosion, and that the evolution of the nervous system gave rise to intelligence. “We want AI that can act, not just see and speak,” she emphasizes.

With the rise of a new generation of generative AI, the combination of spatial intelligence and world models has emerged as a key pathway toward artificial general intelligence (AGI). Advanced world models can reconstruct, generate, and simulate persistent, interactive, and physically accurate environments in real time, and they are poised to transform industries ranging from software to robotics.

Li and her team consider spatial intelligence and world models essential tools for overcoming AI’s technical barriers. Compared with existing technologies, they aim to maintain world model performance while reducing GPU resource requirements and enabling real-time interactions more efficiently.

Under current video architectures, generating a 60-frame-per-second 4K interactive stream would require producing over 100,000 tokens every second, roughly the length of Frankenstein or the first Harry Potter book. Sustaining this for an hour would demand processing more than 100 million contextual tokens, a level neither feasible nor economically viable with today’s infrastructure.
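The figures above can be sanity-checked with a few lines of arithmetic. This is a rough sketch that treats the quoted 100,000 tokens per second as a lower bound and assumes tokens are spread evenly across frames; the constant and function names are illustrative and do not come from World Labs.

```python
# Back-of-the-envelope check of the token figures quoted in the article.
TOKENS_PER_SECOND = 100_000   # lower bound quoted for a 60 fps 4K stream
FRAMES_PER_SECOND = 60
SECONDS_PER_HOUR = 60 * 60


def tokens_for(seconds: int, rate: int = TOKENS_PER_SECOND) -> int:
    """Tokens produced over `seconds` at a constant generation rate."""
    return rate * seconds


tokens_per_hour = tokens_for(SECONDS_PER_HOUR)
# Assumption: tokens are distributed evenly across the 60 frames in a second.
tokens_per_frame = TOKENS_PER_SECOND / FRAMES_PER_SECOND

print(f"{tokens_per_hour:,} tokens per hour")
print(f"{tokens_per_frame:,.0f} tokens per frame on average")
```

At the lower-bound rate, one hour works out to 360 million tokens, which is consistent with the article’s “more than 100 million” figure.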

To address this, World Labs, which Li founded in 2024 alongside scholars Ben Mildenhall, Justin Johnson, and Christoph Lassner, developed RTFM, which delivers three core advantages: efficiency, scalability, and persistence.

Efficiency is demonstrated by the fact that a single NVIDIA H100 GPU can run inference at interactive frame rates. Scalability comes from the end-to-end architecture, which can keep improving as data and computational power grow. Persistence is ensured through pose-aware frame-space memory and context scheduling, allowing world scenes to “never fade away” and enabling long-term, consistent interactions in simulated environments.
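The idea behind pose-aware memory can be illustrated with a toy example. The sketch below is not World Labs’ implementation, which the article does not describe in detail. It only assumes that each stored frame is tagged with a camera pose and that generation draws its context from the frames nearest the current viewpoint. The `Pose`, `FrameMemory`, and `pose_distance` names, the simplified yaw-only orientation, and the `yaw_weight` value are all hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Pose:
    """Camera pose, simplified to a 3D position plus a heading (yaw) in degrees."""
    x: float
    y: float
    z: float
    yaw_deg: float


def pose_distance(a: Pose, b: Pose, yaw_weight: float = 0.01) -> float:
    """Positional distance plus a small penalty for facing a different direction."""
    position_gap = math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))
    # Wrap the heading difference into the range [0, 180] degrees.
    yaw_gap = abs((a.yaw_deg - b.yaw_deg + 180.0) % 360.0 - 180.0)
    return position_gap + yaw_weight * yaw_gap


@dataclass
class FrameMemory:
    """Keeps every frame with its pose; nothing is evicted, so the scene persists."""
    entries: list[tuple[Pose, str]] = field(default_factory=list)

    def add(self, pose: Pose, frame_id: str) -> None:
        self.entries.append((pose, frame_id))

    def context_for(self, query: Pose, k: int = 2) -> list[str]:
        """Return the ids of the k stored frames closest to the query pose."""
        ranked = sorted(self.entries, key=lambda e: pose_distance(e[0], query))
        return [frame_id for _, frame_id in ranked[:k]]


memory = FrameMemory()
memory.add(Pose(0.0, 0.0, 0.0, 0.0), "kitchen")
memory.add(Pose(5.0, 0.0, 0.0, 0.0), "hall")
memory.add(Pose(20.0, 0.0, 0.0, 90.0), "garden")

# Only frames near the current viewpoint are pulled into the context.
context = memory.context_for(Pose(4.0, 0.0, 0.0, 0.0), k=2)
print(context)  # ['hall', 'kitchen']
```

Because the context size is fixed at `k` frames regardless of how many frames are stored, the per-frame cost in this toy version does not grow with the length of the session, which is the property the article attributes to context scheduling.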

In September 2024, World Labs announced it had raised $230 million in funding, led by a16z, NEA, and Radical Ventures. The round also included Shinrai Investments LLC and the venture arms of AMD, Adobe, Databricks, and NVIDIA, the chipmaker led by CEO Jensen Huang.

The company employs around 24 people, including its four co-founders, and roughly one-third of the team is of Chinese descent. Public reports indicate that World Labs reached a valuation of $1 billion just three months after its founding.

Looking ahead, investors say Fei-Fei Li’s team will first develop a large world model (LWM) for spatial intelligence, designed to deeply understand three-dimensional, physical, spatial, and temporal concepts. The model is expected to support augmented reality applications first and then be applied to robotics, where it could improve autonomous vehicles, automated factories, and humanoid robots.

Li has stated that the team aims to launch its first product as early as 2025, while acknowledging that many challenges remain, from business models to technical boundaries. “We are still at the very beginning,” she said, “but we believe our team will overcome these challenges.”

In parallel, Li is also developing the Behavior visual challenge competition, intended to replicate the success of ImageNet, which helped catalyze the deep learning revolution and the broader AI boom. Her work on ImageNet is why Li is widely regarded as a driving force in “enabling AI to truly understand the world.”

The inspiration for Behavior arose from three major challenges in robot learning: the lack of standardized tasks, which makes comparing research difficult; the absence of a unified task framework, with many tasks being short and limited in scope; and a shortage of training data.

This October, Li officially released Behavior 1K, also known as the Behavior 1000 Challenge. It is a comprehensive simulation benchmark and training environment for embodied intelligence and robotics research, including 1,000 long-horizon tasks set in everyday household environments—real-world tasks requiring multiple steps to complete. Behavior provides an open-source training and evaluation platform, allowing researchers worldwide to train algorithms and compare results under consistent standards.

“What excites me even more is that we are at a civilizational turning point: language, spatial, visual, embodied intelligence, and other AI technologies are converging and beginning to truly transform human society,” Li said. “As long as we always keep human-centeredness at heart, these technologies can become a force for good for humanity.”

Li’s team indicated that World Labs will continue to enhance its model’s dynamic scene simulation and user interaction capabilities, and that larger-scale models are expected to deliver even stronger performance in the future.
