New Breakthrough in Real-Time Video Generation: Meta StreamDiT Can Generate High-Quality Videos Frame by Frame with a Single GPU

Meta and researchers from the University of California, Berkeley have developed StreamDiT, a revolutionary AI model that can create 512p resolution videos in real-time at 16 frames per second, requiring only a single high-end GPU. Unlike previous methods that needed to fully generate a video clip before playback, StreamDiT enables real-time video stream generation frame by frame.

The StreamDiT model has 4 billion parameters and demonstrates impressive versatility. It can instantly generate videos up to one minute long, respond to interactive prompts, and even edit existing videos in real-time. In an impressive demonstration, StreamDiT successfully replaced a pig in a video with a cat in real-time while keeping the background unchanged.

Custom Architecture for Exceptional Speed

The core of the system is a custom architecture designed for speed. StreamDiT uses a moving-buffer technique that lets it work on several frames at once: while one frame is being output, the following frames are already being denoised. New frames enter the buffer fully noisy and are gradually refined until they are ready to be displayed. According to the research paper, the system generates two frames in about half a second, which after processing yields eight final images.

StreamDiT divides its buffer into fixed reference frames and short blocks. As denoising progresses, the similarity between the buffered images gradually decreases until the final video frames emerge.
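
To make the buffer scheme above concrete, here is a minimal Python sketch of the moving-buffer idea: frames sit in the buffer at staggered noise levels, every pass denoises all of them one step, the cleanest frame is emitted, and a fresh fully noisy frame takes its place. This is an illustration under assumed simplifications, not StreamDiT's implementation; `BufferedFrame`, `denoise_step`, the four-level noise schedule, and the omission of the fixed reference frames are all choices made for brevity.

```python
from collections import deque
from dataclasses import dataclass

NUM_LEVELS = 4  # noise levels per frame; a placeholder, not the paper's schedule


@dataclass
class BufferedFrame:
    index: int        # position of the frame in the output video
    noise_level: int  # NUM_LEVELS = fully noisy, 0 = ready to display


def denoise_step(frame: BufferedFrame) -> None:
    """Stand-in for one diffusion denoising pass over a single frame."""
    frame.noise_level -= 1


def stream_frames(total_frames: int, buffer_size: int = NUM_LEVELS):
    """Yield finished frames one by one while later frames are still denoising."""
    # Frames enter at staggered noise levels, so exactly one frame finishes per pass.
    buffer = deque(BufferedFrame(i, noise_level=i + 1) for i in range(buffer_size))
    next_index = buffer_size
    while buffer:
        for frame in buffer:          # one pass denoises every frame in the buffer
            denoise_step(frame)
        if buffer[0].noise_level <= 0:            # front frame is clean: emit it
            yield buffer.popleft()
            if next_index < total_frames:         # pull in a fresh, fully noisy frame
                buffer.append(BufferedFrame(next_index, noise_level=NUM_LEVELS))
                next_index += 1


if __name__ == "__main__":
    for ready in stream_frames(total_frames=8):
        print(f"displaying frame {ready.index}")
```

In a real system each denoising pass would be a forward call of the diffusion transformer over the whole buffer, which is why several frames at different noise levels can share the cost of a single pass.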

Multi-functional Training and Acceleration Techniques

To enhance the model's generality, StreamDiT's training process covered various video creation methods, using 3,000 high-quality videos and a large dataset containing 2.6 million videos. The training was conducted on 128 Nvidia H100 GPUs, and researchers found that using block sizes ranging from 1 to 16 frames yielded the best results.
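
As a rough illustration of how mixed block sizes might enter training, the sketch below samples a different partitioning of each clip per step. The loss computation, data pipeline, and the actual StreamDiT training code are omitted, and the helper names are hypothetical.

```python
import random

BLOCK_SIZES = range(1, 17)  # 1-16 frames per block, the range reported above


def split_into_blocks(frames: list, block_size: int) -> list:
    """Split a clip's frames into consecutive blocks of the chosen size."""
    return [frames[i:i + block_size] for i in range(0, len(frames), block_size)]


def training_step(clip_frames: list) -> tuple:
    """One schematic training step: pick a block size, partition, then train."""
    block_size = random.choice(list(BLOCK_SIZES))
    blocks = split_into_blocks(clip_frames, block_size)
    # ... compute the diffusion loss over this partitioning (model-specific, omitted)
    return block_size, len(blocks)


if __name__ == "__main__":
    size, count = training_step(list(range(64)))  # a dummy 64-frame clip
    print(f"sampled block size {size}, giving {count} blocks")
```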

To reach real-time performance, the team applied a distillation-based acceleration technique that cuts the number of required denoising steps from 128 to just 8 with minimal loss of image quality. StreamDiT's architecture is also optimized for efficiency: attention exchanges information only between nearby regions of the image rather than between every pair of image elements.
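
The "local regions" idea is essentially window-restricted attention. The sketch below builds a simple sliding-window attention mask in Python as one possible form of that restriction; the window size and masking style are illustrative assumptions, not the exact scheme used in the paper.

```python
import numpy as np


def local_attention_mask(num_tokens: int, window: int) -> np.ndarray:
    """Boolean mask: token i may attend to token j only if |i - j| <= window."""
    idx = np.arange(num_tokens)
    return np.abs(idx[:, None] - idx[None, :]) <= window


if __name__ == "__main__":
    # 8 tokens, each attending only to its 2 nearest neighbours on either side.
    mask = local_attention_mask(num_tokens=8, window=2)
    print(mask.astype(int))
```

Restricting attention this way makes the cost grow roughly linearly with the number of image tokens instead of quadratically, which helps keep the per-frame compute budget small enough for real-time use.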

Performance Exceeding Existing Methods

In direct comparisons, StreamDiT outperformed existing methods such as ReuseDiffuse and FIFO-Diffusion on videos with a lot of motion. Other models tend to produce largely static scenes, whereas StreamDiT generates more dynamic and natural movement.

Human evaluators rated StreamDiT on motion smoothness, completeness of animation, frame-to-frame consistency, and overall quality. On an 8-second 512p test video, StreamDiT ranked first in every category.

Potential of Larger Models and Current Limitations

The research team also tried a larger 30 billion parameter model, which provided higher video quality, although its speed was still insufficient for real-time use. This suggests that StreamDiT's approach can be scaled to larger systems, indicating the potential for future high-quality real-time video generation.

Despite significant progress, StreamDiT still has some limitations. For example, it has limited "memory" of the first half of the video, and visible transitions may occasionally appear between different parts. Researchers stated that they are actively researching solutions to overcome these challenges.

Notably, other companies are also exploring real-time AI video generation. Odyssey, for example, recently introduced an autoregressive world model that adjusts video frame by frame in response to user input, enabling a more direct interactive experience.

The emergence of StreamDiT marks an important milestone in AI video generation technology, signaling a broad future for real-time interactive video content creation.

Related News

A Test-Time Scaling Moment for Video Generation: Tsinghua Open-Sources Video-T1, Boosting Performance Without Retraining

As a medium carrying rich spatio-temporal information and semantics, video is crucial for AI to understand and simulate the real world. Video generation is an important direction of generative AI, and its performance is currently improved mainly by scaling up the base model's parameters and pre-training data; larger models are the foundation of better results, but they also mean far more demanding compute requirements. Inspired by the use of Test-Time Scaling in LLMs, a research team from Tsinghua University and Tencent has explored Test-Time Scaling for video generation for the first time, showing that video generation can also benefit from Test-Time Scaling and proposing the efficient Tree-of-Frames method to extend this scaling paradigm.
3/26/2025 1:07:00 PM
机器之心

New BeanPod Video Generation Model to Be Released Tomorrow with Support for Seamless Multi-Camera Narration and Other Functions

Tomorrow, the 2025 FORCE Original Power Conference will be held. During the conference, capability upgrades across the DouBao large model family will be unveiled, and the highly anticipated new DouBao · Video Generation Model will also be officially released. According to reports, the new DouBao · Video Generation Model has several outstanding features.
6/16/2025 9:49:01 AM
AI在线

MagicTryOn: An AI Video Virtual Try-On Framework Built on the Wan2.1 Video Model

In the modern fashion industry, Video Virtual Try-On (VVT) has gradually become an important part of the user experience. The technology aims to simulate the natural interaction between clothing and human body movements in video, producing realistic results under dynamic changes. However, current VVT methods still face challenges such as spatio-temporal consistency and preservation of garment content. To address these issues, researchers proposed MagicTryOn, a virtual try-on framework built on a large-scale video diffusion transformer (Diffusion Transformer).
6/16/2025 12:01:13 PM
AI在线