Open-Sora 1.1, the latest iteration of the open-source video generation model developed by the Colossal-AI team, signifies a substantial leap forward in this rapidly evolving field. Building upon the foundation of its predecessor, Open-Sora 1.0, this version introduces significant improvements in capabilities, training efficiency, and overall flexibility.

This introduction only scratches the surface; the rest of the post covers:

  1. Addressing Limitations of Open-Sora 1.0
  2. Unveiling the Technical Advancements
  3. Unveiling the Data Preprocessing Pipeline
  4. Demystifying the Bucket System and Masking Strategy
  5. Addressing Training Challenges
  6. Limitations and Future Work
  7. Conclusion

Addressing Limitations of Open-Sora 1.0

While Open-Sora 1.0 offered a promising approach to video generation, it had limitations in terms of video length (capped at 2 seconds) and overall quality. Open-Sora 1.1 tackles these limitations head-on, delivering the following enhancements:

  • Extended Video Length: One of the most notable improvements is the ability to generate videos up to 15 seconds long, a significant increase compared to the 2-second limit of Open-Sora 1.0. This expanded timeframe opens doors for a wider range of creative applications.
  • Variable Output: Open-Sora 1.1 offers more flexibility in video output. It can generate videos in various resolutions (ranging from 144p to 720p) and aspect ratios, catering to diverse project needs.
  • Image Generation: In addition to video generation, Open-Sora 1.1 expands its capabilities to include image generation. This versatility allows users to create high-quality still images alongside videos.
  • Enhanced Prompting: Open-Sora 1.1 introduces advancements in prompting, enabling users to leverage images and videos as prompts for video generation. This functionality unlocks exciting possibilities such as:
    • Animating images: Breathe life into static images by generating short video sequences.
    • Extending generated videos: Seamlessly extend the length of a generated video while maintaining coherence.
    • Video-to-video editing: Edit existing videos by inserting or modifying segments using video prompts.
    • Connecting videos: Create smooth transitions between different video clips for a unified flow.

Unveiling the Technical Advancements

The significant improvements in Open-Sora 1.1 are attributed to several key technical advancements:

  • Increased Model Size and Dataset: Open-Sora 1.1 leverages a significantly larger dataset (10 million videos) compared to Open-Sora 1.0 (400,000 videos). Additionally, the model itself boasts 700 million parameters, potentially contributing to improved video quality.
  • Multi-task Learning: Open-Sora 1.1 employs a multi-tasking approach during training. This allows the model to handle various video properties simultaneously, including resolution, frame length, and aspect ratio. By considering these factors during training, the model can adapt and generate videos with diverse characteristics.
  • Model Architecture Modifications (ST-DiT-2): The developers have introduced ST-DiT-2, a refined version of the original model architecture used in Open-Sora 1.0. This modification promotes better training stability and overall performance, laying the groundwork for improved video generation.
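
To make the architecture bullet more concrete, here is a minimal, hypothetical PyTorch sketch of a spatial-temporal transformer block in the spirit of ST-DiT: spatial self-attention within each frame, temporal self-attention across frames, cross-attention to the text prompt, and an MLP. The dimensions, layer names, and pre-norm layout are assumptions for illustration; Open-Sora's actual ST-DiT-2 block additionally handles timestep conditioning and other refinements not shown here.

```python
# Hypothetical, simplified spatial-temporal block; not Open-Sora's ST-DiT-2 code.
import torch
import torch.nn as nn


class SpatialTemporalBlock(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.norm4 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, patches_per_frame, dim), text: (batch, text_tokens, dim)
        b, t, s, d = x.shape

        # 1) Spatial attention: every frame attends over its own patch tokens.
        h = x.reshape(b * t, s, d)
        q = self.norm1(h)
        h = h + self.spatial_attn(q, q, q)[0]

        # 2) Temporal attention: every spatial position attends across frames.
        h = h.reshape(b, t, s, d).transpose(1, 2).reshape(b * s, t, d)
        q = self.norm2(h)
        h = h + self.temporal_attn(q, q, q)[0]
        h = h.reshape(b, s, t, d).transpose(1, 2).reshape(b, t * s, d)

        # 3) Cross-attention to the text prompt, then the feed-forward MLP.
        q = self.norm3(h)
        h = h + self.cross_attn(q, text, text)[0]
        h = h + self.mlp(self.norm4(h))
        return h.reshape(b, t, s, d)
```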

Unveiling the Data Preprocessing Pipeline

Open-Sora 1.1’s success hinges not only on the model’s architecture and training but also on the quality of the data it’s trained on. To ensure high-quality training data, a meticulous data processing pipeline is employed. Here’s a breakdown of the key steps involved (a code sketch of the filtering logic follows the list):

  1. Raw Video Splitting: The process begins with raw video footage, obtained either from online sources or public datasets. These raw videos are segmented into shorter clips based on scene detection algorithms. This segmentation ensures that each training sample focuses on a coherent scene within the video.
  2. Multi-Score Evaluation: Following segmentation, each video clip undergoes an evaluation process where multiple scores are predicted using pre-trained models. These scores assess various aspects of the video’s quality and suitability for training Open-Sora 1.1.
    • Aesthetic Score: This score gauges the visual appeal of the video clip. Videos deemed aesthetically pleasing by the model are more likely to be selected for training.
    • Optical Flow Score: This score analyzes the motion patterns within the video. Clips with significant motion are more informative for training the model’s ability to generate dynamic video content.
    • Optical Character Recognition (OCR): This step involves detecting and recognizing any text present within the video clip. Textual information can provide valuable context for the model during training, aiding in the generation of semantically consistent videos.
  3. Captioning and Matching Score Calculation: Only video clips that pass the initial evaluation based on aesthetic score, optical flow, and presence of text (if applicable) proceed to the next stage. Here, captions are generated for these shortlisted clips. These captions provide textual descriptions of the video content, further enriching the training data. Additionally, a matching score is calculated to assess the alignment between the generated captions and the actual video content. Videos with a strong correlation between captions and visuals are more valuable for training.
  4. Final Filtering and Camera Motion Detection: In the final stage, video clips are filtered based on the matching score. Clips with weak caption-video alignment are discarded. The remaining clips undergo camera motion detection. This step analyzes camera movements within the video, providing valuable information for the model to learn and generate videos with diverse camera work.
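
As a concrete illustration of the pipeline above, here is a minimal, hypothetical Python sketch of the clip-filtering logic. The scoring callables stand in for the pre-trained models mentioned in the steps, and every threshold value is invented for illustration; the real pipeline and its cutoffs live in Open-Sora's data-processing tools.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Clip:
    path: str


def filter_clips(
    clips: List[Clip],
    aesthetic_score: Callable[[Clip], float],
    optical_flow_score: Callable[[Clip], float],
    ocr_text: Callable[[Clip], str],
    caption: Callable[[Clip], str],
    match_score: Callable[[Clip, str], float],
    aes_min: float = 4.5,   # illustrative thresholds, not Open-Sora's values
    flow_min: float = 0.1,
    match_min: float = 0.2,
) -> List[Tuple[Clip, str]]:
    """Return (clip, caption) pairs that survive every filtering stage."""
    kept = []
    for clip in clips:
        # Stage 2: multi-score evaluation with pre-trained models.
        if aesthetic_score(clip) < aes_min:
            continue  # visually unappealing clips are dropped
        if optical_flow_score(clip) < flow_min:
            continue  # nearly static clips carry little temporal signal
        detected_text = ocr_text(clip)  # on-screen text adds useful context

        # Stage 3: caption the surviving clips and score caption-video alignment.
        cap = caption(clip)
        if detected_text:
            cap += f" The video contains the text: '{detected_text}'."
        if match_score(clip, cap) < match_min:
            continue  # weak caption-video alignment

        # Stage 4: camera-motion detection would run here before final selection.
        kept.append((clip, cap))
    return kept
```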

Producing High-Quality Video-Text Pairs: By meticulously processing raw videos through these steps, the data pipeline ensures the creation of high-quality video-text pairs for training Open-Sora 1.1. These video-text pairs boast several key characteristics:

  • High Aesthetic Quality: Videos selected for training are visually appealing, contributing to the generation of aesthetically pleasing outputs.
  • Large Video Motion: The presence of significant motion within the training data equips the model to generate dynamic and engaging videos.
  • Strong Semantic Consistency: Textual information extracted from videos and aligned captions during training foster semantic consistency, enabling the model to generate videos that align well with their descriptions.

In essence, the data processing pipeline acts as a crucial filter, selecting and preparing only the most suitable video-text pairs for training Open-Sora 1.1. This meticulous process lays the foundation for the model’s ability to generate high-quality and semantically meaningful videos.

Demystifying the Bucket System and Masking Strategy

  • Bucket System for Efficient Multi-Resolution Training: Efficiently training a model on videos with varying resolutions can be computationally demanding. Open-Sora 1.1 tackles this challenge by introducing a bucket system. Videos are grouped into buckets based on their resolution, number of frames, and aspect ratio, so that videos with similar properties are trained together, which keeps processing efficient even on GPUs with limited resources. The system also exposes per-bucket settings such as keep_prob and batch_size to control the computational cost and balance the GPU load during training (a minimal sketch follows this list).
  • Masking Strategy for Image/Video Conditioning: Open-Sora 1.1 uses its transformer backbone for image-to-video and video-to-video generation tasks. To guide the generation process based on an image or video prompt, the model employs a masking strategy: specific frames within the prompt are revealed (unmasked), allowing the model to treat those frames as fixed conditions and incorporate them into the generated video.
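
The following is a minimal, hypothetical sketch of the bucket grouping described in the first bullet above. The bucket keys and the keep_prob/batch_size values are made up for illustration; Open-Sora ships its own bucket configuration.

```python
import random
from collections import defaultdict

# bucket key -> (keep_prob, batch_size). Keys and values here are invented
# for illustration; they are not Open-Sora's actual configuration.
BUCKET_CONFIG = {
    ("144p", 16, "9:16"): (1.0, 64),
    ("240p", 16, "16:9"): (0.8, 32),
    ("480p", 32, "16:9"): (0.5, 8),
    ("720p", 64, "16:9"): (0.2, 2),
}


def assign_buckets(clips, rng: random.Random):
    """Group clips (dicts with 'resolution', 'num_frames', 'aspect_ratio') into buckets."""
    buckets = defaultdict(list)
    for clip in clips:
        key = (clip["resolution"], clip["num_frames"], clip["aspect_ratio"])
        if key not in BUCKET_CONFIG:
            continue  # shapes we do not train on are skipped
        keep_prob, _ = BUCKET_CONFIG[key]
        # keep_prob randomly down-samples expensive buckets to balance GPU load.
        if rng.random() <= keep_prob:
            buckets[key].append(clip)
    return buckets


def iter_batches(buckets):
    """Yield batches whose samples all share resolution, frame count, and aspect ratio."""
    for key, clips in buckets.items():
        _, batch_size = BUCKET_CONFIG[key]
        for i in range(0, len(clips), batch_size):
            yield key, clips[i : i + batch_size]
```

Because every batch is drawn from a single bucket, the samples in it share resolution, frame count, and aspect ratio, so they can be stacked without padding, and the per-bucket batch_size keeps memory use roughly even across cheap and expensive shapes.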

Applying the masking strategy directly to a pre-trained model (such as Open-Sora 1.0’s) initially produced poor results: the model struggled to handle frames with different timesteps within a single sample, as it wasn’t trained for this scenario. To address this, Open-Sora 1.1 incorporates a random masking strategy during training, unmasking frames in various combinations (first frame, last frame, random frames, and so on). By exposing the model to diverse masking scenarios during training, it learns to handle frames with different timesteps more effectively when images or videos are used as conditions during video generation.
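
Below is a minimal, hypothetical sketch of such a random masking scheme. The specific modes and probabilities are assumptions for illustration, not the exact distribution Open-Sora 1.1 trains with.

```python
import random


def sample_frame_mask(num_frames: int, mask_ratio: float, rng: random.Random):
    """Return a boolean list where True marks an unmasked (conditioning) frame."""
    mask = [False] * num_frames
    if rng.random() > mask_ratio:
        return mask  # most samples stay plain text-to-video, with no conditioning

    mode = rng.choice(["first", "last", "first_and_last", "random"])
    if mode == "first":
        mask[0] = True                      # image-to-video: animate a given frame
    elif mode == "last":
        mask[-1] = True                     # generate a video ending on a given frame
    elif mode == "first_and_last":
        mask[0] = mask[-1] = True           # connect / interpolate between two frames
    else:
        k = rng.randint(1, max(1, num_frames // 4))
        for i in rng.sample(range(num_frames), k):
            mask[i] = True                  # arbitrary anchor frames (video editing)
    return mask
```

The frames marked True are supplied as conditions (for example, the still image to animate or the clip to extend), while the remaining frames are generated; mixing the two within one sample is precisely the multi-timestep situation described above.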

Addressing Training Challenges

The development team behind Open-Sora 1.1 acknowledges the limitations imposed by resource constraints during training. These limitations necessitated careful monitoring and adjustments to the training strategy throughout the process. Here’s a breakdown of the training details and the reasoning behind specific choices:

Dataset Limitations:

  • Originally Planned Dataset: The team initially aimed to use a much larger dataset, potentially reaching 30 million videos (Panda-70M plus additional data).
  • Limited Preprocessing: Disk I/O bottlenecks prevented the full dataset from being processed, leaving roughly 10 million videos for training.

Training Details:

  • Fine-tuning: The training process began with fine-tuning the model on images of different resolutions for 6k steps, leveraging checkpoints from Pixart-alpha-1024. This demonstrated the model’s ability to adapt to generating images with varying resolutions.
  • SpeeDiT for Accelerated Diffusion Training: SpeeDiT (a diffusion training acceleration algorithm) was employed to expedite the diffusion training process.
  • Multi-Stage Training: The pre-training phase involved multiple stages, each with distinct configurations:
    • Stage 1:
      • Training ran with gradient checkpointing for 24k steps (approximately 4 days on 64 H800 GPUs). For the same number of samples seen, a smaller batch size (without gradient checkpointing) was found to be more effective at this early stage.
      • The majority of videos used were in 240p resolution. Although video quality appeared acceptable, temporal knowledge seemed limited.
      • A mask ratio of 10% was used.
      • To address the limitations observed, several adjustments were made:
        • Switching to a smaller batch size without gradient checkpointing.
        • Adding fps conditioning.
        • Training for 40k steps (approximately 2 days).
        • Utilizing a lower resolution (144p) based on findings from Open-Sora 1.0 suggesting the model can learn temporal knowledge with lower resolution videos.
        • Increasing the mask ratio to 25% as image conditioning wasn’t performing well.
        • Adopting QK-normalization for training stability, inspired by SD3; the model adapted to it quickly (a brief sketch follows this list).
        • Switching from iddpm-speed to iddpm.
    • Stage 2 & 3:
      • These stages focused on progressively higher video resolutions:
        • Stage 2: Primarily used 240p and 480p videos.
        • Stage 3: Primarily used 480p and 720p videos.
      • Each stage involved training for approximately one day using all pre-training data.
      • The final stage benefited from loading the optimizer state from the previous stage, facilitating faster learning.
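
QK-normalization, mentioned in the Stage 1 adjustments, normalizes queries and keys before the attention dot product so the attention logits stay bounded, which tends to stabilize training. Here is a minimal, hypothetical PyTorch sketch; the norm type, head layout, and names are illustrative rather than Open-Sora's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class QKNormAttention(nn.Module):
    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.heads = heads
        self.head_dim = dim // heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # One LayerNorm for queries and one for keys, applied per head.
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, tokens, head_dim).
        q = q.view(b, n, self.heads, self.head_dim).transpose(1, 2)
        k = k.view(b, n, self.heads, self.head_dim).transpose(1, 2)
        v = v.view(b, n, self.heads, self.head_dim).transpose(1, 2)
        # QK-norm: normalize queries and keys before computing attention scores.
        q, k = self.q_norm(q), self.k_norm(k)
        out = F.scaled_dot_product_attention(q, k, v)
        return self.proj(out.transpose(1, 2).reshape(b, n, d))
```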

Overall training of Open-Sora 1.1 required roughly 9 days on 64 H800 GPUs.

Limitations and Future Work

The developers of Open-Sora 1.1 acknowledge several limitations in the current iteration, highlighting areas for future improvement:

  • Generation Failure: In certain cases, particularly when dealing with complex content or a large number of tokens, the model fails to generate the desired scene. Potential causes include temporal attention collapse and a bug identified in the code. The team is actively working on a fix and plans to increase model size and training data volume in future versions to enhance generation quality.
  • Noisy Generation and Fluency: The generated videos can exhibit noise and lack fluency, especially for longer videos. This is attributed to the absence of a temporal VAE (Variational Autoencoder). Inspired by Pixart-Sigma’s findings regarding the ease of adapting to a new VAE, the developers plan to incorporate a temporal VAE in the next version.
  • Lack of Time Consistency: Maintaining consistency across video frames, particularly in longer videos, remains a challenge. The limited training FLOPs (Floating-point Operations) are believed to be a contributing factor. The team plans to address this by collecting more data and extending the training process.
  • Poor Human Video Generation: The model struggles with generating high-quality videos featuring humans. This is likely due to the limited amount of human data used for training. The developers plan to collect more human data and fine-tune the model for improved human video generation.
  • Low Aesthetic Score: The current aesthetic quality of generated videos is not optimal. The lack of aesthetic score filtering during training, hampered by I/O bottlenecks, is considered a potential culprit. The team plans to implement data filtering based on aesthetic scores and fine-tune the model to generate more aesthetically pleasing videos.
  • Degraded Quality for Longer Videos: The quality of generated videos tends to decrease with increasing video length for the same prompt. This suggests that the model struggles to maintain image quality consistently across different video lengths. The developers plan to address this by refining the model’s ability to adapt to varying sequence lengths.

Conclusion

Open-Sora 1.1 represents a significant leap forward in the democratization of video generation technology. By offering extended video length, variable output formats, enhanced prompting capabilities, and improved training efficiency, it gives users greater creative freedom. The technical advancements, including a larger model and dataset, multi-task learning, and a refined architecture, lay the groundwork for further improvements. While limitations remain, the developers’ commitment to addressing them, through strategies like incorporating a temporal VAE, collecting more diverse training data, and refining training processes, paves the way for even more robust and versatile video generation in future iterations. Open-Sora 1.1 serves as a stepping stone towards a future where high-quality video creation is accessible to a broader audience. If you don’t believe me, try creating a video yourself with the HuggingFace demo!
