InternVid - A Large-scale Video-Text Dataset

Training AI systems that understand videos and language together remains a major challenge. New research from OpenGVLab introduces InternVid, a massive dataset that could enable next-generation video-language models. By pre-training on over 234 million video clips drawn from 760,000 hours of source video, these models achieve new levels of capability across a range of video-centric tasks.




The key innovation behind InternVid is a multi-scale video captioning approach that generates high-quality descriptions for the dataset's 7.1 million YouTube videos. This ensures strong correlation between each video and its corresponding text, unlike prior datasets that rely on noisy automatic speech recognition transcripts. The researchers combine frame-level and clip-level captioning, powered by large language models, to capture both fine-grained details and high-level video context.
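To make the idea concrete, here is a minimal sketch of a two-scale captioning step in Python. The caption_frame and summarize callables are placeholders for an image captioner and an LLM summarizer, and the sampling stride and names are illustrative assumptions, not the paper's exact implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

Frame = Any  # stand-in for a decoded video frame (e.g. a numpy array)

@dataclass
class ClipCaption:
    coarse: str  # caption of the clip's middle frame (scene-level gist)
    fine: str    # LLM summary of captions from several sampled frames

def multiscale_caption(
    frames: List[Frame],
    caption_frame: Callable[[Frame], str],   # image captioner (assumed interface)
    summarize: Callable[[List[str]], str],   # LLM summarizer (assumed interface)
    stride: int = 4,                         # frame sampling rate, illustrative
) -> ClipCaption:
    # Coarse scale: one caption for the middle frame of the clip.
    coarse = caption_frame(frames[len(frames) // 2])
    # Fine scale: caption sampled frames, then fuse them into a single
    # clip-level description with the LLM summarizer.
    per_frame = [caption_frame(f) for f in frames[::stride]]
    fine = summarize(per_frame)
    return ClipCaption(coarse=coarse, fine=fine)
```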


With this rich source of aligned video-text data, the team trains ViCLIP, a transformer-based joint encoder. After pre-training on 50 million InternVid pairs, ViCLIP achieves 75.7% top-1 accuracy on Kinetics-400 for zero-shot action recognition, substantially outperforming prior state-of-the-art video-language models. It also posts leading results on multiple video-text retrieval benchmarks.
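For context, zero-shot action recognition with a CLIP-style joint encoder reduces to nearest-neighbor matching between a video embedding and embeddings of class-name prompts. The sketch below assumes the embeddings have already been produced by a ViCLIP-like model; the dimensions and random tensors are placeholders, not the actual pipeline.

```python
import torch
import torch.nn.functional as F

def zero_shot_predict(video_emb: torch.Tensor,
                      class_text_embs: torch.Tensor) -> int:
    """Return the index of the class whose text embedding has the highest
    cosine similarity with the video embedding.
    Shapes: video_emb [D], class_text_embs [num_classes, D]."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(class_text_embs, dim=-1)
    return int((t @ v).argmax())

# Toy usage: random tensors stand in for ViCLIP video/text embeddings,
# e.g. one text prompt per Kinetics-400 class label.
video = torch.randn(512)
prompts = torch.randn(400, 512)
predicted_class = zero_shot_predict(video, prompts)
```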


Remarkably, ViCLIP matches or exceeds supervised models fine-tuned on target datasets, without any task-specific training. This transferability highlights InternVid's diversity and scale: the dataset spans 16 popular categories and roughly 6,000 motion labels, with text in more than 100 languages. Analyses also reveal higher inter-video variation and richer verb coverage compared to alternatives like WebVid.


Beyond basic understanding tasks, InternVid's aligned clips enable multimodal dialogue research. The data provides 7 million interleaved video-text sequences for directly training video chatbots. The team demonstrates interactive spatial reasoning, action recognition, and creative tasks when ViCLIP is integrated into a dialogue agent.
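A hypothetical schema for such interleaved sequences might look like the following; the field names and clip identifier format are illustrative assumptions, not the dataset's actual layout.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class VideoSegment:
    clip_id: str   # reference to an InternVid clip (format is illustrative)
    caption: str   # the clip's aligned description

# An interleaved sequence alternates text turns and video segments,
# which is the kind of training example a video chatbot consumes.
InterleavedItem = Union[str, VideoSegment]

example: List[InterleavedItem] = [
    "User: what happens in this clip?",
    VideoSegment(clip_id="video_0001/clip_03",
                 caption="a person dribbles a basketball down a court"),
    "Assistant: someone dribbles a basketball toward the hoop.",
]
```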

For generation, the clips allow text-to-video diffusion models to reach new levels of quality and coherence. On zero-shot UCF-101 evaluation, adding an aesthetics-filtered subset of InternVid clips reduces an existing model's FVD by 88 points (lower is better). Samples conditioned on the same text prompt show markedly higher fidelity, detail and consistency.
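The curation of such a generation-oriented subset can be pictured as a simple threshold filter over per-clip aesthetic scores. The sketch below is only illustrative: the scoring model, cutoff value, and function name are assumptions, not details taken from the paper.

```python
from typing import Iterable, List, Tuple

def filter_by_aesthetics(
    scored_clips: Iterable[Tuple[str, float]],  # (clip_id, aesthetic_score)
    threshold: float = 4.0,                     # illustrative cutoff, not the paper's
) -> List[str]:
    # Keep only clips whose aesthetic score clears the threshold; the
    # surviving subset is what a text-to-video model would be tuned on.
    return [clip_id for clip_id, score in scored_clips if score >= threshold]

# Toy usage with made-up scores.
kept = filter_by_aesthetics([("clip_a", 5.2), ("clip_b", 3.1), ("clip_c", 4.7)])
```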

Of course, web-scraped datasets come with caveats around biases and consent. Diversity does not guarantee fairness or completeness. But InternVid represents an exciting step towards systems that understand and generate video and language together. As model scale increases, aligned data resources become ever more crucial.


Key Facts about InternVid:

  1. Contains 7.1 million YouTube videos totaling 760,000 hours

  2. Videos segmented into 234 million aligned clip-text pairs

  3. Uses multi-scale captioning to ensure video-text correlation

  4. Captions generated via frame and clip-level language models

  5. Covers 16 categories and ~6,000 motion labels

  6. Includes 100+ languages with diversity filtering

  7. Enables training of joint video-text encoder ViCLIP

  8. ViCLIP sets new SOTA on zero-shot action recognition

  9. Also excels on multiple video-text retrieval benchmarks

  10. Matches or beats supervised models without fine-tuning

  11. Analysis shows higher variation and more verbs than WebVid

  12. Provides 7M sequences for video dialogue research

  13. Clips improve text-to-video generation quality significantly

  14. FVD reduced by 88 points on zero-shot UCF-101 benchmark

  15. Samples show higher fidelity, coherence and detail

  16. Caveats around biases and consent with web scraping

  17. Still a major advance for aligned video-language data

  18. Crucial for scaling up multimodal models

  19. Key factors are caption quality and scale

  20. Opens up many future research directions


Link to Paper - https://arxiv.org/pdf/2307.06942v1.pdf
