InternVid - A Large-scale Video-Text Dataset

Training AI systems that understand videos and language together remains a major challenge. New research from OpenGVLab introduces InternVid, a massive dataset that could enable next-generation video-language models. By pre-training on over 234 million video clips drawn from 760,000 hours of source video, these models achieve new levels of capability across a range of video-centric tasks.




The key innovation behind InternVid is a multi-scale video captioning approach that generates high-quality descriptions for the dataset's 7.1 million YouTube videos. This ensures strong correlation between each video and its corresponding text, unlike prior datasets that rely on noisy automatic speech recognition transcripts. The researchers combine frame-level and clip-level captioning, powered by large language models, to capture both fine-grained details and high-level video context.
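To make the idea concrete, here is a minimal sketch of a two-scale captioning step in Python. The caption_frame and summarize callables are placeholders for an image captioner and an LLM summarizer, and the sampling stride and names are illustrative assumptions, not the paper's exact implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

Frame = Any  # stand-in for a decoded video frame (e.g. a numpy array)

@dataclass
class ClipCaption:
    coarse: str  # caption of the clip's middle frame (scene-level gist)
    fine: str    # LLM summary of captions from several sampled frames

def multiscale_caption(
    frames: List[Frame],
    caption_frame: Callable[[Frame], str],   # image captioner (assumed interface)
    summarize: Callable[[List[str]], str],   # LLM summarizer (assumed interface)
    stride: int = 4,                         # frame sampling rate, illustrative
) -> ClipCaption:
    # Coarse scale: one caption for the middle frame of the clip.
    coarse = caption_frame(frames[len(frames) // 2])
    # Fine scale: caption sampled frames, then fuse them into a single
    # clip-level description with the LLM summarizer.
    per_frame = [caption_frame(f) for f in frames[::stride]]
    fine = summarize(per_frame)
    return ClipCaption(coarse=coarse, fine=fine)
```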


With this rich source of aligned video-text data, the team trains ViCLIP, a transformer-based joint encoder. After pre-training on 50 million InternVid pairs, ViCLIP achieves 75.7% top-1 accuracy on Kinetics-400 for zero-shot action recognition, substantially outperforming prior state-of-the-art video-language models. It also posts leading results on multiple video-text retrieval benchmarks.
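For context, zero-shot action recognition with a CLIP-style joint encoder reduces to nearest-neighbor matching between a video embedding and embeddings of class-name prompts. The sketch below assumes the embeddings have already been produced by a ViCLIP-like model; the dimensions and random tensors are placeholders, not the actual pipeline.

```python
import torch
import torch.nn.functional as F

def zero_shot_predict(video_emb: torch.Tensor,
                      class_text_embs: torch.Tensor) -> int:
    """Return the index of the class whose text embedding has the highest
    cosine similarity with the video embedding.
    Shapes: video_emb [D], class_text_embs [num_classes, D]."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(class_text_embs, dim=-1)
    return int((t @ v).argmax())

# Toy usage: random tensors stand in for ViCLIP video/text embeddings,
# e.g. one text prompt per Kinetics-400 class label.
video = torch.randn(512)
prompts = torch.randn(400, 512)
predicted_class = zero_shot_predict(video, prompts)
```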


Remarkably, ViCLIP matches or exceeds supervised models fine-tuned on target datasets, without any task-specific training. This transferability highlights InternVid's diversity and scale: the dataset spans 16 popular categories and roughly 6,000 motion labels, with text in more than 100 languages. Analyses also reveal higher inter-video variation and richer verb coverage compared to alternatives like WebVid.


Beyond basic understanding tasks, InternVid's aligned clips enable multimodal dialogue research. The data provides 7 million interleaved video-text sequences for directly training video chatbots. The team demonstrates interactive spatial reasoning, action recognition, and creative tasks when ViCLIP is integrated into a dialogue agent.
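A hypothetical schema for such interleaved sequences might look like the following; the field names and clip identifier format are illustrative assumptions, not the dataset's actual layout.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class VideoSegment:
    clip_id: str   # reference to an InternVid clip (format is illustrative)
    caption: str   # the clip's aligned description

# An interleaved sequence alternates text turns and video segments,
# which is the kind of training example a video chatbot consumes.
InterleavedItem = Union[str, VideoSegment]

example: List[InterleavedItem] = [
    "User: what happens in this clip?",
    VideoSegment(clip_id="video_0001/clip_03",
                 caption="a person dribbles a basketball down a court"),
    "Assistant: someone dribbles a basketball toward the hoop.",
]
```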

For generation, the clips allow text-to-video diffusion models to reach new levels of quality and coherence. On zero-shot UCF-101 evaluation, adding an aesthetics-filtered subset of InternVid clips reduces an existing model's FVD by 88 points (lower is better). Samples conditioned on the same text prompt show markedly higher fidelity, detail and consistency.
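The curation of such a generation-oriented subset can be pictured as a simple threshold filter over per-clip aesthetic scores. The sketch below is only illustrative: the scoring model, cutoff value, and function name are assumptions, not details taken from the paper.

```python
from typing import Iterable, List, Tuple

def filter_by_aesthetics(
    scored_clips: Iterable[Tuple[str, float]],  # (clip_id, aesthetic_score)
    threshold: float = 4.0,                     # illustrative cutoff, not the paper's
) -> List[str]:
    # Keep only clips whose aesthetic score clears the threshold; the
    # surviving subset is what a text-to-video model would be tuned on.
    return [clip_id for clip_id, score in scored_clips if score >= threshold]

# Toy usage with made-up scores.
kept = filter_by_aesthetics([("clip_a", 5.2), ("clip_b", 3.1), ("clip_c", 4.7)])
```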

Of course, web-scraped datasets come with caveats around biases and consent. Diversity does not guarantee fairness or completeness. But InternVid represents an exciting step towards systems that understand and generate video and language together. As model scale increases, aligned data resources become ever more crucial.


Key Facts about InternVid:

  1. Contains 7.1 million YouTube videos totaling 760,000 hours

  2. Videos segmented into 234 million aligned clip-text pairs

  3. Uses multi-scale captioning to ensure video-text correlation

  4. Captions generated via frame and clip-level language models

  5. Covers 16 categories and ~6,000 motion labels

  6. Includes 100+ languages with diversity filtering

  7. Enables training of joint video-text encoder ViCLIP

  8. ViCLIP sets new SOTA on zero-shot action recognition

  9. Also excels on multiple video-text retrieval benchmarks

  10. Matches or beats supervised models without fine-tuning

  11. Analysis shows higher variation and more verbs than WebVid

  12. Provides 7M sequences for video dialogue research

  13. Clips improve text-to-video generation quality significantly

  14. FVD reduced by 88 points on zero-shot UCF-101 benchmark

  15. Samples show higher fidelity, coherence and detail

  16. Caveats around biases and consent with web scraping

  17. Still a major advance for aligned video-language data

  18. Crucial for scaling up multimodal models

  19. Key factors are caption quality and scale

  20. Opens up many future research directions


Link to Paper - https://arxiv.org/pdf/2307.06942v1.pdf
