Training AI systems that understand video and language together remains a major challenge. New research from OpenGVLab introduces InternVid, a massive dataset that could enable next-generation video-language models. By pre-training on its 234 million video clips, drawn from source videos totaling 760,000 hours, models reach new levels of capability across a range of video-centric tasks.
The key innovation behind InternVid is a multi-scale video captioning approach that generates high-quality descriptions for the dataset's 7.1 million YouTube videos. This ensures strong correspondence between each clip and its text, unlike prior datasets that rely on noisy automatic speech recognition transcripts. The researchers combine frame-level and clip-level captioning, with a large language model fusing the frame captions, so that both fine-grained details and high-level video context are captured.
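To make the two scales concrete, here is a minimal sketch of such a pipeline: a coarse description from the clip's center frame, and a finer description built by captioning several frames and letting a language model merge them. The `caption_frame` and `summarize_captions` functions are hypothetical stand-ins, not the authors' actual models; only the coarse/fine structure reflects the paper.

```python
# Minimal sketch of a multi-scale captioning pipeline in the spirit of InternVid.
# caption_frame / summarize_captions are hypothetical placeholders for the image
# captioner and LLM summarizer; real models would replace them.

from dataclasses import dataclass

@dataclass
class Clip:
    frames: list      # decoded frames, e.g. numpy arrays
    start: float      # start time in the source video (seconds)
    end: float        # end time in the source video (seconds)

def caption_frame(frame) -> str:
    """Hypothetical image captioner applied to a single frame."""
    return "a person doing something in a room"  # placeholder output

def summarize_captions(captions: list[str]) -> str:
    """Hypothetical LLM call that fuses per-frame captions into one description."""
    return " ".join(dict.fromkeys(captions))     # placeholder: dedupe and join

def caption_clip(clip: Clip, n_frames: int = 4) -> dict:
    # Coarse scale: describe the clip via its center frame only.
    coarse = caption_frame(clip.frames[len(clip.frames) // 2])

    # Fine scale: caption several evenly spaced frames, then merge them
    # into a single, temporally aware description.
    step = max(1, len(clip.frames) // n_frames)
    fine = summarize_captions([caption_frame(f) for f in clip.frames[::step]])

    return {"coarse_caption": coarse, "fine_caption": fine}
```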
With this rich source of aligned video-text data, the team trains ViCLIP, a CLIP-style transformer that learns a joint video-text embedding space. After pre-training on 50 million InternVid pairs, ViCLIP achieves 75.7% top-1 accuracy on Kinetics-400 zero-shot action recognition, substantially outperforming prior state-of-the-art video-language models. It also posts leading results on multiple video-text retrieval benchmarks.
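The zero-shot setup follows the usual CLIP recipe: each action label becomes a text prompt, and a video is assigned to the closest prompt in the shared embedding space. The sketch below assumes generic `text_encoder` and `tokenizer` components and a prompt template of our own choosing; it illustrates the evaluation procedure, not ViCLIP's exact code.

```python
# Sketch of CLIP-style zero-shot action recognition with a video-text encoder.
# Any pair of towers producing embeddings of matching width would work here.

import torch
import torch.nn.functional as F

def zero_shot_classify(video_feat: torch.Tensor,
                       class_names: list[str],
                       text_encoder,
                       tokenizer,
                       temperature: float = 0.01) -> int:
    # Build one text prompt per action class (this template is an assumption).
    prompts = [f"a video of a person {name}" for name in class_names]
    text_feat = text_encoder(tokenizer(prompts))      # (num_classes, dim)

    # Cosine similarity between the video embedding and every class embedding.
    video_feat = F.normalize(video_feat, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    logits = video_feat @ text_feat.T / temperature   # similarity scores

    return int(logits.argmax(dim=-1))                 # predicted class index
```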
Remarkably, without any task-specific training, ViCLIP matches or exceeds supervised models fine-tuned on the target datasets. This transferability reflects InternVid's diversity and scale: the dataset spans 16 popular categories and roughly 6,000 motion labels, and covers more than 100 languages. Analyses also show greater inter-video variation and richer verb usage than alternatives such as WebVid.
Beyond core understanding tasks, InternVid's aligned clips enable multimodal dialogue research. The data provides 7 million interleaved video-text sequences for directly training video chatbots. The team demonstrates ViCLIP's capabilities on interactive spatial reasoning, action recognition, and creative tasks when it is integrated into a dialogue agent.
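The article does not spell out the release format, so the record below is only a hypothetical illustration of what an interleaved video-text sample could look like: alternating text spans and clip references drawn from a single source video.

```python
# Hypothetical schema for one interleaved video-text training sequence.
# The actual InternVid release format may differ; this only illustrates the
# idea of alternating clip references and text spans within one sample.

interleaved_sample = {
    "sequence": [
        {"type": "text",  "value": "The clip opens on a kitchen counter."},
        {"type": "video", "clip_id": "yt_abc123_0001", "start": 12.0, "end": 17.5},
        {"type": "text",  "value": "A person then whisks eggs in a bowl."},
        {"type": "video", "clip_id": "yt_abc123_0002", "start": 17.5, "end": 24.0},
    ],
    "source_video": "yt_abc123",
    "language": "en",
}
```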
For generation, the clips push text-to-video diffusion models to new levels of quality and coherence. On zero-shot UCF-101 evaluation, adding a subset of aesthetics-filtered InternVid clips reduces an existing model's FVD by 88 points (lower is better). Samples conditioned on the same text prompt show markedly higher fidelity, detail, and consistency.
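A plausible way to build such an aesthetics-filtered subset is to score a representative frame per clip and keep only the high-scoring clips. The predictor and threshold below are assumptions for illustration, not the paper's exact setup.

```python
# Sketch of aesthetics-based clip filtering for a generation-oriented subset.
# aesthetic_score stands in for whatever image-quality predictor is used;
# the 4.5 threshold is illustrative, not the paper's actual cutoff.

def aesthetic_score(frame) -> float:
    """Hypothetical per-frame aesthetic predictor returning a score in [0, 10]."""
    return 5.0  # placeholder; a real predictor would inspect the pixels

def filter_clips(clips: dict, threshold: float = 4.5) -> list:
    """Keep clip ids whose middle frame scores at or above the threshold.

    `clips` maps clip_id -> list of decoded frames.
    """
    kept = []
    for clip_id, frames in clips.items():
        mid_frame = frames[len(frames) // 2]
        if aesthetic_score(mid_frame) >= threshold:
            kept.append(clip_id)
    return kept
```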
Of course, web-scraped datasets come with caveats around biases and consent. Diversity does not guarantee fairness or completeness. But InternVid represents an exciting step towards systems that understand and generate video and language together. As model scale increases, aligned data resources become ever more crucial.
Key Facts about InternVid:
Contains 7.1 million YouTube videos totaling 760,000 hours
Videos segmented into 234 million aligned clip-text pairs
Uses multi-scale captioning to ensure video-text correlation
Captions generated via frame and clip-level language models
Covers 16 categories and ~6,000 motion labels
Includes 100+ languages with diversity filtering
Enables training of joint video-text encoder ViCLIP
ViCLIP sets new SOTA on zero-shot action recognition
Also excels on multiple video-text retrieval benchmarks
Matches or beats supervised models without fine-tuning
Analysis shows higher variation and more verbs than WebVid
Provides 7M sequences for video dialogue research
Clips improve text-to-video generation quality significantly
FVD reduced by 88 points on zero-shot UCF-101 benchmark
Samples show higher fidelity, coherence and detail
Caveats around biases and consent with web scraping
Still a major advance for aligned video-language data
Crucial for scaling up multimodal models
Caption quality and data scale are the key factors behind these gains
Opens up many future research directions
Link to Paper - https://arxiv.org/pdf/2307.06942v1.pdf