DeepMind recently trained Flamingo, an 80-billion-parameter vision-language model (VLM). Flamingo combines separately pre-trained vision and language models and outperforms all other few-shot learning models on 16 vision-language benchmarks. Flamingo can also chat with users, answering questions about input images and videos.
The model was announced by lead researchers Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, and Antoine Miech. Flamingo is based on two previous models developed by DeepMind: Chinchilla, a 70 billion parameter language generation model; and Perceiver, a multimodal classifier model. Flamingo combines these two models into a single neural network, which is then trained on interleaved image and text data sequences.
The result is an AI that can learn new vision-language tasks with little or no additional training data. Models like Flamingo hold great promise to benefit society in practical ways. Flamingo’s abilities pave the way toward rich interactions with learned visual language models that can enable better interpretability and exciting new applications, like a visual assistant which helps people in everyday life.
Multimodal VLMs, such as CLIP, have proven successful at zero-shot learning; however, because such models only provide a score indicating the similarity between an image and a textual description, their range of tasks is limited. Other VLMs, such as DALL-E, can generate photo-realistic images from a description but do not generate language, and so cannot perform tasks such as visual question answering (VQA) or image captioning.
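To make the limitation concrete, here is a minimal sketch of how a similarity-scoring model can be used for zero-shot classification: each candidate caption is scored against the image, and the best-scoring caption wins. The embeddings below are made-up toy vectors and the function names are hypothetical; a real system would obtain the embeddings from trained image and text encoders.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, text_embeddings):
    """Score every candidate caption against the image; return the best one.

    The model only ranks captions it is given. It cannot produce new text.
    """
    scores = {
        caption: cosine_similarity(image_embedding, embedding)
        for caption, embedding in text_embeddings.items()
    }
    best_caption = max(scores, key=scores.get)
    return best_caption, scores


# Toy embeddings (assumed values, for illustration only).
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "a photo of a flamingo": [1.0, 0.0, 0.1],
    "a photo of a dog": [0.0, 1.0, 0.0],
    "a photo of a car": [0.1, 0.2, 1.0],
}

best_caption, scores = zero_shot_classify(image_embedding, text_embeddings)
print(best_caption)
```

Because the output is always one of the supplied captions, open-ended tasks such as VQA or captioning do not fit this scheme.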
To support both single-frame images and video, the researchers incorporated a Perceiver model that generates a “small fixed number of visual tokens” for both images and videos. This improved the model’s scalability with input size. Finally, the team needed a large combined image-text training dataset. For this, the team scraped text and images from about 43 million web pages to create the MultiModal MassiveWeb (M3W) dataset, which contains 185M images and 182 GB of text. Flamingo was trained on a combination of M3W and several other pre-existing image-text datasets.
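The idea of a fixed number of visual tokens can be illustrated with a simplified cross-attention step: a fixed set of latent query vectors attends over however many visual features the input produces, so the output size depends only on the number of latents. This is a sketch under stated assumptions, not DeepMind's implementation: it uses a single attention head, omits learned projections and feed-forward layers, and the sizes and random values are hypothetical.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def resample(latents, features):
    """Cross-attend latent queries to visual features (keys and values).

    Returns exactly one output vector per latent, no matter how many
    feature vectors are passed in.
    """
    dim = len(latents[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in latents:
        logits = [
            scale * sum(q * k for q, k in zip(query, key)) for key in features
        ]
        weights = softmax(logits)
        outputs.append(
            [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
        )
    return outputs


rng = random.Random(0)
DIM, NUM_LATENTS = 8, 4  # hypothetical sizes, chosen for illustration

# Stand-in for learned latent queries (random here, learned in a real model).
latents = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_LATENTS)]


def fake_features(count):
    """Stand-in for vision-encoder outputs: `count` random feature vectors."""
    return [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(count)]


image_tokens = resample(latents, fake_features(16))      # a single frame
video_tokens = resample(latents, fake_features(16 * 8))  # eight frames

print(len(image_tokens), len(video_tokens))
```

The video input carries eight times as many features as the image, yet both yield the same number of tokens, which is why the cost downstream of this step does not grow with input size.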
To evaluate Flamingo, DeepMind tested it on 16 multimodal benchmarks for various tasks, including visual dialogue, VQA, captioning, and image classification. In few-shot learning scenarios, Flamingo outperformed previous best results “by a large margin.” On six of the benchmarks, Flamingo outperformed state-of-the-art fine-tuned models without itself being fine-tuned; instead, Flamingo was used in a few-shot scenario and given only 32 samples, “around 1000 times less” than the fine-tuned models.
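A few-shot scenario of this kind can be pictured as building a single interleaved prompt: a handful of solved image-text examples followed by a new image whose text the model must complete. The sketch below is an assumption about how such a prompt might be assembled, not Flamingo's actual interface; the `<image>` placeholder, the `Output:` prefix, the `Example` class, and the file names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Example:
    image_path: str  # hypothetical path to a support image
    text: str        # the expected answer or caption for that image


def build_few_shot_prompt(support, query_image, prefix="Output:"):
    """Interleave support examples with a final unanswered query.

    Returns the ordered list of image paths and the text prompt, in which
    each "<image>" placeholder marks where the matching image belongs.
    """
    images = []
    parts = []
    for example in support:
        images.append(example.image_path)
        parts.append(f"<image>{prefix} {example.text}")
    # The query image comes last, with the answer left for the model.
    images.append(query_image)
    parts.append(f"<image>{prefix}")
    return images, "\n".join(parts)


support = [
    Example("flamingo.jpg", "A flamingo standing in shallow water."),
    Example("dog.jpg", "A dog catching a frisbee."),
]
images, prompt = build_few_shot_prompt(support, "query.jpg")
print(prompt)
```

No model weights change in this setup: the support examples act only as context, which is what distinguishes few-shot use from fine-tuning.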