Google’s Image-Text AI LIMoE Outperforms CLIP On ImageNet Benchmark

Researchers at Google Brain recently trained Language-Image Mixture of Experts (LIMoE), a 5.6B-parameter image-text AI model. In zero-shot learning experiments on ImageNet, LIMoE outperforms CLIP and performs comparably to state-of-the-art models while using fewer compute resources.

The model and several experiments were described in a paper published on arXiv. LIMoE combines a sparse mixture-of-experts (MoE) scheme with the Transformer architecture, which increases the number of model parameters while maintaining low computational requirements during inference. Unlike CLIP and other “two-tower” image-text models that use separate encoder networks for images and text, LIMoE has a single encoder for both modalities, which has the potential for better scalability and generality.
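To make the "one-tower" design concrete, below is a minimal NumPy sketch of a single shared encoder consuming both image-patch tokens and text tokens. All names, sizes, and the toy `encode` function are illustrative assumptions, not the paper's implementation; the point is only that the same weights process both modalities, whereas CLIP trains two separate encoder towers.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Illustrative shared parameters (names and sizes are assumptions, not from the paper).
patch_proj = rng.normal(size=(16 * 16 * 3, d_model)) * 0.01  # linear patch embedding
text_embed = rng.normal(size=(32_000, d_model)) * 0.01       # text vocabulary embedding
encoder_w = rng.normal(size=(d_model, d_model)) * 0.01       # stand-in for the Transformer stack

def encode(tokens):
    """One shared encoder applied to EITHER modality (a single tower)."""
    return np.tanh(tokens @ encoder_w).mean(axis=0)          # pooled representation

image_tokens = rng.normal(size=(196, 16 * 16 * 3)) @ patch_proj  # 14x14 image patches
text_tokens = text_embed[rng.integers(0, 32_000, size=12)]       # 12 text tokens

image_vec = encode(image_tokens)   # both calls reuse the same weights, unlike CLIP,
text_vec = encode(text_tokens)     # which uses two separate encoders
similarity = float(image_vec @ text_vec)  # zero-shot classification ranks such scores
print(similarity)
```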

According to the Google Brain team, “Multimodal models that handle many tasks are a promising route forward, and there are two key ingredients for success: scale, and the ability to avoid interference between distinct tasks and modalities while taking advantage of synergies. Sparse conditional computation is an excellent way of doing both. It enables performant and efficient generalist models with the capacity and flexibility for the specialisation necessary to excel at individual tasks, as demonstrated by LIMoE’s solid performance with less computing.”

The development of LIMoE is part of Google's Pathways strategy for developing next-generation AI models. One tenet of this effort is sparse neural network models, wherein only a few pathways through the network are activated. This means that using the model for inference requires only a fraction of the compute resources, and thus the energy, used by a dense model of comparable size. InfoQ recently reported on Google's PaLM language model, also developed as part of the Pathways project. In 2021, InfoQ reported on Google's Switch Transformer, a sparse MoE language model that pre-dated the official Pathways announcement but was designed using some of its principles.

LIMoE is based on the Transformer architecture, in which the sequence of input tokens is processed by a series of identical blocks that contain several neural network layers, including an attention layer and a simple feed-forward layer. In LIMoE, the feed-forward layer is replaced by an expert layer, which contains several parallel feed-forward layers, called experts, and a router that determines which experts handle a given token.
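The sketch below illustrates the general idea of such a sparse expert layer with top-1 routing: a learned router scores each token against every expert, and each token is then processed only by its highest-scoring expert. This is a simplified, assumed implementation of the generic MoE pattern, not the LIMoE code itself (which also uses capacity limits and other details omitted here).

```python
import numpy as np

def moe_layer(tokens, router_w, expert_ws):
    """Top-1 sparse expert layer (illustrative sketch, not the LIMoE code).

    tokens:    (num_tokens, d_model) token representations from the attention layer
    router_w:  (d_model, num_experts) router weights
    expert_ws: list of per-expert feed-forward matrices, each (d_model, d_model)
    """
    logits = tokens @ router_w                          # router score per (token, expert)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)          # softmax over experts
    chosen = probs.argmax(axis=-1)                      # top-1 expert per token

    out = np.zeros_like(tokens)
    for e, w in enumerate(expert_ws):
        mask = chosen == e
        if mask.any():
            # only the tokens routed to this expert are processed by it,
            # weighted by their router probability
            out[mask] = (tokens[mask] @ w) * probs[mask, e:e + 1]
    return out

# Toy usage: 8 tokens, model width 16, 4 experts.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 16))
router_w = rng.normal(size=(16, 4))
expert_ws = [rng.normal(size=(16, 16)) for _ in range(4)]
print(moe_layer(tokens, router_w, expert_ws).shape)     # (8, 16)
```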

The Brain team found several challenges in training this model. One challenge, common to all MoE models, is ensuring that the model does not collapse; that is, that the router does not always choose the same expert. Another challenge, specific to multi-modal data, is “modality unbalance”: for example, the dataset may contain much more text than image data, in which case model collapse can occur for the smaller modality. To address these challenges, the team introduced two new training losses: a local entropy loss, which “encourages concentrated router weights,” and a global entropy loss, which results in “diverse expert usage.”
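The following NumPy sketch shows one common way such entropy-based auxiliary losses can be computed from the router's softmax outputs for a single modality; the function names and exact formulation are assumptions for illustration and may differ from the paper. The local term rewards each token for committing to an expert, while the global term rewards the modality as a whole for spreading its tokens across many experts.

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-9):
    """Shannon entropy of a probability distribution along the given axis."""
    return -(p * np.log(p + eps)).sum(axis=axis)

def auxiliary_losses(router_probs):
    """router_probs: (num_tokens, num_experts) router softmax outputs for ONE modality.

    local_loss:  mean entropy of each token's routing distribution; minimizing it
                 pushes every token toward a confident (concentrated) expert choice.
    global_loss: negative entropy of the averaged routing distribution; minimizing
                 it maximizes the entropy of the mean, i.e. encourages the
                 modality's tokens to use a diverse set of experts.
    """
    local_loss = entropy(router_probs, axis=-1).mean()
    global_loss = -entropy(router_probs.mean(axis=0))
    return local_loss, global_loss

# Collapsed routing (every token picks expert 0) scores well on the local term
# but poorly on the global term; balanced routing improves the global term.
collapsed = np.tile(np.array([[0.97, 0.01, 0.01, 0.01]]), (6, 1))
print(auxiliary_losses(collapsed))
```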

Google has not released the LIMoE model code but suggested that the code would be available on GitHub along with a sparse MoE model for vision within “a few months.”