Google’s New Imagen AI Outperforms DALL-E On Text-to-Image Generation Benchmarks

Researchers from Google’s Brain Team have announced Imagen, a text-to-image AI model that can generate photorealistic images of a scene given a textual description. Imagen outperforms DALL-E 2 on the COCO benchmark and, unlike many similar models, its text encoder is pre-trained only on text data.

The model and several experiments were described in a paper published on arXiv. Imagen uses a Transformer language model to convert the input text into a sequence of embedding vectors. A series of three diffusion models then converts the embeddings into a 1024×1024-pixel image: a base model generates a 64×64 image, which two super-resolution models upscale to 256×256 and then to 1024×1024. As part of their work, the team developed an improved diffusion model architecture called Efficient U-Net and a new benchmark suite for text-to-image models called DrawBench. On the COCO benchmark, Imagen achieved a zero-shot FID score of 7.27, outperforming DALL-E 2, the previous best-performing model.
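The cascade can be pictured as three conditioned sampling calls chained together. The following Python sketch is illustrative only; the function names and stub models are hypothetical stand-ins for Imagen's components, not Google's code, but the control flow mirrors the pipeline the paper describes.

```python
from typing import Optional

import numpy as np

# Illustrative stand-ins for Imagen's components (hypothetical, not Google's
# code): each "model" is a stub so the cascade's control flow is runnable.

def t5_encode(prompt: str) -> np.ndarray:
    """Stub for the frozen T5 text encoder: prompt -> embedding sequence."""
    rng = np.random.default_rng(abs(hash(prompt)) % 2**32)
    return rng.normal(size=(len(prompt.split()), 512))

def diffusion_sample(embeddings: np.ndarray, size: int,
                     low_res: Optional[np.ndarray] = None) -> np.ndarray:
    """Stub diffusion model: returns a size x size x 3 array as the 'image'."""
    return np.zeros((size, size, 3))

def generate(prompt: str) -> np.ndarray:
    emb = t5_encode(prompt)                       # text -> embedding sequence
    img_64 = diffusion_sample(emb, 64)            # base model: 64x64 image
    img_256 = diffusion_sample(emb, 256, img_64)  # super-resolution to 256x256
    return diffusion_sample(emb, 1024, img_256)   # super-resolution to 1024x1024

print(generate("a corgi riding a bike").shape)    # (1024, 1024, 3)
```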

Rather than training a custom text encoder on image-text data, the Google team used an “off-the-shelf” text encoder, T5, to convert input text into embeddings; unlike the encoders in many similar models, T5 is pre-trained only on text. Imagen then uses a sequence of diffusion models to convert the embeddings into an image. These generative AI models use an iterative denoising process to convert Gaussian noise into samples from a data distribution; in this case, images. Google also developed a new deep-learning architecture called Efficient U-Net, which is “simpler, converges faster, and is more memory efficient” than previous U-Net implementations.
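The denoising loop itself is conceptually simple. Below is a schematic sketch of reverse diffusion, assuming a placeholder denoise_fn and a made-up noise schedule; the paper's actual update rule (and its text conditioning) is more involved.

```python
import numpy as np

def sample(denoise_fn, shape, n_steps=50, seed=0):
    """Schematic reverse-diffusion loop: start from Gaussian noise and
    repeatedly apply a denoising step. denoise_fn stands in for the trained,
    text-conditioned network; the 0.1 noise scale is a made-up schedule."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=shape)                    # start from pure Gaussian noise
    for t in reversed(range(n_steps)):
        x = denoise_fn(x, t)                      # model removes a bit of noise
        if t > 0:                                 # re-inject noise except at the end
            x = x + 0.1 * rng.normal(size=shape)
    return x

# Toy usage: a "denoiser" that just shrinks the signal toward zero.
img = sample(lambda x, t: 0.9 * x, shape=(64, 64, 3))
print(img.shape)  # (64, 64, 3)
```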

In addition to evaluating Imagen on the COCO validation set, the researchers developed a new image-generation benchmark, DrawBench. The benchmark consists of a collection of text prompts “designed to probe different semantic properties of models,” including composition, cardinality, and spatial relations. DrawBench uses human evaluators to compare two different models.

First, each model generates images from the prompts. Then, human evaluators compare the results from the two models, indicating which produced the better image. Using DrawBench, the Brain team evaluated Imagen against DALL-E 2 and three other similar models; the team found that the judges “exceedingly” preferred the images generated by Imagen over those from the other models.
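To see the shape of the data such a study produces, here is a toy tally of a pairwise comparison; the vote values below are fabricated purely to illustrate the bookkeeping and are not DrawBench's actual results.

```python
from collections import Counter

# Toy tally of a DrawBench-style pairwise study (votes are made up, purely
# to show the bookkeeping, not actual DrawBench results).
votes = ["model_a", "model_a", "tie", "model_b", "model_a", "tie"]

counts = Counter(votes)
total = len(votes)
for choice in ("model_a", "model_b", "tie"):
    print(f"{choice}: {counts[choice]}/{total} = {counts[choice] / total:.0%}")
```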