Adobe Researchers Open-Source Image Captioning AI CLIP-S


CLIP-S uses a Transformer model to generate captions given an input image

Researchers from Adobe and the University of North Carolina (UNC) have open-sourced CLIP-S, an image-captioning AI model that produces fine-grained descriptions of images. In evaluations comparing its output with captions generated by other models, human judges preferred the CLIP-S captions most of the time.

CLIP-S uses a Transformer model to generate captions given an input image. During training, the model uses CLIP to determine how well the generated caption describes the image; this score is used as a reward signal for reinforcement learning (RL). To improve the grammar of the generated captions, the team fine-tuned CLIP with negative caption examples, which were generated by randomly modifying reference captions. The team also developed a new benchmark dataset, FineCapEval, which addresses the shortcomings of existing image-captioning evaluation methods by including fine-grained captions that describe image backgrounds and the relations between objects.

According to the research team, the reference captions of public datasets often describe only the most prominent objects in the images. As a result, models trained to maximise textual similarity with the reference captions tend to generate less distinctive captions that ignore the fine details that distinguish an image from others.

Many image captioning models are trained on datasets consisting of input images and reference captions; the training objective measures the similarity of the generated caption to the reference caption, using metrics such as BLEU. However, this often results in models that generate generic captions that describe only the prominent objects in the image, ignoring fine details that make the image distinctive.
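
For readers unfamiliar with this conventional setup, the snippet below sketches how a generated caption might be scored against reference captions with BLEU using NLTK. It is a generic illustration of text-overlap metrics, not the training code of any of the models discussed, and the captions shown are made up.

```python
# Illustrative only: scoring a generated caption against reference captions
# with BLEU, the kind of text-overlap metric conventional captioning models
# are trained to maximise.
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

references = [
    "a man riding a wave on a surfboard in the ocean".split(),
    "a surfer rides a large wave near the shore".split(),
]
hypothesis = "a man on a surfboard".split()

# BLEU rewards n-gram overlap with the references, so a generic caption that
# mentions only the prominent objects can still score reasonably well.
score = sentence_bleu(
    references,
    hypothesis,
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU: {score:.3f}")
```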

To address this problem, the Adobe team chose to use OpenAI’s CLIP model to measure the accuracy of the generated captions. CLIP measures the similarity between an image and a text string; the more closely the text describes the image, the higher the similarity. The researchers used this CLIP score to create a reward function, CLIP-S, for RL training to produce their captioning model.
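
The idea can be pictured as a cosine similarity between CLIP's image and text embeddings. The sketch below uses the Hugging Face Transformers CLIP implementation and a hypothetical local image file; it illustrates an image-text similarity reward in general rather than the exact reward function or scaling used by the authors.

```python
# Minimal sketch of an image-text similarity reward computed with CLIP.
# Assumes the Hugging Face "openai/clip-vit-base-patch32" checkpoint;
# "photo.jpg" is a hypothetical example image.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_reward(image: Image.Image, caption: str) -> float:
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # image_embeds and text_embeds are L2-normalised, so their dot product
    # is the cosine similarity between the image and the caption.
    similarity = (outputs.image_embeds * outputs.text_embeds).sum(dim=-1)
    return similarity.item()

image = Image.open("photo.jpg")  # hypothetical input image
print(clip_reward(image, "several rows of planes parked outside a terminal"))
```

In an RL training loop, a score of this kind replaces text-overlap metrics such as BLEU as the reward for each sampled caption.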

However, the team found that this model often generated grammatically incorrect captions, for example, by repeating words: “several rows of planes parked outside a terminal window area with fog outside a terminal window motion position area motion.” Their solution was to fine-tune the text-encoder portion of CLIP by providing negative examples with randomly repeated, inserted, or shuffled tokens. They also introduced a two-layer perceptron classifier head that detects whether a sentence is grammatically correct, training this jointly with the text-encoder fine-tuning.
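
The grammar fine-tuning relies on synthetically corrupted captions. The snippet below sketches one plausible way to produce such negatives by repeating, inserting, or shuffling tokens; the exact corruption rules and their probabilities are details of the paper, so treat this as an assumption-laden illustration.

```python
# Illustrative generation of grammatically corrupted "negative" captions by
# randomly repeating, inserting, or shuffling tokens. The authors' actual
# corruption procedure may differ in its details.
import random

def corrupt_caption(caption: str, rng: random.Random) -> str:
    tokens = caption.split()
    op = rng.choice(["repeat", "insert", "shuffle"])
    if op == "repeat":
        # Duplicate a randomly chosen token in place.
        i = rng.randrange(len(tokens))
        tokens.insert(i, tokens[i])
    elif op == "insert":
        # Copy a random token to a random position elsewhere in the caption.
        i, j = rng.randrange(len(tokens)), rng.randrange(len(tokens))
        tokens.insert(j, tokens[i])
    else:
        # Shuffle the token order.
        rng.shuffle(tokens)
    return " ".join(tokens)

rng = random.Random(0)
print(corrupt_caption("several rows of planes parked outside a terminal", rng))
```

Sentences corrupted in this way serve as negative examples, while the original reference captions act as grammatically correct positives for the classifier head.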

The team also created FineCapEval, a new benchmark dataset for evaluating fine-grained image-captioning models. The dataset contains 500 images each from the MS COCO test split and the Conceptual Captions validation split. For each image, five human workers wrote descriptions covering: the image background; the objects in the image, including shape and colour; the relationships among the objects, such as spatial relationships; and a detailed caption combining all three of those aspects. In total, the dataset contains 1k images with 5k captions for each of the four criteria.
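
Concretely, each FineCapEval image can be thought of as carrying annotations for those four criteria. The dictionary below is a hypothetical illustration of that structure; the field names, identifier, and text are invented and do not reflect the dataset's actual file format.

```python
# Hypothetical illustration of the four annotation criteria per FineCapEval
# image; the schema and values are made up for explanation purposes.
finecapeval_example = {
    "image_id": "coco_test_000123",  # hypothetical identifier
    "background": "an airport tarmac on a foggy morning",
    "object": "several white passenger planes with blue tail fins",
    "relation": "planes lined up in rows beside the terminal building",
    "overall": (
        "several white passenger planes with blue tail fins lined up in "
        "rows on a foggy airport tarmac beside the terminal building"
    ),
}
```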

To evaluate their model, the team compared its captions with those from several baseline models, using the COCO dataset as a benchmark. Although a baseline model outperformed CLIP-S on text-based metrics such as BLEU, CLIP-S did better on image-text-based metrics and on text-to-image retrieval. It also “significantly” outperformed the baselines on the team’s new FineCapEval benchmark. Finally, human judges “strongly” preferred CLIP-S-generated captions to those generated by the baseline models.

The CLIP-S code and the FineCapEval dataset are available on GitHub.