Image Captioning with GPT-2

  • Category: Transformer-based architecture
  • Made with GPT-2, ViT, Python

Experience

The goal of this project is to improve image captioning by exploring transformer-based architectures that combine vision and language. Although previous work has combined images and text with transformers for various tasks, our model delivers competitively accurate captioning while retaining the computational efficiency and scalability of transformers. This contributes to the search for efficient multi-modal architectures and to the broader goal of reducing carbon emissions from computing.

We fed the raw text and images from our dataset into our transformer architecture to obtain the desired embeddings. We then passed these hidden states through pre-trained GPT-2 and ViT layers, as sketched below.
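A minimal sketch of this first step, assuming the Hugging Face transformers implementations of GPT-2 and ViT (the exact checkpoints, variable names, and the random stand-in image are illustrative, not the project's actual code):

import torch
from transformers import GPT2Tokenizer, GPT2Model, ViTModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2 = GPT2Model.from_pretrained("gpt2")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224")

# Raw text -> token + position embeddings; raw image -> patch embeddings.
text_ids = tokenizer("a dog chasing a ball", return_tensors="pt").input_ids
positions = torch.arange(text_ids.size(1)).unsqueeze(0)
text_hidden = gpt2.wte(text_ids) + gpt2.wpe(positions)    # (1, T, 768)

pixel_values = torch.randn(1, 3, 224, 224)                # stand-in for a preprocessed image
image_hidden = vit.embeddings(pixel_values)               # (1, P + 1, 768)

# One round through the first pre-trained block of each backbone.
text_hidden = gpt2.h[0](text_hidden)[0]
image_hidden = vit.encoder.layer[0](image_hidden)[0]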

The hidden states were then concatenated and passed through multi-head attention with a skip connection and batch normalization. We applied the masking strategy shown below so that text-to-text attention was strictly causal, and image-to-text attention was also masked, preventing the image stream from leaking information into the text stream across layers.
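The following sketch illustrates one such fusion step, assuming a [text | image] layout of the concatenated sequence and PyTorch's nn.MultiheadAttention (names and dimensions are illustrative):

import torch
import torch.nn as nn

def build_fusion_mask(num_text: int, num_image: int) -> torch.Tensor:
    """Boolean mask for nn.MultiheadAttention: True = not allowed to attend.

    Queries are rows, keys are columns; layout is [text tokens | image patches].
    text -> text: strictly causal; text -> image: allowed;
    image -> text: blocked; image -> image: fully allowed.
    """
    total = num_text + num_image
    blocked = torch.zeros(total, total, dtype=torch.bool)
    # Causal mask over the text-text block.
    blocked[:num_text, :num_text] = torch.triu(
        torch.ones(num_text, num_text, dtype=torch.bool), diagonal=1
    )
    # Block image queries from attending to text keys.
    blocked[num_text:, :num_text] = True
    return blocked

T, P, d = 16, 197, 768
text_hidden, image_hidden = torch.randn(1, T, d), torch.randn(1, P, d)

mha = nn.MultiheadAttention(embed_dim=d, num_heads=12, batch_first=True)
norm = nn.BatchNorm1d(d)   # the write-up uses batch normalization at this step

fused = torch.cat([text_hidden, image_hidden], dim=1)        # (1, T + P, d)
attn_out, _ = mha(fused, fused, fused, attn_mask=build_fusion_mask(T, P))
fused = fused + attn_out                                      # skip connection
fused = norm(fused.transpose(1, 2)).transpose(1, 2)           # BatchNorm1d expects (N, C, L)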

We then sent the hidden state through an MLP with another skip connection and batch normalization. Finally, the hidden state was de-concatenated so that it could be sent through the pre-trained layers in the next round. This is repeated 12 times, since the base versions of GPT-2 and ViT each have 12 layers. For generation, the image hidden state is ignored and the text hidden state is passed through the pre-trained GPT-2 language modelling head. This produces a sequence of token logits; the last token's logits are softmaxed, and we draw the next token by simple multinomial sampling.
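A hedged sketch of this generation step, assuming the Hugging Face GPT2LMHeadModel provides the pre-trained language modelling head (the fused hidden state and dimensions are illustrative stand-ins for the output of the final fused layer):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")

T, P, d = 16, 197, 768
fused = torch.randn(1, T + P, d)               # stand-in for the 12th-layer output

# De-concatenate: the first T positions are the text stream, the rest are image patches.
text_hidden, _image_hidden = fused[:, :T, :], fused[:, T:, :]

logits = lm.lm_head(text_hidden)               # (1, T, vocab_size)
probs = torch.softmax(logits[:, -1, :], dim=-1)
next_token = torch.multinomial(probs, num_samples=1)   # sample the next caption token
print(tokenizer.decode(next_token[0]))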

This model and architecture gave us the following results:

Metric    Score
BLEU-1    63.6%
BLEU-2    42.0%
BLEU-3    27.6%
CIDEr     60.3%