Image Captioning with GPT-2
- Category: Transformer-based architecture
- Made with GPT-2, ViT, Python
The goal of this project is to improve image captioning by exploring transformer-based architectures that combine vision and language. Although previous work has combined images and text with transformers for various tasks, our model provides competitively accurate captioning while retaining the computational efficiency and scalability of transformers. This supports the search for efficient multi-modal architectures and contributes to the broader goal of reducing carbon emissions from computing.
We fed the raw text and images from the dataset into our transformer architecture to obtain the desired embeddings. These hidden states were then fed into the pre-trained GPT-2 layers and ViT layers.
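A minimal sketch of this embedding step, assuming the Hugging Face transformers implementations of GPT-2 and ViT (the checkpoint names, preprocessing, and direct use of the embedding modules are illustrative assumptions, not the project's actual code):

```python
import torch
from transformers import GPT2Model, GPT2TokenizerFast, ViTModel, ViTImageProcessor
from PIL import Image

# Pre-trained backbones (assumed checkpoints: "gpt2" and "google/vit-base-patch16-224")
gpt2 = GPT2Model.from_pretrained("gpt2")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")

caption = "a dog running on the beach"
image = Image.open("example.jpg").convert("RGB")

# Text embeddings: GPT-2 token embeddings plus positional embeddings
input_ids = tokenizer(caption, return_tensors="pt").input_ids
positions = torch.arange(input_ids.shape[1]).unsqueeze(0)
text_hidden = gpt2.wte(input_ids) + gpt2.wpe(positions)       # (1, T, 768)

# Image embeddings: patch (plus CLS) embeddings from the ViT embedding module
pixel_values = processor(images=image, return_tensors="pt").pixel_values
image_hidden = vit.embeddings(pixel_values)                    # (1, P + 1, 768)
```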
The hidden states were then concatenated and passed through multi-head attention with a skip connection and batch normalization. We applied the masking strategy sketched below so that text-to-text attention was strictly causal, and also masked image-to-text attention so that text information could not enter the image hidden states and leak back into the text stream in later layers.
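One way to construct such a mask is shown below; the helper name and the assumption that text tokens precede image patches in the concatenated sequence are illustrative, not taken from the project code.

```python
import torch

def build_fusion_attention_mask(num_text: int, num_image: int) -> torch.Tensor:
    """Boolean mask over the concatenated [text; image] sequence (True = allowed).

    Text-to-text attention is strictly causal, image-to-image attention is
    unrestricted, text may attend to image patches, and image-to-text
    attention is blocked so text information cannot flow into the image
    stream and leak back to the text stream in later layers.
    """
    total = num_text + num_image
    mask = torch.zeros(total, total, dtype=torch.bool)
    # Text queries -> text keys: causal (lower triangular)
    mask[:num_text, :num_text] = torch.tril(
        torch.ones(num_text, num_text, dtype=torch.bool)
    )
    # Text queries -> image keys: allowed
    mask[:num_text, num_text:] = True
    # Image queries -> image keys: allowed; image -> text stays False (masked)
    mask[num_text:, num_text:] = True
    return mask
```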
We then sent the hidden state through an MLP with another skip connection and batch norm. Finally, the hidden state was split back into its text and image parts so each could be sent through the corresponding pre-trained layers in the next round. This is repeated 12 times, since GPT-2 base and ViT-Base each have 12 layers.
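A sketch of one such fusion block under these assumptions (the class name, hidden size of 768 matching both backbones, and the use of `nn.MultiheadAttention` and `nn.BatchNorm1d` are our own illustrative choices):

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """One cross-modal fusion step: concatenate, attend, MLP, split back."""

    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.BatchNorm1d(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norm2 = nn.BatchNorm1d(dim)

    def forward(self, text_h, image_h, allowed_mask):
        # Concatenate text and image hidden states along the sequence axis
        h = torch.cat([text_h, image_h], dim=1)
        # Multi-head attention with a skip connection and batch norm.
        # nn.MultiheadAttention treats True as "blocked", so invert the mask.
        attn_out, _ = self.attn(h, h, h, attn_mask=~allowed_mask)
        h = self.norm1((h + attn_out).transpose(1, 2)).transpose(1, 2)
        # MLP with another skip connection and batch norm
        h = self.norm2((h + self.mlp(h)).transpose(1, 2)).transpose(1, 2)
        # Split back so each modality passes through its own pre-trained layer next round
        return h[:, : text_h.shape[1]], h[:, text_h.shape[1]:]

# Hypothetical interleaving with the 12 pre-trained layers of each backbone:
# for gpt2_block, vit_block, fusion in zip(gpt2.h, vit.encoder.layer, fusion_blocks):
#     text_h = gpt2_block(text_h)[0]
#     image_h = vit_block(image_h)[0]
#     text_h, image_h = fusion(text_h, image_h, allowed_mask)
```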
For generation, the image hidden state is ignored and the text hidden state is passed through the pre-trained GPT-2 language-modelling head. This produces a sequence of token logits; we softmax the prediction at the last position and perform simple multinomial sampling to choose the next token.
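A minimal sketch of that sampling step, assuming `lm_head` is the pre-trained GPT-2 language-modelling head (a linear projection from hidden size to vocabulary size) and that the temperature parameter is our own addition:

```python
import torch

@torch.no_grad()
def sample_next_token(text_hidden: torch.Tensor, lm_head, temperature: float = 1.0) -> int:
    """Sample the next caption token from the text hidden state."""
    logits = lm_head(text_hidden)             # (1, T, vocab_size)
    last = logits[:, -1, :] / temperature     # keep only the final position
    probs = torch.softmax(last, dim=-1)       # turn logits into a distribution
    next_id = torch.multinomial(probs, num_samples=1)  # simple multinomial sampling
    return next_id.item()
```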
This model and architecture gave us the following results:
| Metric | Score |
|---|---|
| BLEU-1 | 63.6% |
| BLEU-2 | 42.0% |
| BLEU-3 | 27.6% |
| CIDEr | 60.3% |