This paper presents an integration of natural language processing and computer vision that improves the syntax of the language generated when describing objects in images. The goal was not only to identify the objects in an image, but also to capture the interactions and activities occurring between them. We implemented a multi-modal neural network that combines convolutional and recurrent neural network architectures and is trained to maximize the likelihood of word combinations given a training image. The outcome was an image captioning model that leveraged transfer learning techniques for its architecture components. Our contribution was to quantify the effectiveness of transfer learning schemes for encoders and decoders in order to determine which were best at improving syntactic relationships. We found the combination of ResNet feature extraction and fine-tuned BERT word embeddings to be the best-performing architecture across two datasets, a valuable finding for those continuing this work given the compute cost of these complex models.

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial 4.0 License.