This paper proposes a method for integrating natural language understanding into image classification, using associated metadata to improve classification accuracy. Traditionally, only image features have been used in the classification process; however, images from many sources are accompanied by metadata. This study implemented a multi-modal image classification model that combines convolutional image features with natural language understanding of descriptions, titles, and tags. The novelty of this approach lies in learning from additional external features associated with the images, using natural language understanding with transfer learning. The combination of ResNet-50 image feature extraction and Universal Sentence Encoder embeddings yielded a Top 5 error rate of 73.05% and a Top 1 error rate of 54.65%, an improvement of 1.56% over benchmark results. This suggests that external text features, when available, can aid image classification.
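The fusion described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes the multi-modal representation is a simple concatenation of a 2048-dimensional ResNet-50 pooled feature vector and a 512-dimensional Universal Sentence Encoder embedding (the standard output sizes of those models), here replaced by random stand-in vectors, followed by a linear classification head:

```python
import numpy as np

# Standard output dimensions: ResNet-50's pooled features are 2048-d,
# the Universal Sentence Encoder emits 512-d sentence embeddings.
# NUM_CLASSES is a hypothetical class count for illustration.
IMG_DIM, TXT_DIM, NUM_CLASSES = 2048, 512, 10

def fuse_features(img_feat, txt_feat):
    """Concatenate image and text features into one multi-modal vector."""
    return np.concatenate([img_feat, txt_feat])

rng = np.random.default_rng(0)

# Stand-ins for the two pretrained extractors (in the paper these would
# be ResNet-50 activations and Universal Sentence Encoder embeddings).
img_feat = rng.standard_normal(IMG_DIM)
txt_feat = rng.standard_normal(TXT_DIM)

fused = fuse_features(img_feat, txt_feat)   # shape: (2560,)

# A linear classification head over the fused representation, with a
# softmax to turn logits into class probabilities.
W = rng.standard_normal((NUM_CLASSES, IMG_DIM + TXT_DIM))
logits = W @ fused
probs = np.exp(logits - logits.max())
probs /= probs.sum()
pred = int(np.argmax(probs))
```

In practice the head would be trained jointly (or on top of frozen pretrained extractors, as transfer learning suggests), but the sketch shows the key idea: the text embedding simply widens the feature vector the classifier sees.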
Miller, Stuart J.; Howard, Justin; Adams, Paul; Schwan, Mel; and Slater, Robert, "Multi-Modal Classification Using Images and Text," SMU Data Science Review: Vol. 3, No. 3, Article 6.
Available at: https://scholar.smu.edu/datasciencereview/vol3/iss3/6
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.