Abstract
Image captioning learns a relationship between visual and textual information in order to generate a caption as a sequence of words. Transformers handle machine translation and language comprehension using an encoder-decoder structure. With the aim of building a lightweight, deployment-friendly model, we present the Lightweight Transformer with a GRU-integrated decoder for image captioning. In the presented model, the multiple encoders and decoders of the standard architecture are reduced to a single encoder and a GRU-integrated decoder. In addition, rich multi-level visual features from InceptionV3 improve the encoding performance of the single-unit encoder. To validate the efficiency of the proposed Lightweight Transformer architecture, extensive experiments are carried out on the MSCOCO image captioning dataset. The model achieves appreciable performance in comparison with other state-of-the-art models.

