[Paper Reading] Image Captioning using Deep Neural Architectures (arXiv: 1801.05568v1)
2024-08-27 06:37:49
Main Contributions:
- A brief introduction to two families of methods for the image captioning task: retrieval-based methods and generative methods.
- The authors implemented the classical Show and Tell model and analyzed it based on their experiments.
Excerpts:
- To achieve this goal, the Show & Tell model is created by hybridizing two different models. It takes the image as input and feeds it into the Inception-v3 model. At the end of the Inception-v3 model, a single fully connected layer is added. This layer transforms the output of Inception-v3 into a word embedding vector, which is then fed into a series of LSTM cells.
- For any given caption, we add two additional symbols as the start word and the stop word. Whenever the stop word is encountered, sentence generation stops, marking the end of the string.
- Show & Tell model uses Beam Search to find suitable words to generate captions.
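The start/stop-word convention in the excerpt above can be sketched as a simple decoding loop. This is a toy illustration, not the paper's code: `next_word` is a hypothetical stand-in for one LSTM decoding step, and the real model would also condition on the image features.

```python
# Greedy decoding loop illustrating the start/stop-word convention:
# the decoder is seeded with a start symbol and emits words until it
# produces the stop symbol (or hits a length cap).

START, STOP = "<s>", "</s>"

def generate_caption(next_word, max_len=20):
    words = [START]
    for _ in range(max_len):
        w = next_word(words)
        words.append(w)
        if w == STOP:  # stop word encountered: the caption is complete
            break
    # Strip the start symbol (and stop symbol, if emitted) from the result.
    return words[1:-1] if words[-1] == STOP else words[1:]

# Toy next-word function standing in for the trained LSTM decoder.
transitions = {"<s>": "a", "a": "giraffe", "giraffe": "standing", "standing": "</s>"}
caption = generate_caption(lambda ws: transitions[ws[-1]])
print(" ".join(caption))  # a giraffe standing
```

In the actual model the `next_word` step would be an LSTM cell whose initial state is derived from the Inception-v3 image embedding.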
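Beam search, mentioned in the last excerpt, keeps the `beam_width` most probable partial captions at each step instead of committing greedily to one word. A minimal sketch, assuming a toy `next_probs` function in place of the LSTM's next-word distribution:

```python
import math

def beam_search(next_probs, start, stop, beam_width=3, max_len=10):
    """Keep the beam_width most probable partial captions at each step,
    expanding each unfinished one with every candidate next word.
    next_probs(seq) returns a dict {word: probability}."""
    beams = [(0.0, [start])]  # each entry: (cumulative log-probability, word sequence)
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            if seq[-1] == stop:  # finished captions are carried over unchanged
                candidates.append((logp, seq))
                continue
            for word, p in next_probs(seq).items():
                candidates.append((logp + math.log(p), seq + [word]))
        # Prune to the beam_width highest-scoring partial captions.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(seq[-1] == stop for _, seq in beams):
            break
    return beams[0][1]

# Toy bigram language model standing in for the decoder's distribution.
toy_lm = {
    "<s>": {"a": 0.7, "the": 0.3},
    "a": {"cat": 0.6, "dog": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"</s>": 0.9, "sat": 0.1},
    "dog": {"</s>": 0.9, "ran": 0.1},
    "sat": {"</s>": 1.0},
    "ran": {"</s>": 1.0},
}

best = beam_search(lambda seq: toy_lm[seq[-1]], "<s>", "</s>")
print(" ".join(best))  # <s> a cat </s>
```

With `beam_width=1` this degenerates to greedy decoding; wider beams trade compute for captions with higher overall probability.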