
2019, Vol. 7, Issue 2

Research Article

30 June 2019, pp. 43-51
Abstract
Artificial intelligence, robotics, the Internet of Things, big data, and self-driving systems are the major technology groups of the fourth industrial revolution. These technologies deal with large volumes of unstructured data (such as images) and streaming data, and the amount of unstructured data generated by social networking service (SNS) users continues to grow. In this paper, we build a machine learning model for image data, a topic widely studied in artificial intelligence, deep learning, and vision processing, and conduct experiments on image recognition and caption generation. The proposed model consists of three parts: a chatbot (a Facebook app), a monitor server, and a model server. Queries are classified into images and natural-language sentences, which are handled by the model server's 'Captioning Model' and 'VQA Model', respectively. The open training dataset used for machine learning is MSCOCO 2017, and the caption-generation experiments achieved a perplexity of 8.9, a strong result.
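To make the dispatch concrete, the sketch below illustrates how a model server might route chatbot queries to the two models. This is a minimal sketch under stated assumptions: the class and method names (ModelServer, generate_caption, answer) are invented here, not the paper's API, and answering a natural-language question against the most recently received image is an assumption, not something specified in the abstract.

```python
# Minimal sketch of the query routing described in the abstract.
# ModelServer, generate_caption, and answer are hypothetical names.

class ModelServer:
    def __init__(self, captioning_model, vqa_model):
        self.captioning_model = captioning_model  # image -> caption text
        self.vqa_model = vqa_model                # (image, question) -> answer text
        self.last_image = None                    # assumed context for VQA follow-ups

    def handle_query(self, image=None, text=None):
        """Route an image query to the Captioning Model and a
        natural-language query to the VQA Model."""
        if image is not None:
            self.last_image = image
            return self.captioning_model.generate_caption(image)
        if text is not None and self.last_image is not None:
            return self.vqa_model.answer(self.last_image, text)
        raise ValueError("Expected an image, or a question about a prior image.")
```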
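For context on the training data, MSCOCO 2017 caption annotations are typically loaded with the pycocotools package; the sketch below assumes the official captions_train2017.json file has been downloaded into a local annotations/ directory.

```python
# Sketch: iterating MSCOCO 2017 caption annotations with pycocotools.
from pycocotools.coco import COCO

coco = COCO("annotations/captions_train2017.json")  # assumed local path
img_ids = coco.getImgIds()
print(f"{len(img_ids)} training images")

# Each image carries roughly five human-written reference captions,
# which serve as training targets for the captioning model.
ann_ids = coco.getAnnIds(imgIds=img_ids[0])
for ann in coco.loadAnns(ann_ids):
    print(ann["caption"])
```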
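The reported metric, perplexity, is conventionally the exponentiated average negative log-likelihood of the generated token sequence; the formulation below is the standard definition (conditioning on the image I reflects the captioning setting, and the paper's exact formulation may differ):

```latex
\mathrm{PP}(w_1,\dots,w_N)
  = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}
      \log p\bigl(w_i \mid w_1,\dots,w_{i-1},\, I\bigr)\right)
```

On this scale, a perplexity of 8.9 means the model's average per-word uncertainty is comparable to a uniform choice among about nine candidate words.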
Information
  • Publisher: The Society of Convergence Knowledge
  • Publisher (Ko): 융복합지식학회
  • Journal Title: The Society of Convergence Knowledge Transactions
  • Journal Title (Ko): 융복합지식학회논문지
  • Volume: 7
  • No.: 2
  • Pages: 43-51