
2020, Vol. 8, Issue 1

Research Article

31 March 2020, pp. 19-26
Abstract
In this paper, we present a comparative study of language models based on long short-term memory (LSTM) and propose an LSTM language model that uses GloVe (Global Vectors) as its word representation. First, a traditional n-gram statistical language model is compared with an LSTM language model on the PTB English corpus; the LSTM model reduces perplexity (ppl) by 47.3% relative to the n-gram baseline. To extend the approach to Korean, we design a language model whose basic token unit is produced by the word-piece model (WPM), and again compare the statistical n-gram language model with the neural language model. In particular, we propose using GloVe vectors as the word representation of the LSTM language model. On a test set of 100,000 Korean sentences, the LSTM language model reduces perplexity by 28.8% relative to the n-gram model, and the LSTM model combined with GloVe reduces it by 43.4%. The experiments on both the English and Korean corpora show that the proposed GloVe-based LSTM language model is an effective approach.
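
The percentages above are relative perplexity reductions. As a reminder (these are the standard definitions, not spelled out in the abstract), the perplexity of a model over a held-out corpus of $N$ tokens, and the relative reduction of model $A$ over baseline $B$, are

$$
\mathrm{ppl} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log P(w_i \mid w_1,\dots,w_{i-1})\right),
\qquad
\Delta = \frac{\mathrm{ppl}_{B}-\mathrm{ppl}_{A}}{\mathrm{ppl}_{B}}\times 100\%.
$$

For example, with a hypothetical n-gram baseline at ppl 141 and an LSTM model at ppl 74.3, the reduction is (141 − 74.3)/141 ≈ 47.3%, the figure reported for PTB; the paper's actual perplexity values are not given in this preview.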
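For readers who want to see the shape of the proposed model, below is a minimal PyTorch sketch of an LSTM language model whose embedding layer is initialized from pretrained GloVe vectors, together with a helper that computes held-out perplexity. All names, layer sizes, and the single-layer configuration are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    # Minimal sketch: embedding (optionally GloVe-initialized) -> LSTM -> logits over vocab.
    def __init__(self, vocab_size, embed_dim, hidden_dim, glove_weights=None):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        if glove_weights is not None:
            # glove_weights: a (vocab_size, embed_dim) tensor of pretrained GloVe
            # vectors, aligned with the tokenizer's (e.g., WPM) vocabulary.
            self.embed.weight.data.copy_(glove_weights)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids, hidden=None):
        x = self.embed(token_ids)           # (batch, seq_len, embed_dim)
        out, hidden = self.lstm(x, hidden)  # (batch, seq_len, hidden_dim)
        return self.proj(out), hidden       # logits over the vocabulary

def perplexity(model, token_ids):
    # Perplexity = exp(mean token-level cross-entropy) on next-token prediction.
    model.eval()
    with torch.no_grad():
        logits, _ = model(token_ids[:, :-1])
        loss = nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            token_ids[:, 1:].reshape(-1),
        )
    return torch.exp(loss).item()

# Illustrative usage with random data (sizes are assumptions, not from the paper):
model = LSTMLanguageModel(vocab_size=10000, embed_dim=300, hidden_dim=512)
batch = torch.randint(0, 10000, (8, 32))   # 8 sequences of 32 WPM token ids
print(perplexity(model, batch))
```

Training would minimize the same cross-entropy loss; whether the GloVe embeddings are kept frozen or fine-tuned during training is a design choice the preview does not specify.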
Information
  • Publisher: The Society of Convergence Knowledge
  • Publisher (Ko): 융복합지식학회
  • Journal Title: The Society of Convergence Knowledge Transactions
  • Journal Title (Ko): 융복합지식학회논문지
  • Volume: 8
  • No.: 1
  • Pages: 19-26