
2020, Vol. 8, Issue 4

Research Article

31 December 2020. pp. 31-38
Abstract
In this paper, we study attention-based Korean language models. A representative attention model is the transformer, which performs self-attention. Although a transformer consists of an encoder and a decoder, the decoder alone is generally used for language modeling. For the Korean experiments, we build SentencePiece models to obtain the basic token units. On 600,000 sentences of the AI-Hub Korean evaluation corpus (https://www.aihub.org.kr), the language model built on 5,000 SentencePiece tokens reduces perplexity by 33.4% compared with the model built on 10,000 tokens. A Korean speech recognition experiment further shows that the model with 5,000 SentencePiece tokens, which has the lower perplexity, also gives better recognition performance.
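As a rough sketch of the tokenization setup described in the abstract (the corpus file name, model prefixes, unigram model type, and character-coverage value below are illustrative assumptions, not the authors' reported configuration), the two SentencePiece vocabularies could be built as follows:

import sentencepiece as spm

# Train subword models with 5,000 and 10,000 tokens; "corpus.txt" stands in
# for the training text (e.g. sentences drawn from the AI-Hub corpus).
for vocab_size in (5000, 10000):
    spm.SentencePieceTrainer.train(
        input="corpus.txt",               # one sentence per line
        model_prefix=f"spm_{vocab_size}",
        vocab_size=vocab_size,
        model_type="unigram",             # SentencePiece's default subword algorithm
        character_coverage=0.9995,        # a common setting for Korean/CJK text
    )

# Tokenize with the 5,000-token model; the resulting subword sequences would
# feed a transformer-decoder language model, whose quality is compared via
# perplexity = exp(mean negative log-likelihood per token).
sp = spm.SentencePieceProcessor(model_file="spm_5000.model")
print(sp.encode("음성인식을 위한 한국어 언어모델", out_type=str))

Under this kind of setup, a transformer-decoder language model would be trained on each tokenization and the two compared by perplexity on held-out text.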
Information
  • Publisher: The Society of Convergence Knowledge
  • Publisher (Ko): 융복합지식학회
  • Journal Title: The Society of Convergence Knowledge Transactions
  • Journal Title (Ko): 융복합지식학회논문지
  • Volume: 8
  • No.: 4
  • Pages: 31-38