2024 Vol.12, Issue 3

Research Article

30 September 2024. pp. 125-133
Abstract
This paper compares the performance of several methods for computing sentence similarity in the TextRank algorithm, which is widely used in extractive Korean document summarization systems. The evaluation uses news articles and editorials from AI Hub as the dataset, so that the effect of document type can be analyzed. The dataset is further divided into two groups by the number of sentences per document to analyze the effect of document length. The experimental results show that the ROUGE score is highest for the LSA and TF-IDF methods, and that document length has little effect. The execution time for generating a summary is shortest with the LSA method, at one third of that of the TF-IDF method. We attribute this to the fact that the LSA method not only reduces the amount of computation through dimensionality reduction but can also take semantic similarity into account. We conclude that the LSA method performs best for TextRank-based extractive Korean document summarization.
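
The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes whitespace tokenization (Korean text would normally pass through a morphological analyzer first), a smoothed IDF weight, a latent dimension of k = 2, and a damping factor of 0.85. All function names are hypothetical. Negative cosine values, which can occur after the SVD projection, are treated as "no edge"; that is also an assumption.

```python
import math
from collections import Counter

import numpy as np


def tfidf_matrix(sentences):
    """Sentence-by-term TF-IDF matrix (whitespace tokens are an assumption)."""
    tokenized = [s.lower().split() for s in sentences]
    vocab = sorted({t for toks in tokenized for t in toks})
    index = {t: j for j, t in enumerate(vocab)}
    n = len(sentences)
    df = Counter(t for toks in tokenized for t in set(toks))
    X = np.zeros((n, len(vocab)))
    for i, toks in enumerate(tokenized):
        for t, c in Counter(toks).items():
            # Raw term count times a smoothed inverse document frequency.
            X[i, index[t]] = c * (math.log((1 + n) / (1 + df[t])) + 1.0)
    return X


def lsa_embed(X, k=2):
    """Project sentences onto the top-k singular directions (truncated SVD)."""
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    k = min(k, len(S))
    return U[:, :k] * S[:k]


def cosine_similarity_matrix(V):
    """Pairwise cosine similarity with self-loops and negative edges removed."""
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    unit = V / norms
    sim = np.clip(unit @ unit.T, 0.0, None)  # assumption: negative means no edge
    np.fill_diagonal(sim, 0.0)
    return sim


def textrank(sim, damping=0.85, iters=100, tol=1e-8):
    """Weighted PageRank over the sentence graph by power iteration."""
    n = sim.shape[0]
    col_sums = sim.sum(axis=0)
    safe = np.where(col_sums > 0, col_sums, 1.0)
    # Column-normalize; a sentence with no edges spreads its weight uniformly.
    P = np.where(col_sums > 0, sim / safe, 1.0 / n)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        new = (1 - damping) / n + damping * (P @ r)
        converged = np.abs(new - r).sum() < tol
        r = new
        if converged:
            break
    return r


def summarize(sentences, top_k=2, use_lsa=True, k=2):
    """Return the top_k ranked sentences in their original order."""
    X = tfidf_matrix(sentences)
    V = lsa_embed(X, k) if use_lsa else X
    scores = textrank(cosine_similarity_matrix(V))
    top = sorted(np.argsort(-scores)[:top_k])
    return [sentences[i] for i in top]


if __name__ == "__main__":
    doc = [
        "the central bank raised interest rates again",
        "higher interest rates slow down the housing market",
        "the housing market reacted to the bank decision",
        "a local festival opened on the river bank",
    ]
    print(summarize(doc, top_k=2, use_lsa=False))  # TF-IDF similarity
    print(summarize(doc, top_k=2, use_lsa=True))   # LSA similarity
```

In this sketch the two variants differ only in the vectors passed to the cosine step: the TF-IDF variant compares sentences over the full vocabulary, while the LSA variant compares them in a k-dimensional latent space. The sketch makes no claim about ROUGE scores or execution time.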

Information
  • Publisher: The Society of Convergence Knowledge
  • Publisher (Ko): 융복합지식학회
  • Journal Title: The Society of Convergence Knowledge Transactions
  • Journal Title (Ko): 융복합지식학회논문지
  • Volume: 12
  • No: 3
  • Pages: 125-133