A Study on the Multiple Imputation of Missing Values: Focus on Fine Dust Data

doi:10.22716/sckt.2020.8.4.044

2020 Vol. 8, Issue 4

Research Article

Jaehyun Kim¹

¹Department of Computer Engineering, Seokyeong University

31 December 2020. pp. 149-156

Abstract

The major issues of the Fourth Industrial Revolution are big data and artificial intelligence. Various types of data from IoT sensors, social networking services (SNS), and financial transactions are automatically generated, stored, processed, and analyzed, and are used for accurate prediction. However, analyzing incomplete data that contain missing values produces biased estimates and, in turn, erroneous conclusions. This study reviews missing-value mechanisms and missing-value patterns and examines both classical imputation methods and multiple imputation. As an empirical study, multiple imputation with the MICE package of the statistical language R was simulated on 48,192 time-series observations of fine dust (PM10) and ultrafine dust (PM2.5) containing missing values, varying the number of imputed datasets. The results indicated that multiple imputation using 20 imputed datasets was appropriate, and a comparison with the original data after removing missing values showed that the imputed results were consistent with the parameters of the original data.
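The pooling step that multiple imputation relies on can be sketched in a few lines. The example below is a minimal illustration in Python, not the R MICE workflow used in the study: it applies Rubin's rules to combine estimates obtained from m imputed datasets. The function name `pool_rubin` and the toy numbers are assumptions made for illustration and are not taken from the paper.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine per-dataset estimates with Rubin's rules.

    estimates: point estimates (e.g. a mean) from each of the m imputed datasets
    variances: squared standard errors of those estimates
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need m >= 2 estimates with matching variances")
    q_bar = mean(estimates)   # pooled point estimate
    w = mean(variances)       # within-imputation variance
    b = variance(estimates)   # between-imputation variance (divides by m - 1)
    t = w + (1 + 1 / m) * b   # total variance of the pooled estimate
    return q_bar, t


# Hypothetical values: a mean PM10 level estimated on m = 4 imputed datasets.
est = [40.0, 42.0, 41.0, 41.0]
var = [1.0, 1.0, 1.0, 1.0]
pooled, total_var = pool_rubin(est, var)
```

The factor (1 + 1/m) shrinks toward 1 as m grows, which is one reason the number of imputed datasets affects the precision of the pooled result.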

Keywords

Missing values

Multiple imputation

Big data

Particulate matter (PM) data

MICE

Information

Publisher: The Society of Convergence Knowledge
Publisher (Ko): 융복합지식학회
Journal Title: The Society of Convergence Knowledge Transactions
Journal Title (Ko): 융복합지식학회논문지
Volume: 8
No: 4
Pages: 149-156
DOI: https://doi.org/10.22716/sckt.2020.8.4.044

The Society of Convergence Knowledge Transactions (융복합지식학회논문지), ISSN: 2287-8920 (Print)
