Research Article

30 September 2024, pp. 81-103
Abstract
Recently, as deep learning technology has been adopted in classification systems across many fields, research on maximizing the performance of deep learning models has become increasingly active. The performance of a deep learning model depends heavily on the amount and quality of its training data, and the deeper and more complex the model, the more training data it requires. When training data are insufficient or imbalanced across classes, overfitting occurs and performance degrades. In fields that work with speech and audio, expanding the training data and resolving class imbalance are therefore key issues for classification performance, and research addressing them is essential. In this paper, we propose W2GAN-GP, an augmentation model that effectively augments one-dimensional audio signals by adapting the WGAN model originally devised for two-dimensional image augmentation. We further propose a dual audio-signal augmentation technique: the raw signal is first augmented using a time-frequency transform, and a second augmentation is then performed with the proposed W2GAN-GP model. To validate the audio data generated by the proposed technique, classification accuracy was measured using ResNet50 and DenseNet classification models. The results confirm an accuracy improvement of approximately 27-30% compared with training without augmentation.
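The abstract does not give implementation details, but the W2GAN-GP model builds on the WGAN-GP formulation (refs. 37, 38), whose critic objective is standardly written as

$$\mathcal{L}_D = \mathbb{E}_{\tilde{x}\sim\mathbb{P}_g}\big[D(\tilde{x})\big] - \mathbb{E}_{x\sim\mathbb{P}_r}\big[D(x)\big] + \lambda\,\mathbb{E}_{\hat{x}\sim\mathbb{P}_{\hat{x}}}\Big[\big(\lVert\nabla_{\hat{x}}D(\hat{x})\rVert_2 - 1\big)^2\Big],$$

where $\mathbb{P}_r$ and $\mathbb{P}_g$ are the real and generated data distributions, $\hat{x}$ is sampled uniformly along lines between real and generated samples, and $\lambda$ weights the gradient penalty.

The sketch below illustrates the two ingredients named in the abstract for 1-D signals: a time-frequency first-stage perturbation and the WGAN-GP gradient penalty. It is a minimal sketch only, assuming a PyTorch implementation; the function names, the choice of perturbation, and the critic interface are illustrative assumptions, not the authors' code.

```python
import torch

def stage1_tf_augment(wave: torch.Tensor, n_fft: int = 512) -> torch.Tensor:
    """First-stage augmentation sketch: perturb the signal in the
    time-frequency domain, then invert back to a 1-D waveform.
    The specific perturbation (random magnitude jitter) is an
    illustrative assumption, not the paper's exact transform."""
    window = torch.hann_window(n_fft, device=wave.device)
    spec = torch.stft(wave, n_fft=n_fft, window=window, return_complex=True)
    spec = spec * (1.0 + 0.05 * torch.randn_like(spec.real))  # mild random jitter
    return torch.istft(spec, n_fft=n_fft, window=window, length=wave.shape[-1])

def gradient_penalty(critic, real: torch.Tensor, fake: torch.Tensor) -> torch.Tensor:
    """Standard WGAN-GP gradient penalty (refs. 37-38) applied to
    1-D audio batches of shape (batch, 1, n_samples)."""
    batch = real.size(0)
    eps = torch.rand(batch, 1, 1, device=real.device)  # per-sample mixing weights
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(interp)
    grads = torch.autograd.grad(
        outputs=scores, inputs=interp,
        grad_outputs=torch.ones_like(scores),
        create_graph=True, retain_graph=True,
    )[0].reshape(batch, -1)
    return ((grads.norm(2, dim=1) - 1.0) ** 2).mean()
```

In training, the penalty term would be added to the critic loss with weight λ (ref. 38 uses λ = 10), while the first-stage transform expands the real-data pool fed to the GAN and to the downstream ResNet50/DenseNet classifiers.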
References
  1. L. Perez and J. Wang, "The effectiveness of data augmentation in image classification using deep learning", CoRR, pp. 1-8, 2017.

  2. O. Abayomi-Alli, R. Damaševičius, A. Qazi, M. Adedoyin-Olowe, and S. Misra, "Data augmentation and deep learning methods in sound classification: a systematic review", Electronics, Vol. 11, No. 22, 3795, pp. 1-32, 2022.

    10.3390/electronics11223795
  3. H. Chu, Y. Zhang, and H. Chiang, "A CNN Sound Classification Mechanism Using Data Augmentation", Sensors (Basel), Vol. 23, No. 15, 6972, 2023.

    10.3390/s23156972
  4. W. Han, Z. Zhang, Y. Zhang, J. Yu, C.-C. Chiu, J. Qin, A. Gulati, R. Pang, and Y. Wu, "ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context", In Proceedings of the Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Shanghai, pp. 3610-3614, 2020.

    10.21437/Interspeech.2020-2059
  5. M. Orken, O. Dina, A. Keylan, T. Tolganay, and O. Mohamed, "A study of transformer-based end-to-end speech recognition system for Kazakh language", Sci. Rep., Vol. 12, 8337, 2022.

    10.1038/s41598-022-12260-y
  6. Y. Zhang, "Music Recommendation System and Recommendation Model Based on Convolutional Neural Network", Mob. Inf. Syst., 3387598, 2022.

    10.1155/2022/3387598
  7. K. Huang, H. Qin, X. Zhang, and H. Zhang, "Music recommender system based on graph convolutional neural networks with attention mechanism", Neural Netw., Vol. 135, pp. 107-117, 2021.

  8. M. Green and D. Murphy, "Environmental sound monitoring using machine learning on mobile devices", Appl. Acoust., Vol. 159, 107041, 2020.

    10.1016/j.apacoust.2019.107041
  9. A. Nogueira, H. Oliveira, J. Machado, and J. Tavares, "Sound Classification and Processing of Urban Environments: A Systematic Literature Review", Sensors, Vol. 22, 8608, 2022.

    10.3390/s22228608
  10. T. Nishida, K. Dohi, T. Endo, M. Yamamoto, and Y. Kawaguchi, "Anomalous Sound Detection Based on Machine Activity Detection", In Proceedings of the 2022 30th European Signal Processing Conference (EUSIPCO), pp. 269-273, 2022.

    10.23919/EUSIPCO55093.2022.9909901
  11. Y. Wang, Y. Zheng, Y. Zhang, Y. Xie, S. Xu, Y. Hu, and L. He, "Unsupervised Anomalous Sound Detection for Machine Condition Monitoring Using Classification-Based Methods", Appl. Sci., Vol. 11, 11128, 2021.

    10.3390/app112311128
  12. M. Crocco, M. Cristani, A. Trucco, and V. Murino, "Audio surveillance: A systematic review", ACM Comput. Surv., Vol. 48, pp. 1-46, 2016.

    10.1145/2871183
  13. Y. Leng, W. Zhao, C. Lin, C. Sun, R. Wang, Q. Yuan, and D. Li, "LDA-based data augmentation algorithm for acoustic scene classification", Knowl.-Based Syst., Vol. 195, 105600, 2020.

    10.1016/j.knosys.2020.105600
  14. M. Lech, M. Stolar, C. Best, and R. Bolia, "Real-Time Speech Emotion Recognition Using a Pre-trained Image Classification Network: Effects of Bandwidth Reduction and Companding", Frontiers in Computer Science, Vol. 2, No. 14, pp. 1-14, 2020.

    10.3389/fcomp.2020.00014
  15. N. Takahashi, M. Gygli, and L. Van Gool, "AENet: Learning deep audio features for video analysis", IEEE Trans. Multimedia, Vol. 20, pp. 513-524, 2018.

    10.1109/TMM.2017.2751969
  16. H. Liu, Z. Chen, Y. Yuan, X. Mei, X. Liu, D. Mandic, W. Wang, and M. D. Plumbley, "AudioLDM: Text-to-Audio Generation with Latent Diffusion Models", International Conference on Machine Learning (ICML), pp. 1-25, 2023.

  17. H. Alonso, M. Barragán Pulido, J. Gil Bordón, M. Ferrer Ballester, and C. Travieso González, "Speech evaluation of patients with Alzheimer's disease using an automatic interviewer", Expert Syst. Appl., Vol. 192, 116386, 2022.

    10.1016/j.eswa.2021.116386
  18. Y. Jeong, J. Kim, D. Kim, J. Kim, and K. Lee, "Methods for improving deep learning-based cardiac auscultation accuracy: Data augmentation and data generalization", Appl. Sci., Vol. 11, 4544, 2021.

    10.3390/app11104544
  19. Y. Sun, A. Wong, and M. Kamel, "Classification of Imbalanced Data: A Review", Int. J. Pattern Recognit. Artif. Intell., Vol. 23, pp. 687-719, 2009.

    10.1142/S0218001409007326
  20. L. Ferreira-Paiva, E. Alfaro-Espinoza, V. Almeida, L. Felix, and R. Neves, "A Survey of Data Augmentation for Audio Classification", Sociedade Brasileira de Automática (SBA), Vol. 3, No. 1, pp. 2165-2172, 2022.

    10.20906/CBA2022/3469
  21. H. Purwins, B. Li, T. Virtanen, J. Schlüter, S. Y. Chang, and T. Sainath, "Deep Learning for Audio Signal Processing", in IEEE Journal of Selected Topics in Signal Processing, Vol. 13, No. 2, pp. 206-219, 2019.

    10.1109/JSTSP.2019.2908700
  22. D. S. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition", Proc. Interspeech 2019, pp. 2613-2617, 2019.

    10.21437/Interspeech.2019-2680
  23. A. Jain, P. R. Samala, D. Mittal, P. Jyothi, and M. Singh, "SPLICEOUT: A Simple and Efficient Audio Augmentation Method", Proc. Interspeech 2022, pp. 2678-2682, 2022.

    10.21437/Interspeech.2022-572
  24. S. Yun, D. Han, S. Chun, S. Oh, Y. Yoo, and J. Choe, "CutMix: Regularization strategy to train strong classifiers with localizable features", In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 6022-6031, 2019.

    10.1109/ICCV.2019.00612
  25. G. Kim, D. K. Han, and H. Ko, "SpecMix: Data Augmentation for Speech Recognition", Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 1, pp. 6-10, 2021.

  26. H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, "mixup: Beyond empirical risk minimization", In International Conference on Learning Representations (ICLR), pp. 1-13, 2018.

  27. C. Donahue, J. McAuley, and M. Puckette, "Adversarial Audio Synthesis", International Conference on Learning Representations (ICLR), pp. 1-15, 2018.

  28. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial networks", in Advances in Neural Information Processing Systems (NIPS), pp. 2672-2680, 2014.

  29. K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brebisson, Y. Bengio, and A. Courville, "MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis", Advances in Neural Information Processing Systems (NeurIPS), pp. 14910-14921, 2019.

  30. H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, "CycleGAN-VC: Non-parallel Voice Conversion using Cycle-Consistent Adversarial Networks," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5279-5283, 2018.

    10.1109/ICASSP.2018.8462342
  31. A. Mustafa, N. Pia, and G. Fuchs, "StyleMelGAN: An Efficient High-Fidelity Adversarial Vocoder with Temporal Adaptive Normalization," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6034-6038, 2021.

  32. C. Aironi, S. Cornell, L. Serafini, and S. Squartini, "A Time-Frequency Generative Adversarial based method for Audio Packet Loss Concealment", European Signal Processing Conference (EUSIPCO), pp. 1-5, 2023.

    10.23919/EUSIPCO58844.2023.10290027
  33. R. Prenger, R. Valle, and B. Catanzaro, "WaveGlow: A Flow-based Generative Network for Speech Synthesis", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3617-3621, 2019.

    10.1109/ICASSP.2019.8683143
  34. R. Yamamoto, E. Song, and J. Kim, "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6199-6203, 2020.

    10.1109/ICASSP40776.2020.9053795
  35. J. Kong, J. Kim, and J. Bae, "HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis", Conference on Neural Information Processing Systems (NeurIPS 2020), pp. 1-14, 2020.

  36. J. Engel, K. K. Agrawal, S. Chen, I. Gulrajani, C. Donahue, and A. Roberts, "GANSynth: Adversarial Neural Audio Synthesis", International Conference on Learning Representations (ICLR), pp. 1-17, 2019.

  37. M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein GAN", Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 214-223, 2017.

  38. I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville, "Improved training of wasserstein GANs", In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17), pp. 5769-5779, 2017.

  39. T. Karras, S. Laine, and T. Aila, "A Style-Based Generator Architecture for Generative Adversarial Networks", in IEEE Transactions on Pattern Analysis & Machine Intelligence, Vol. 43, No. 12, pp. 4217-4228, 2021.

    10.1109/TPAMI.2020.2970919
  40. P. Isola, J. Zhu, T. Zhou, and A. A. Efros, "Image-to-Image Translation with Conditional Adversarial Networks", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5967-5976, 2017.

    10.1109/CVPR.2017.632
  41. M. Mirza and S. Osindero, "Conditional Generative Adversarial Nets", arXiv preprint arXiv:1411.1784, pp. 1-7, 2014.

  42. D. P. Kingma and P. Dhariwal, "Glow: Generative flow with invertible 1x1 convolutions", Advances in Neural Information Processing Systems (NeurIPS 2018), pp. 1-10, 2018.

  43. A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio", Speech Synthesis Workshop, pp. 1-15, 2016.

  44. S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio, "SampleRNN: An unconditional end-to-end neural audio generation model", International Conference on Learning Representations (ICLR), 2017.

Information
  • Publisher: The Society of Convergence Knowledge
  • Publisher (Ko): 융복합지식학회
  • Journal Title: The Society of Convergence Knowledge Transactions
  • Journal Title (Ko): 융복합지식학회논문지
  • Volume: 12
  • No.: 3
  • Pages: 81-103