
2026, Vol. 14, Issue 1

Research Article

31 March 2026. pp. 99-120
Abstract
Audio-based deep learning applications face significant challenges in securing high-quality audio data due to data scarcity, high acquisition costs, and ethical and legal constraints. Existing generative models are limited by single-resolution, single-scale architectures, which hinder simultaneous modeling of short- and long-term temporal patterns in audio signals. This paper proposes MRMS-WaveGAN, a waveform-domain data augmentation model that integrates a multi-resolution generator and a multi-scale discriminator, built on WaveGAN, which generates audio directly from raw waveforms without spectrogram conversion and thereby avoids phase-information loss. The multi-resolution generator comprises three parallel branches at low, mid, and high resolutions. The low-resolution branch applies Self-Attention to capture long-range temporal dependencies beyond the receptive field of convolution; the mid-resolution branch uses Residual Blocks to preserve the macro-structural representation of the low-resolution path while adding intermediate-level acoustic features; and the high-resolution branch generates fine-grained waveform detail through stacked transposed convolutions. The outputs of the three branches are integrated by a fusion module, producing waveforms in which features at multiple time scales are reflected complementarily. The multi-scale discriminator evaluates inputs in parallel at the original, 2× downsampled, and 4× downsampled scales, providing multi-level training feedback ranging from waveform-level high-frequency artifact detection to long-term structural consistency assessment, with Spectral Normalization applied to ensure training stability.
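To make the described architecture concrete, the sketch below shows one way the three generator branches, the fusion module, and the three-scale discriminator could be assembled in PyTorch. It is a minimal illustration under stated assumptions, not the paper's implementation: the class names (LowResBranch, MultiResGenerator, MultiScaleDiscriminator), channel widths, kernel sizes, upsampling factors, and the 16,384-sample output length are all illustrative choices.

```python
# Minimal sketch of a multi-resolution generator and multi-scale discriminator.
# All hyperparameters below are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import spectral_norm

class LowResBranch(nn.Module):
    """Low-resolution path: coarse upsampling plus self-attention for long-range structure."""
    def __init__(self, latent_dim=100, channels=64):
        super().__init__()
        self.channels = channels
        self.fc = nn.Linear(latent_dim, channels * 64)            # seed a short feature map
        self.up = nn.Sequential(
            nn.ConvTranspose1d(channels, channels, 16, stride=4, padding=6), nn.ReLU(),
            nn.ConvTranspose1d(channels, channels, 16, stride=4, padding=6), nn.ReLU(),
        )
        self.attn = nn.MultiheadAttention(channels, num_heads=4, batch_first=True)

    def forward(self, z):
        x = self.fc(z).view(z.size(0), self.channels, 64)
        x = self.up(x)                                            # (B, C, 1024)
        q = x.transpose(1, 2)                                     # attention over time steps
        a, _ = self.attn(q, q, q)
        return x + a.transpose(1, 2)                              # residual self-attention

class ResidualBlock(nn.Module):
    """Keeps the macro structure of its input while adding mid-level acoustic detail."""
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv1d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class MultiResGenerator(nn.Module):
    """Low/mid/high-resolution branches whose outputs are merged by a fusion module."""
    def __init__(self, latent_dim=100, channels=64, out_len=16384):
        super().__init__()
        self.out_len = out_len
        self.low = LowResBranch(latent_dim, channels)
        self.mid_up = nn.ConvTranspose1d(channels, channels, 16, stride=4, padding=6)
        self.mid_res = ResidualBlock(channels)
        self.high = nn.Sequential(                                # fine waveform detail
            nn.ConvTranspose1d(channels, channels, 8, stride=2, padding=3), nn.ReLU(),
            nn.ConvTranspose1d(channels, channels, 8, stride=2, padding=3), nn.ReLU(),
        )
        self.fuse = nn.Conv1d(channels * 3, 1, kernel_size=7, padding=3)

    def forward(self, z):
        low = self.low(z)                                         # (B, C, 1024)  coarse structure
        mid = self.mid_res(F.relu(self.mid_up(low)))              # (B, C, 4096)  intermediate detail
        high = self.high(mid)                                     # (B, C, 16384) fine detail
        low_u = F.interpolate(low, size=self.out_len, mode="linear", align_corners=False)
        mid_u = F.interpolate(mid, size=self.out_len, mode="linear", align_corners=False)
        fused = torch.cat([low_u, mid_u, high], dim=1)            # fusion of the three time scales
        return torch.tanh(self.fuse(fused))                       # (B, 1, 16384) waveform

def sn_block(in_ch, out_ch):
    """Spectrally normalized strided convolution block for the discriminators."""
    return nn.Sequential(
        spectral_norm(nn.Conv1d(in_ch, out_ch, 25, stride=4, padding=11)),
        nn.LeakyReLU(0.2),
    )

class ScaleDiscriminator(nn.Module):
    """Convolutional critic operating on a single temporal scale."""
    def __init__(self, channels=64):
        super().__init__()
        self.net = nn.Sequential(
            sn_block(1, channels),
            sn_block(channels, channels * 2),
            sn_block(channels * 2, channels * 4),
            spectral_norm(nn.Conv1d(channels * 4, 1, kernel_size=3, padding=1)),
        )

    def forward(self, x):
        return self.net(x)                                        # patch-wise real/fake scores

class MultiScaleDiscriminator(nn.Module):
    """Evaluates the waveform at the original, 2x- and 4x-downsampled scales in parallel."""
    def __init__(self):
        super().__init__()
        self.discs = nn.ModuleList([ScaleDiscriminator() for _ in range(3)])
        self.pool = nn.AvgPool1d(kernel_size=4, stride=2, padding=1)

    def forward(self, x):
        scores = []
        for d in self.discs:
            scores.append(d(x))
            x = self.pool(x)                                      # halve the effective sampling rate
        return scores                                             # feedback at three time scales
```

As a quick usage check under the same assumptions, sampling a batch of latent vectors and scoring the result exercises all three generator branches and all three discriminator scales:

```python
g, d = MultiResGenerator(), MultiScaleDiscriminator()
fake = g(torch.randn(2, 100))        # (2, 1, 16384) generated waveforms
scores = d(fake)                     # list of three score maps, one per scale
```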
References
  1. A. Graves, A.-R. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6645-6649, 2013.

    10.1109/ICASSP.2013.6638947
  2. B. Schuller, S. Steidl, and A. Batliner, “The INTERSPEECH 2009 emotion challenge,” Proceedings of INTERSPEECH, pp. 312-315, 2009.

    10.21437/Interspeech.2009-103
  3. Z. Zhang, E. Coutinho, J. Deng, and B. Schuller, “Cooperative learning and its application to emotion recognition from speech,” IEEE Signal Processing Magazine, Vol. 34, No. 4, pp. 54-65, 2017.

  4. A. Mesaros, T. Heittola, and T. Virtanen, “A multi-device dataset for urban acoustic scene classification,” Proceedings of the Detection and Classification of Acoustic Scenes and Events (DCASE), pp. 1-5, 2018.

  5. Y. Koizumi, S. Saito, H. Uematsu, Y. Kawachi, and N. Harada, “Unsupervised detection of anomalous sound based on deep learning and the Neyman-Pearson lemma,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 27, No. 1, pp. 212-224, 2019.

    10.1109/TASLP.2018.2877258
  6. E. Fonseca, M. Plakal, F. Font, D. P. W. Ellis, and X. Serra, “Audio tagging with noisy labels and minimal supervision,” Proceedings of the Detection and Classification of Acoustic Scenes and Events (DCASE), pp. 1-5, 2019.

    10.33682/w13e-5v06
  7. V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206-5210, 2015.

    10.1109/ICASSP.2015.7178964
  8. F. Ahmad, A. Shah, and J. Kamruzzaman, “Deep learning based classification of underwater acoustic signals,” Procedia Computer Science, Vol. 227, pp. 256-263, 2024.

  9. A. Tran, T. D. Nguyen, and Q. V. Nguyen, “A comprehensive survey and taxonomy on privacy-preserving deep learning,” Neural Networks, Vol. 174, pp. 428-460, 2024.

  10. Y. Sun, K. Xu, C. Liu, Y. Dou, H. Wang, and B. Ding, “Automated data augmentation for audio classification,” in IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 32, pp. 2716-2728, 2024.

    10.1109/TASLP.2024.3402049
  11. S. Latif, R. Rana, S. Younis, J. Qadir, and J. Epps, “Speech emotion recognition using deep learning techniques: A review,” IEEE Transactions on Affective Computing, Vol. 11, No. 4, pp. 652-672, 2020.

  12. C. Donahue, J. McAuley, and M. Puckette, “Adversarial audio synthesis,” International Conference on Learning Representations (ICLR), pp. 1-16, 2019.

  13. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Neural Information Processing Systems (NIPS 2017), pp. 1-11, 2017.

  14. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778, 2016.

    10.1109/CVPR.2016.90
  15. T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral normalization for generative adversarial networks,” Proceedings of ICLR, pp. 1-26, 2018.

    10.1007/978-3-030-03243-2_860-1
  16. Y. Sun, K. Xu, C. Liu, Y. Dou, H. Wang, and B. Ding, “Automated data augmentation for audio classification," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 32, pp. 2716-2728, 2024.

    10.1109/TASLP.2024.3402049
  17. K. M. Rezau, M. Jewel, S. Islam, and K. Noor, “Enhancing audio classification through MFCC feature extraction and data augmentation with CNN and RNN models,” International Journal of Advanced Computer Science and Applications (IJACSA), Vol. 2, No. 2, pp. 63-84, 2024.

  18. A. Fathan, J. Alam, and W. H. Kang, “Mel-spectrogram image-based end-to-end audio deepfake detection under channel-mismatched conditions,” 2022 IEEE International Conference on Multimedia and Expo (ICME), pp. 1-6, 2022.

    10.1109/ICME52920.2022.9859621
  19. S. D. Handy Permana, and T. K. A. Rahman, “Improved feature extraction for sound recognition using Combined Constant-Q Transform (CQT) and mel spectrogram for CNN Input,” 2023 International Conference on Modeling & E-Information Research, Artificial Learning and Digital Applications (ICMERALDA), pp. 185-190, 2023.

    10.1109/ICMERALDA60125.2023.10458162
  20. D. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “SpecAugment: A simple data augmentation method for automatic speech recognition,” Proceedings of INTERSPEECH 2019, pp. 2613-2617, 2019.

    10.21437/Interspeech.2019-2680
  21. H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “Mixup: Beyond empirical risk minimization,” International Conference on Learning Representations (ICLR), pp. 1-8, 2018.

  22. S. Yun, D. Han, S. Oh, S. Chun, J. Choe, and Y. Yoo, “CutMix: Regularization strategy to train strong classifiers with localizable features,” 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 6022-6031, 2019.

    10.1109/ICCV.2019.00612
  23. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in Neural Information Processing Systems (NeurIPS), pp. 1-9, 2014.

  24. D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” International Conference on Learning Representations (ICLR), pp. 1-14, 2014.

  25. X. Xie, R. Ruzi, X. Liu, and L. Wang, “Variational auto-encoder based variability encoding for dysarthric speech recognition,” Proceedings of INTERSPEECH 2021, pp. 4808-4812, 2021.

    10.21437/Interspeech.2021-173
  26. J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in Neural Information Processing Systems (NeurIPS), pp. 6840-6851, 2020.

  27. S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang, “Generalization and equilibrium in generative adversarial nets (GANs),” International Conference on Machine Learning (ICML), pp. 1-27, 2017.

  28. K. Kumar, R. Kumar, T. Boissiere, L. Gestin, W. Teoh, J. Sotelo, A. Brebisson, Y. Bengio, and A. Courville, “MelGAN: Generative adversarial networks for conditional waveform synthesis,” 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), pp. 1-14, 2019.

  29. R. Yamamoto, E. Song, and J.-M. Kim, “Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6199-6203, 2020.

    10.1109/ICASSP40776.2020.9053795
  30. J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” 34th Conference on Neural Information Processing Systems (NeurIPS 2020), pp. 1-14, 2020.

  31. A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” International Conference on Learning Representations (ICLR), pp. 1-16, 2016.

  32. D. Griffin and J. Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 32, No. 2, pp. 236-243, 1984.

    10.1109/TASSP.1984.1164317
  33. N. Perraudin, P. Balazs, and P. L. Søndergaard, “A fast Griffin-Lim algorithm,” IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1-4, 2013.

    10.1109/WASPAA.2013.6701851
  34. S. G. Lee, W. Ping, B. Ginsburg, B. Catanzaro, and S. Yoon, “BigVGAN: A universal neural vocoder with large-scale training,” International Conference on Learning Representations (ICLR), pp. 1-20, 2023.

  35. O. Oktay, J. Schlemper, L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz, B. Glocker, and D. Rueckert, “Attention U-Net: Learning where to look for the pancreas,” Medical Imaging with Deep Learning (MIDL), pp. 1-10, 2018.

  36. I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville, “Improved training of Wasserstein GANs,” Advances in Neural Information Processing Systems (NeurIPS), Vol. 30, pp. 5767-5777, 2017.

  37. S. G. Mallat, “A theory for multiresolution signal decomposition: The wavelet representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), Vol. 11, No. 7, pp. 674-693, 1989.

    10.1109/34.192463
  38. L. R. Rabiner and R. W. Schafer, Theory and applications of digital speech processing, Pearson, 2010.

  39. M. Irfan, Z. Jiangbin, S. Ali, M. Iqbal, Z. Masood, and U. Hamid, “DeepShip: An underwater acoustic benchmark dataset and a separable convolution based autoencoder for classification,” Expert Systems with Applications, Vol. 183, 115270, 2021.

    10.1016/j.eswa.2021.115270
  40. F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, and B. Weiss, “A database of German emotional speech,” in 9th European Conference on Speech Communication and Technology, pp. 1-4, 2005.

    10.21437/Interspeech.2005-446
  41. G. Tzanetakis, and P. Cook, “Musical genre classification of audio signals,” IEEE Transactions on Speech and Audio Processing, Vol. 10, No. 5, pp. 293-302, 2002.

    10.1109/TSA.2002.800560
  42. K. J. Piczak, “ESC: Dataset for environmental sound classification,” Proceedings of the 23rd Annual ACM Conference on Multimedia, pp. 1015-1018, 2015.

    10.1145/2733373.2806390
Information
  • Publisher: The Society of Convergence Knowledge
  • Publisher (Ko): 융복합지식학회
  • Journal Title: The Society of Convergence Knowledge Transactions
  • Journal Title (Ko): 융복합지식학회논문지
  • Volume: 14
  • No.: 1
  • Pages: 99-120