
2026, Vol. 14, Issue 1

Research Article

31 March 2026. pp. 99-120
Abstract
Audio-based deep learning applications face significant challenges in securing high-quality audio data due to data scarcity, high acquisition costs, and ethical and legal constraints. Existing generative models are limited by single-resolution, single-scale architectures, which hinder simultaneous modeling of short- and long-term temporal patterns in audio signals. This paper proposes MRMS-WaveGAN, a waveform-domain data augmentation model that integrates a multi-resolution generator and a multi-scale discriminator, built on WaveGAN, which generates audio directly from raw waveforms without spectrogram conversion and thereby avoids phase-information loss. The multi-resolution generator comprises three parallel branches at low, mid, and high resolutions. The low-resolution branch applies Self-Attention to capture long-range temporal dependencies beyond the receptive field of convolution; the mid-resolution branch uses Residual Blocks to preserve the macro-structural representation of the low-resolution path while adding intermediate-level acoustic features; and the high-resolution branch generates fine-grained waveform detail through stacked transposed convolutions. The outputs of the three branches are integrated by a fusion module, producing waveforms in which features at multiple time scales are reflected complementarily. The multi-scale discriminator evaluates inputs in parallel at the original, 2× downsampled, and 4× downsampled scales, providing multi-level training feedback ranging from waveform-level high-frequency artifact detection to long-term structural consistency assessment, with Spectral Normalization applied to ensure training stability.
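To make the described architecture concrete, the sketch below shows one way the three generator branches, the fusion module, and the three-scale discriminator could be assembled in PyTorch. It is a minimal illustration under stated assumptions, not the paper's implementation: the class names (LowResBranch, MultiResGenerator, MultiScaleDiscriminator), channel widths, kernel sizes, upsampling factors, and the 16,384-sample output length are all illustrative choices.

```python
# Minimal sketch of a multi-resolution generator and multi-scale discriminator.
# All hyperparameters below are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import spectral_norm

class LowResBranch(nn.Module):
    """Low-resolution path: coarse upsampling plus self-attention for long-range structure."""
    def __init__(self, latent_dim=100, channels=64):
        super().__init__()
        self.channels = channels
        self.fc = nn.Linear(latent_dim, channels * 64)            # seed a short feature map
        self.up = nn.Sequential(
            nn.ConvTranspose1d(channels, channels, 16, stride=4, padding=6), nn.ReLU(),
            nn.ConvTranspose1d(channels, channels, 16, stride=4, padding=6), nn.ReLU(),
        )
        self.attn = nn.MultiheadAttention(channels, num_heads=4, batch_first=True)

    def forward(self, z):
        x = self.fc(z).view(z.size(0), self.channels, 64)
        x = self.up(x)                                            # (B, C, 1024)
        q = x.transpose(1, 2)                                     # attention over time steps
        a, _ = self.attn(q, q, q)
        return x + a.transpose(1, 2)                              # residual self-attention

class ResidualBlock(nn.Module):
    """Keeps the macro structure of its input while adding mid-level acoustic detail."""
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv1d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class MultiResGenerator(nn.Module):
    """Low/mid/high-resolution branches whose outputs are merged by a fusion module."""
    def __init__(self, latent_dim=100, channels=64, out_len=16384):
        super().__init__()
        self.out_len = out_len
        self.low = LowResBranch(latent_dim, channels)
        self.mid_up = nn.ConvTranspose1d(channels, channels, 16, stride=4, padding=6)
        self.mid_res = ResidualBlock(channels)
        self.high = nn.Sequential(                                # fine waveform detail
            nn.ConvTranspose1d(channels, channels, 8, stride=2, padding=3), nn.ReLU(),
            nn.ConvTranspose1d(channels, channels, 8, stride=2, padding=3), nn.ReLU(),
        )
        self.fuse = nn.Conv1d(channels * 3, 1, kernel_size=7, padding=3)

    def forward(self, z):
        low = self.low(z)                                         # (B, C, 1024)  coarse structure
        mid = self.mid_res(F.relu(self.mid_up(low)))              # (B, C, 4096)  intermediate detail
        high = self.high(mid)                                     # (B, C, 16384) fine detail
        low_u = F.interpolate(low, size=self.out_len, mode="linear", align_corners=False)
        mid_u = F.interpolate(mid, size=self.out_len, mode="linear", align_corners=False)
        fused = torch.cat([low_u, mid_u, high], dim=1)            # fusion of the three time scales
        return torch.tanh(self.fuse(fused))                       # (B, 1, 16384) waveform

def sn_block(in_ch, out_ch):
    """Spectrally normalized strided convolution block for the discriminators."""
    return nn.Sequential(
        spectral_norm(nn.Conv1d(in_ch, out_ch, 25, stride=4, padding=11)),
        nn.LeakyReLU(0.2),
    )

class ScaleDiscriminator(nn.Module):
    """Convolutional critic operating on a single temporal scale."""
    def __init__(self, channels=64):
        super().__init__()
        self.net = nn.Sequential(
            sn_block(1, channels),
            sn_block(channels, channels * 2),
            sn_block(channels * 2, channels * 4),
            spectral_norm(nn.Conv1d(channels * 4, 1, kernel_size=3, padding=1)),
        )

    def forward(self, x):
        return self.net(x)                                        # patch-wise real/fake scores

class MultiScaleDiscriminator(nn.Module):
    """Evaluates the waveform at the original, 2x- and 4x-downsampled scales in parallel."""
    def __init__(self):
        super().__init__()
        self.discs = nn.ModuleList([ScaleDiscriminator() for _ in range(3)])
        self.pool = nn.AvgPool1d(kernel_size=4, stride=2, padding=1)

    def forward(self, x):
        scores = []
        for d in self.discs:
            scores.append(d(x))
            x = self.pool(x)                                      # halve the effective sampling rate
        return scores                                             # feedback at three time scales
```

As a quick usage check under the same assumptions, sampling a batch of latent vectors and scoring the result exercises all three generator branches and all three discriminator scales:

```python
g, d = MultiResGenerator(), MultiScaleDiscriminator()
fake = g(torch.randn(2, 100))        # (2, 1, 16384) generated waveforms
scores = d(fake)                     # list of three score maps, one per scale
```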
References
  1. A. Graves, A.-R. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6645-6649, 2013.

    10.1109/ICASSP.2013.6638947
  2. B. Schuller, S. Steidl, and A. Batliner, “The INTERSPEECH 2009 emotion challenge,” Proceedings of INTERSPEECH, pp. 312-315, 2009.

    10.21437/Interspeech.2009-103
  3. Z. Zhang, E. Coutinho, J. Deng, and B. Schuller, “Cooperative learning and its application to emotion recognition from speech,” IEEE Signal Processing Magazine, Vol. 34, No. 4, pp. 54-65, 2017.

  4. A. Mesaros, T. Heittola, and T. Virtanen, “A multi-device dataset for urban acoustic scene classification,” Proceedings of the Detection and Classification of Acoustic Scenes and Events (DCASE), pp. 1-5, 2018.

  5. Y. Koizumi, S. Saito, H. Uematsu, Y. Kawachi, and N. Harada, “Unsupervised detection of anomalous sound based on deep learning and the Neyman-Pearson lemma,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 27, No. 1, pp. 212-224, 2019.

    10.1109/TASLP.2018.2877258
  6. E. Fonseca, M. Plakal, F. Font, D. P. W. Ellis, and X. Serra, “Audio tagging with noisy labels and minimal supervision,” Proceedings of the Detection and Classification of Acoustic Scenes and Events (DCASE), pp. 1-5, 2019.

    10.33682/w13e-5v06
  7. V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206-5210, 2015.

    10.1109/ICASSP.2015.7178964
  8. F. Ahmad, A. Shah, and J. Kamruzzaman, “Deep learning based classification of underwater acoustic signals,” Procedia Computer Science, Vol. 227, pp. 256-263, 2024.

  9. A. Tran, T. D. Nguyen, and Q. V. Nguyen, “A comprehensive survey and taxonomy on privacy-preserving deep learning,” Neural Networks, Vol. 174, pp. 428-460, 2024.

  10. Y. Sun, K. Xu, C. Liu, Y. Dou, H. Wang, and B. Ding, “Automated data augmentation for audio classification,” in IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 32, pp. 2716-2728, 2024.

    10.1109/TASLP.2024.3402049
  11. S. Latif, R. Rana, S. Younis, J. Qadir, and J. Epps, “Speech emotion recognition using deep learning techniques: A review,” IEEE Transactions on Affective Computing, Vol. 11, No. 4, pp. 652-672, 2020.

  12. C. Donahue, J. McAuley, and M. Puckette, “Adversarial audio synthesis,” International Conference on Learning Representations (ICLR), pp. 1-16, 2019.

  13. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Neural Information Processing Systems (NIPS 2017), pp. 1-11, 2017.

  14. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778, 2016.

    10.1109/CVPR.2016.90
  15. T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral normalization for generative adversarial networks,” Proceedings of ICLR, pp. 1-26, 2018.

    10.1007/978-3-030-03243-2_860-1
  16. Y. Sun, K. Xu, C. Liu, Y. Dou, H. Wang, and B. Ding, “Automated data augmentation for audio classification," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 32, pp. 2716-2728, 2024.

    10.1109/TASLP.2024.3402049
  17. K. M. Rezau, M. Jewel, S. Islam, and K. Noor, “Enhancing audio classification through MFCC feature extraction and data augmentation with CNN and RNN models,” International Journal of Advanced Computer Science and Applications (IJACSA), Vol. 2, No. 2, pp. 63-84, 2024.

  18. A. Fathan, J. Alam, and W. H. Kang, “Mel-spectrogram image-based end-to-end audio deepfake detection under channel-mismatched conditions,” 2022 IEEE International Conference on Multimedia and Expo (ICME), pp. 1-6, 2022.

    10.1109/ICME52920.2022.9859621
  19. S. D. Handy Permana, and T. K. A. Rahman, “Improved feature extraction for sound recognition using Combined Constant-Q Transform (CQT) and mel spectrogram for CNN Input,” 2023 International Conference on Modeling & E-Information Research, Artificial Learning and Digital Applications (ICMERALDA), pp. 185-190, 2023.

    10.1109/ICMERALDA60125.2023.10458162
  20. D. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “SpecAugment: A simple data augmentation method for automatic speech recognition,” Proceedings of INTERSPEECH 2019, pp. 2613-2617, 2019.

    10.21437/Interspeech.2019-2680
  21. H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “Mixup: Beyond empirical risk minimization,” International Conference on Learning Representations (ICLR), pp. 1-8, 2018.

  22. S. Yun, D. Han, S. Oh, S. Chun, J. Choe, and Y. Yoo, “CutMix: Regularization strategy to train strong classifiers with localizable features,” 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 6022-6031, 2019.

    10.1109/ICCV.2019.00612
  23. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in Neural Information Processing Systems (NeurIPS), pp. 1-9, 2014.

  24. D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” International Conference on Learning Representations (ICLR), pp. 1-14, 2014.

  25. X. Xie, R. Ruzi, X. Liu, and L. Wang, “Variational auto-encoder based variability encoding for dysarthric speech recognition,” Proceedings of INTERSPEECH 2021, pp. 4808-4812, 2021.

    10.21437/Interspeech.2021-173
  26. J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in Neural Information Processing Systems (NeurIPS), pp. 6840-6851, 2020.

  27. S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang, “Generalization and equilibrium in generative adversarial nets (GANs),” International Conference on Machine Learning (ICML), pp. 1-27, 2017.

  28. K. Kumar, R. Kumar, T. Boissiere, L. Gestin, W. Teoh, J. Sotelo, A. Brebisson, Y. Bengio, and A. Courville, “MelGAN: Generative adversarial networks for conditional waveform synthesis,” 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), pp. 1-14, 2019.

  29. R. Yamamoto, E. Song, and J.-M. Kim, “Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6199-6203, 2020.

    10.1109/ICASSP40776.2020.9053795
  30. J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” 34th Conference on Neural Information Processing Systems (NeurIPS 2020), pp. 1-14, 2020.

  31. A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” International Conference on Learning Representations (ICLR), pp. 1-16, 2016.

  32. D. Griffin and J. Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 32, No. 2, pp. 236-243, 1984.

    10.1109/TASSP.1984.1164317
  33. N. Perraudin, P. Balazs, and P. L. Søndergaard, “A fast Griffin-Lim algorithm,” IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1-4, 2013.

    10.1109/WASPAA.2013.6701851
  34. S. G. Lee, W. Ping, B. Ginsburg, B. Catanzaro, and S. Yoon, “BigVGAN: A universal neural vocoder with large-scale training,” International Conference on Learning Representations (ICLR), pp. 1-20, 2023.

  35. O. Oktay, J. Schlemper, L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz, B. Glocker, and D. Rueckert, “Attention U-Net: Learning where to look for the pancreas,” Medical Imaging with Deep Learning (MIDL), pp. 1-10, 2018.

  36. I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville, “Improved training of Wasserstein GANs,” Advances in Neural Information Processing Systems (NeurIPS), Vol. 30, pp. 5767-5777, 2017.

  37. S. G. Mallat, “A theory for multiresolution signal decomposition: The wavelet representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), Vol. 11, No. 7, pp. 674-693, 1989.

    10.1109/34.192463
  38. L. R. Rabiner and R. W. Schafer, Theory and applications of digital speech processing, Pearson, 2010.

  39. M. Irfan, Z. Jiangbin, S. Ali, M. Iqbal, Z. Masood, and U. Hamid, “DeepShip: An underwater acoustic benchmark dataset and a separable convolution based autoencoder for classification,” Expert Systems with Applications, Vol. 183, 115270, 2021.

    10.1016/j.eswa.2021.115270
  40. F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, and B. Weiss, “A database of German emotional speech,” in 9th European Conference on Speech Communication and Technology, pp. 1-4, 2005.

    10.21437/Interspeech.2005-446
  41. G. Tzanetakis, and P. Cook, “Musical genre classification of audio signals,” IEEE Transactions on Speech and Audio Processing, Vol. 10, No. 5, pp. 293-302, 2002.

    10.1109/TSA.2002.800560
  42. K. J. Piczak, “ESC: Dataset for environmental sound classification,” Proceedings of the 23rd Annual ACM Conference on Multimedia, pp. 1015-1018, 2015.

    10.1145/2733373.2806390
Information
  • Publisher: The Society of Convergence Knowledge
  • Publisher (Ko): 융복합지식학회
  • Journal Title: The Society of Convergence Knowledge Transactions
  • Journal Title (Ko): 융복합지식학회논문지
  • Volume: 14
  • No.: 1
  • Pages: 99-120