Research Article

30 September 2024, pp. 81-103
Abstract
Recently, as deep learning technology has been adopted in classification systems across many fields, research on maximizing the performance of deep learning models has become increasingly active. The performance of a deep learning model depends heavily on the amount and quality of its training data, and the deeper and more complex the model, the more training data it requires. When training data are insufficient or imbalanced across classes, overfitting occurs and performance degrades. In fields that work with speech and audio, expanding the training data and resolving class imbalance are therefore key issues for classification performance, and research addressing them is essential. In this paper, we propose W2GAN-GP, an augmentation model that effectively augments one-dimensional audio signals by adapting the WGAN model originally devised for two-dimensional image augmentation. We further propose a dual audio-signal augmentation technique: the raw signal is first augmented using a time-frequency transform, and a second augmentation is then performed with the proposed W2GAN-GP model. To validate the audio data generated by the proposed technique, classification accuracy was measured using ResNet50 and DenseNet classification models. The results confirm an accuracy improvement of approximately 27-30% compared with training without augmentation.
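The abstract does not give implementation details, but the W2GAN-GP model builds on the WGAN-GP formulation (refs. 37, 38), whose critic objective is standardly written as

$$\mathcal{L}_D = \mathbb{E}_{\tilde{x}\sim\mathbb{P}_g}\big[D(\tilde{x})\big] - \mathbb{E}_{x\sim\mathbb{P}_r}\big[D(x)\big] + \lambda\,\mathbb{E}_{\hat{x}\sim\mathbb{P}_{\hat{x}}}\Big[\big(\lVert\nabla_{\hat{x}}D(\hat{x})\rVert_2 - 1\big)^2\Big],$$

where $\mathbb{P}_r$ and $\mathbb{P}_g$ are the real and generated data distributions, $\hat{x}$ is sampled uniformly along lines between real and generated samples, and $\lambda$ weights the gradient penalty.

The sketch below illustrates the two ingredients named in the abstract for 1-D signals: a time-frequency first-stage perturbation and the WGAN-GP gradient penalty. It is a minimal sketch only, assuming a PyTorch implementation; the function names, the choice of perturbation, and the critic interface are illustrative assumptions, not the authors' code.

```python
import torch

def stage1_tf_augment(wave: torch.Tensor, n_fft: int = 512) -> torch.Tensor:
    """First-stage augmentation sketch: perturb the signal in the
    time-frequency domain, then invert back to a 1-D waveform.
    The specific perturbation (random magnitude jitter) is an
    illustrative assumption, not the paper's exact transform."""
    window = torch.hann_window(n_fft, device=wave.device)
    spec = torch.stft(wave, n_fft=n_fft, window=window, return_complex=True)
    spec = spec * (1.0 + 0.05 * torch.randn_like(spec.real))  # mild random jitter
    return torch.istft(spec, n_fft=n_fft, window=window, length=wave.shape[-1])

def gradient_penalty(critic, real: torch.Tensor, fake: torch.Tensor) -> torch.Tensor:
    """Standard WGAN-GP gradient penalty (refs. 37-38) applied to
    1-D audio batches of shape (batch, 1, n_samples)."""
    batch = real.size(0)
    eps = torch.rand(batch, 1, 1, device=real.device)  # per-sample mixing weights
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = critic(interp)
    grads = torch.autograd.grad(
        outputs=scores, inputs=interp,
        grad_outputs=torch.ones_like(scores),
        create_graph=True, retain_graph=True,
    )[0].reshape(batch, -1)
    return ((grads.norm(2, dim=1) - 1.0) ** 2).mean()
```

In training, the penalty term would be added to the critic loss with weight λ (ref. 38 uses λ = 10), while the first-stage transform expands the real-data pool fed to the GAN and to the downstream ResNet50/DenseNet classifiers.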
References
  1. L. Perez and J. Wang, "The effectiveness of data augmentation in image classification using deep learning", CoRR, pp. 1-8, 2017.

  2. O. Abayomi-Alli, R. Damaševičius, A. Qazi, M. Adedoyin-Olowe, and S. Misra, "Data augmentation and deep learning methods in sound classification: a systematic review", Electronics, Vol. 11, No. 22, 3795, pp. 1-32, 2022.

    10.3390/electronics11223795
  3. H. Chu, Y. Zhang, and H. Chiang, "A CNN Sound Classification Mechanism Using Data Augmentation", Sensors (Basel), Vol. 23, No. 15, 6972, 2023.

    10.3390/s23156972
  4. W. Han, Z. Zhang, Y. Zhang, J. Yu, C.-C. Chiu, J. Qin, A. Gulati, R. Pang, and Y. Wu, "ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context", In Proceedings of the Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Shanghai, pp. 3610-3614, 2020.

    10.21437/Interspeech.2020-2059
  5. M. Orken, O. Dina, A. Keylan, T. Tolganay, and O. Mohamed, "A study of transformer-based end-to-end speech recognition system for Kazakh language", Sci. Rep., Vol. 12, 8337, 2022.

    10.1038/s41598-022-12260-y
  6. Y. Zhang, "Music Recommendation System and Recommendation Model Based on Convolutional Neural Network", Mob. Inf. Syst., 3387598, 2022.

    10.1155/2022/3387598
  7. K. Huang, H. Qin, X. Zhang, and H. Zhang, "Music recommender system based on graph convolutional neural networks with attention mechanism", Neural Netw., Vol. 135, pp. 107-117, 2021.

  8. M. Green and D. Murphy, "Environmental sound monitoring using machine learning on mobile devices", Appl. Acoust., Vol. 159, 107041, 2020.

    10.1016/j.apacoust.2019.107041
  9. A. Nogueira, H. Oliveira, J. Machado, and J. Tavares, "Sound Classification and Processing of Urban Environments: A Systematic Literature Review", Sensors, Vol. 22, 8608, 2022.

    10.3390/s22228608
  10. T. Nishida, K. Dohi, T. Endo, M. Yamamoto, and Y. Kawaguchi, "Anomalous Sound Detection Based on Machine Activity Detection", In Proceedings of the 2022 30th European Signal Processing Conference (EUSIPCO), pp. 269-273, 2022.

    10.23919/EUSIPCO55093.2022.9909901
  11. Y. Wang, Y. Zheng, Y. Zhang, Y. Xie, S. Xu, Y. Hu, and L. He, "Unsupervised Anomalous Sound Detection for Machine Condition Monitoring Using Classification-Based Methods", Appl. Sci., Vol. 11, 11128, 2021.

    10.3390/app112311128
  12. M. Crocco, M. Cristani, A. Trucco, and V. Murino, "Audio surveillance: A systematic review", ACM Comput. Surv., Vol. 48, pp. 1-46, 2016.

    10.1145/2871183
  13. Y. Leng, W. Zhao, C. Lin, C. Sun, R. Wang, Q. Yuan, and D. Li, "LDA-based data augmentation algorithm for acoustic scene classification", Knowl.-Based Syst., Vol. 195, 105600, 2020.

    10.1016/j.knosys.2020.105600
  14. M. Lech, M. Stolar, C. Best, and R. Bolia, "Real-Time Speech Emotion Recognition Using a Pre-trained Image Classification Network: Effects of Bandwidth Reduction and Companding", Frontiers in Computer Science, Vol. 2, No. 14, pp. 1-14, 2020.

    10.3389/fcomp.2020.00014
  15. N. Takahashi, M. Gygli, and L. Van Gool, "AENet: Learning deep audio features for video analysis", IEEE Trans. Multimedia, Vol. 20, pp. 513-524, 2018.

    10.1109/TMM.2017.2751969
  16. H. Liu, Z. Chen, Y. Yuan, X. Mei, X. Liu, D. Mandic, W. Wang, and M. D. Plumbley, "AudioLDM: Text-to-Audio Generation with Latent Diffusion Models", International Conference on Machine Learning (ICML), pp. 1-25, 2023.

  17. H. Alonso, M. Barragán Pulido, J. Gil Bordón, M. Ferrer Ballester, and C. Travieso González, "Speech evaluation of patients with Alzheimer's disease using an automatic interviewer", Expert Syst. Appl., Vol. 192, 116386, 2022.

    10.1016/j.eswa.2021.116386
  18. Y. Jeong, J. Kim, D. Kim, J. Kim, and K. Lee, "Methods for improving deep learning-based cardiac auscultation accuracy: Data augmentation and data generalization", Appl. Sci., Vol. 11, 4544, 2021.

    10.3390/app11104544
  19. Y. Sun, A. Wong, and M. Kamel, "Classification of Imbalanced Data: A Review", Int. J. Pattern Recognit. Artif. Intell., Vol. 23, pp. 687-719, 2009.

    10.1142/S0218001409007326
  20. L. Ferreira-Paiva, E. Alfaro-Espinoza, V. Almeida, L. Felix, and R. Neves, "A Survey of Data Augmentation for Audio Classification", Sociedade Brasileira de Automática (SBA), Vol. 3, No. 1, pp. 2165-2172, 2022.

    10.20906/CBA2022/3469
  21. H. Purwins, B. Li, T. Virtanen, J. Schlüter, S. Y. Chang, and T. Sainath, "Deep Learning for Audio Signal Processing", in IEEE Journal of Selected Topics in Signal Processing, Vol. 13, No. 2, pp. 206-219, 2019.

    10.1109/JSTSP.2019.2908700
  22. D. S. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition", Proc. Interspeech 2019, pp. 2613-2617, 2019.

    10.21437/Interspeech.2019-2680
  23. A. Jain, P. R. Samala, D. Mittal, P. Jyothi, and M. Singh, "SPLICEOUT: A Simple and Efficient Audio Augmentation Method", Proc. Interspeech 2022, pp. 2678-2682, 2022.

    10.21437/Interspeech.2022-572
  24. S. Yun, D. Han, S. Chun, S. Oh, Y. Yoo, and J. Choe, "CutMix: Regularization strategy to train strong classifiers with localizable features", In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 6022-6031, 2019.

    10.1109/ICCV.2019.00612
  25. G. Kim, D. K. Han, and H. Ko, "SpecMix: Data Augmentation for Speech Recognition", Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 1, pp. 6-10, 2021.

  26. H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, "mixup: Beyond empirical risk minimization", In International Conference on Learning Representations (ICLR), pp. 1-13, 2018.

  27. C. Donahue, J. McAuley, and M. Puckette, "Adversarial Audio Synthesis", International Conference on Learning Representations (ICLR), pp. 1-15, 2018.

  28. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial networks", in Advances in Neural Information Processing Systems (NIPS), pp. 2672-2680, 2014.

  29. K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brebisson, Y. Bengio, and A. Courville, "MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis", Advances in Neural Information Processing Systems (NeurIPS), pp. 14910-14921, 2019.

  30. H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, "CycleGAN-VC: Non-parallel Voice Conversion using Cycle-Consistent Adversarial Networks," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5279-5283, 2018.

    10.1109/ICASSP.2018.8462342
  31. A. Mustafa, N. Pia, and G. Fuchs, "StyleMelGAN: An Efficient High-Fidelity Adversarial Vocoder with Temporal Adaptive Normalization," IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6034-6038, 2021.

  32. C. Aironi, S. Cornell, L. Serafini, and S. Squartini, "A Time-Frequency Generative Adversarial based method for Audio Packet Loss Concealment", European Signal Processing Conference (EUSIPCO), pp. 1-5, 2023.

    10.23919/EUSIPCO58844.2023.10290027
  33. R. Prenger, R. Valle, and B. Catanzaro, "WaveGlow: A Flow-based Generative Network for Speech Synthesis", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3617-3621, 2019.

    10.1109/ICASSP.2019.8683143
  34. R. Yamamoto, E. Song, and J. Kim, "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6199-6203, 2020.

    10.1109/ICASSP40776.2020.9053795
  35. J. Kong, J. Kim, and J. Bae, "HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis", Conference on Neural Information Processing Systems (NeurIPS 2020), pp. 1-14, 2020.

  36. J. Engel, K. K. Agrawal, S. Chen, I. Gulrajani, C. Donahue, and A. Roberts, "GANSynth: Adversarial Neural Audio Synthesis", International Conference on Learning Representations (ICLR), pp. 1-17, 2019.

  37. M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein GAN", Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 214-223, 2017.

  38. I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville, "Improved training of wasserstein GANs", In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17), pp. 5769-5779, 2017.

  39. T. Karras, S. Laine, and T. Aila, "A Style-Based Generator Architecture for Generative Adversarial Networks", in IEEE Transactions on Pattern Analysis & Machine Intelligence, Vol. 43, No. 12, pp. 4217-4228, 2021.

    10.1109/TPAMI.2020.2970919
  40. P. Isola, J. Zhu, T. Zhou, and A. A. Efros, "Image-to-Image Translation with Conditional Adversarial Networks", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5967-5976, 2017.

    10.1109/CVPR.2017.632
  41. M. Mirza and S. Osindero, "Conditional Generative Adversarial Nets", arXiv preprint arXiv:1411.1784, pp. 1-7, 2014.

  42. D. P. Kingma and P. Dhariwal, "Glow: Generative flow with invertible 1x1 convolutions", Advances in Neural Information Processing Systems (NeurIPS 2018), pp. 1-10, 2018.

  43. A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio", Speech Synthesis Workshop, pp. 1-15, 2016.

  44. S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio, "SampleRNN: An unconditional end-to-end neural audio generation model", International Conference on Learning Representations (ICLR), 2017.

Information
  • Publisher: The Society of Convergence Knowledge
  • Publisher (Ko): 융복합지식학회
  • Journal Title: The Society of Convergence Knowledge Transactions
  • Journal Title (Ko): 융복합지식학회논문지
  • Volume: 12
  • No.: 3
  • Pages: 81-103