Research Article
- A. Graves, A.-R. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6645-6649, 2013, doi: 10.1109/ICASSP.2013.6638947.
- B. Schuller, S. Steidl, and A. Batliner, “The INTERSPEECH 2009 Emotion Challenge,” Proceedings of INTERSPEECH, pp. 312-315, 2009, doi: 10.21437/Interspeech.2009-103.
- Z. Zhang, E. Coutinho, J. Deng, and B. Schuller, “Cooperative learning and its application to emotion recognition from speech,” IEEE Signal Processing Magazine, Vol. 34, No. 4, pp. 54-65, 2017.
- A. Mesaros, T. Heittola, and T. Virtanen, “A multi-device dataset for urban acoustic scene classification,” Proceedings of the Detection and Classification of Acoustic Scenes and Events (DCASE), pp. 1-5, 2018.
- Y. Koizumi, S. Saito, H. Uematsu, Y. Kawachi, and N. Harada, “Unsupervised detection of anomalous sound based on deep learning and the Neyman-Pearson lemma,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 27, No. 1, pp. 212-224, 2019, doi: 10.1109/TASLP.2018.2877258.
- E. Fonseca, M. Plakal, F. Font, D. P. W. Ellis, and X. Serra, “Audio tagging with noisy labels and minimal supervision,” Proceedings of the Detection and Classification of Acoustic Scenes and Events (DCASE), pp. 1-5, 2019, doi: 10.33682/w13e-5v06.
- V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “LibriSpeech: An ASR corpus based on public domain audio books,” Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206-5210, 2015, doi: 10.1109/ICASSP.2015.7178964.
- F. Ahmad, A. Shah, and J. Kamruzzaman, “Deep learning based classification of underwater acoustic signals,” Procedia Computer Science, Vol. 227, pp. 256-263, 2024.
- A. Tran, T. D. Nguyen, and Q. V. Nguyen, “A comprehensive survey and taxonomy on privacy-preserving deep learning,” Neural Networks, Vol. 174, pp. 428-460, 2024.
- Y. Sun, K. Xu, C. Liu, Y. Dou, H. Wang, and B. Ding, “Automated data augmentation for audio classification,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 32, pp. 2716-2728, 2024, doi: 10.1109/TASLP.2024.3402049.
- S. Latif, R. Rana, S. Younis, J. Qadir, and J. Epps, “Speech emotion recognition using deep learning techniques: A review,” IEEE Transactions on Affective Computing, Vol. 11, No. 4, pp. 652-672, 2020.
- C. Donahue, J. McAuley, and M. Puckette, “Adversarial audio synthesis,” International Conference on Learning Representations (ICLR), pp. 1-16, 2019.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems (NIPS 2017), pp. 1-11, 2017.
- K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778, 2016, doi: 10.1109/CVPR.2016.90.
- T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral normalization for generative adversarial networks,” International Conference on Learning Representations (ICLR), pp. 1-26, 2018.
- K. M. Rezau, M. Jewel, S. Islam, and K. Noor, “Enhancing audio classification through MFCC feature extraction and data augmentation with CNN and RNN models,” International Journal of Advanced Computer Science and Applications (IJACSA), Vol. 2, No. 2, pp. 63-84, 2024.
- A. Fathan, J. Alam, and W. H. Kang, “Mel-spectrogram image-based end-to-end audio deepfake detection under channel-mismatched conditions,” 2022 IEEE International Conference on Multimedia and Expo (ICME), pp. 1-6, 2022, doi: 10.1109/ICME52920.2022.9859621.
- S. D. Handy Permana and T. K. A. Rahman, “Improved feature extraction for sound recognition using combined Constant-Q Transform (CQT) and mel spectrogram for CNN input,” 2023 International Conference on Modeling & E-Information Research, Artificial Learning and Digital Applications (ICMERALDA), pp. 185-190, 2023, doi: 10.1109/ICMERALDA60125.2023.10458162.
- D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “SpecAugment: A simple data augmentation method for automatic speech recognition,” Proceedings of INTERSPEECH 2019, pp. 2613-2617, 2019, doi: 10.21437/Interspeech.2019-2680.
- H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” International Conference on Learning Representations (ICLR), pp. 1-8, 2018.
- S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo, “CutMix: Regularization strategy to train strong classifiers with localizable features,” 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 6022-6031, 2019, doi: 10.1109/ICCV.2019.00612.
- I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” Advances in Neural Information Processing Systems (NeurIPS), pp. 1-9, 2014.
- D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” International Conference on Learning Representations (ICLR), pp. 1-14, 2014.
- X. Xie, R. Ruzi, X. Liu, and L. Wang, “Variational auto-encoder based variability encoding for dysarthric speech recognition,” Proceedings of INTERSPEECH 2021, pp. 4808-4812, 2021, doi: 10.21437/Interspeech.2021-173.
- J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in Neural Information Processing Systems (NeurIPS), pp. 6840-6851, 2020.
- S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang, “Generalization and equilibrium in generative adversarial nets (GANs),” International Conference on Machine Learning (ICML), pp. 1-27, 2017.
- K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Teoh, J. Sotelo, A. de Brébisson, Y. Bengio, and A. Courville, “MelGAN: Generative adversarial networks for conditional waveform synthesis,” 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), pp. 1-14, 2019.
- R. Yamamoto, E. Song, and J.-M. Kim, “Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6199-6203, 2020, doi: 10.1109/ICASSP40776.2020.9053795.
- J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” 34th Conference on Neural Information Processing Systems (NeurIPS 2020), pp. 1-14, 2020.
- A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” International Conference on Learning Representations (ICLR), pp. 1-16, 2016.
- D. Griffin and J. Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 32, No. 2, pp. 236-243, 1984, doi: 10.1109/TASSP.1984.1164317.
- N. Perraudin, P. Balazs, and P. L. Søndergaard, “A fast Griffin-Lim algorithm,” IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1-4, 2013, doi: 10.1109/WASPAA.2013.6701851.
- S. G. Lee, W. Ping, B. Ginsburg, B. Catanzaro, and S. Yoon, “BigVGAN: A universal neural vocoder with large-scale training,” International Conference on Learning Representations (ICLR), pp. 1-20, 2023.
- O. Oktay, J. Schlemper, L. Le Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz, B. Glocker, and D. Rueckert, “Attention U-Net: Learning where to look for the pancreas,” Medical Imaging with Deep Learning (MIDL), pp. 1-10, 2018.
- I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville, “Improved training of Wasserstein GANs,” Advances in Neural Information Processing Systems (NeurIPS), Vol. 30, pp. 5767-5777, 2017.
- S. G. Mallat, “A theory for multiresolution signal decomposition: The wavelet representation,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), Vol. 11, No. 7, pp. 674-693, 1989, doi: 10.1109/34.192463.
- L. R. Rabiner and R. W. Schafer, Theory and Applications of Digital Speech Processing, Pearson, 2010.
- M. Irfan, Z. Jiangbin, S. Ali, M. Iqbal, Z. Masood, and U. Hamid, “DeepShip: An underwater acoustic benchmark dataset and a separable convolution based autoencoder for classification,” Expert Systems with Applications, Vol. 183, Article 115270, 2021, doi: 10.1016/j.eswa.2021.115270.
- F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, and B. Weiss, “A database of German emotional speech,” 9th European Conference on Speech Communication and Technology (INTERSPEECH), pp. 1-4, 2005, doi: 10.21437/Interspeech.2005-446.
- G. Tzanetakis and P. Cook, “Musical genre classification of audio signals,” IEEE Transactions on Speech and Audio Processing, Vol. 10, No. 5, pp. 293-302, 2002, doi: 10.1109/TSA.2002.800560.
- K. J. Piczak, “ESC: Dataset for environmental sound classification,” Proceedings of the 23rd Annual ACM Conference on Multimedia, pp. 1015-1018, 2015, doi: 10.1145/2733373.2806390.
- Publisher: The Society of Convergence Knowledge
- Publisher (Ko): 융복합지식학회
- Journal Title: The Society of Convergence Knowledge Transactions
- Journal Title (Ko): 융복합지식학회논문지
- Volume: 14
- No: 1
- Pages: 99-120
- DOI: https://doi.org/10.22716/sckt.2026.14.1.009

