Document Type: Review Article

Authors

International Islamic University Malaysia, Electrical and Computer Engineering, Malaysia.

Abstract

Speech emotion recognition (SER) studies the formation and change of a speaker's emotional state from his or her speech signal. The main purpose of this field is to produce a convenient system that can effortlessly communicate and interact with humans. The reliability of current speech emotion recognition systems, however, is still far from adequate. SER is a challenging task because of the gap between acoustic features and human emotions, and its performance relies strongly on the discriminative acoustic features extracted for a given recognition task. Deep learning techniques have recently been proposed as an alternative to traditional techniques in SER. In this paper, an overview of deep learning techniques that can be used in speech emotion recognition is presented. Extracted features such as MFCC, as well as feature classification methods such as HMM, GMM, LSTM, and ANN, are discussed. The review also covers the databases used, the emotions extracted, and the contributions made toward speech emotion recognition.
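As background for the feature-extraction step the abstract surveys, the following is a minimal sketch of MFCC computation in plain NumPy (framing, mel filterbank, log, DCT). The constants (frame length, hop, filter count) are illustrative assumptions, not values taken from the paper; a production system would typically use a library such as librosa instead.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters evenly spaced on the mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512,
         n_filters=26, n_coeffs=13):
    # 1. Frame the signal and apply a Hamming window
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * np.hamming(frame_len)
                       for i in range(n_frames)])
    # 2. Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 3. Apply the mel filterbank and take the log
    log_mel = np.log(power @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    # 4. DCT-II to decorrelate; keep the first n_coeffs coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), (2 * n + 1) / (2.0 * n_filters)))
    return log_mel @ dct.T

# Example: one second of a synthetic 440 Hz tone at 16 kHz
sr = 16000
t = np.arange(sr) / sr
feats = mfcc(np.sin(2 * np.pi * 440 * t), sr=sr)
print(feats.shape)  # one 13-dimensional feature vector per frame
```

The resulting per-frame coefficient matrix is the kind of input that the HMM, GMM, and neural-network classifiers surveyed in this review operate on.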

Keywords
