Document Type: Review Article

Authors

Assistant Professor

Abstract

Emotion plays an important role in the naturalness of man-machine communication, so computerized emotion recognition from speech has been investigated by many researchers in recent decades. In this paper, the effect of formant-related features on the performance of emotion detection systems is investigated experimentally. To this end, various forms and combinations of the first three formants are concatenated to a popular feature vector, and Gaussian mixture models (GMMs) are used as classifiers. Experimental results show an average recognition rate of 69% over four emotional states, with a noticeable performance improvement when only one formant-related parameter is added to the feature vector. A hybrid emotion recognition/spotting architecture based on the developed models is also proposed.
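As an illustration of the classification scheme the abstract describes, the following Python sketch trains one Gaussian mixture model per emotion on frame-level feature vectors (a base vector with formant-related values appended) and labels an utterance by maximum average log-likelihood. This is a minimal sketch, not the authors' implementation: the emotion labels, feature dimensions, mixture count, and the synthetic training data are all illustrative assumptions.

```python
# Sketch (not the paper's code) of per-emotion GMM classification:
# one GMM per emotion class, decision by highest mean log-likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture

EMOTIONS = ["neutral", "anger", "happiness", "sadness"]  # assumed label set
N_MIX = 8          # assumed number of mixture components per class
BASE_DIM = 12      # assumed size of the base (e.g., MFCC-like) feature vector
N_FORMANTS = 3     # first three formants, appended as extra features

rng = np.random.default_rng(0)

def train_gmms(frames_by_emotion):
    """Fit one diagonal-covariance GMM per emotion on its (n_frames, dim) matrix."""
    return {
        emo: GaussianMixture(n_components=N_MIX, covariance_type="diag",
                             random_state=0).fit(frames)
        for emo, frames in frames_by_emotion.items()
    }

def classify(gmms, utterance_frames):
    """Pick the emotion whose GMM gives the highest mean log-likelihood."""
    return max(gmms, key=lambda emo: gmms[emo].score(utterance_frames))

# Synthetic stand-in data; in the paper's setting these frames would be
# acoustic feature vectors with formant-related parameters concatenated.
train = {emo: rng.normal(loc=i, size=(500, BASE_DIM + N_FORMANTS))
         for i, emo in enumerate(EMOTIONS)}
gmms = train_gmms(train)

test_utterance = rng.normal(loc=2, size=(80, BASE_DIM + N_FORMANTS))
print(classify(gmms, test_utterance))  # -> "happiness" for this toy data
```

Swapping the synthetic matrices for real extracted features (with or without the appended formant values) is all that is needed to compare feature-vector variants under this scheme.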

Keywords
