Document Type: Review Article

Authors

Abstract

Variations of speech parameters due to emotion or stress are noticeable. In the presence of such variations, if a neutral model is used, speech recognition accuracy deteriorates. Evaluating how emotion influences speech parameters is the first step towards emotional speech recognition. Pitch frequency is an important parameter in speech processing systems; therefore, in this research, the effect of emotion on pitch frequency and its slope is explored for voiced phonemes. In addition, the influence of emotional state on continuous speech recognition performance is evaluated. The results show that recognition performance deteriorates most for sentences in angry and happy states and for interrogative sentences: the deterioration exceeds 68% relative to neutral speech recognition accuracy. To improve recognition results, we append pitch frequency information to the speech recognizer's feature vector. The amount of improvement depends on the type of emotion and on which pitch information is added. The results show that pitch frequency slope has a significant effect on improving speech recognition accuracy, even for neutral speech.
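The augmentation described above (appending pitch frequency and its slope to each frame's feature vector) can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: a simple autocorrelation pitch estimator stands in for the tracker used in the study, and `augment_features` is a hypothetical helper name; the MFCC matrix is assumed to be given.

```python
import numpy as np

def frame_pitch_autocorr(frame, sr, fmin=60.0, fmax=400.0):
    """Estimate the F0 of one voiced frame by autocorrelation
    (a simple stand-in for a production pitch tracker)."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # plausible lag range
    lag = lo + np.argmax(ac[lo:hi])           # strongest periodicity
    return sr / lag

def augment_features(mfcc, pitch):
    """Append pitch and its slope (frame-to-frame derivative) as two
    extra columns at the end of each frame's feature vector."""
    slope = np.gradient(pitch)
    return np.hstack([mfcc, pitch[:, None], slope[:, None]])
```

For example, a 13-dimensional MFCC stream becomes 15-dimensional after augmentation, and the recognizer's acoustic models are then retrained on the extended vectors.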

Keywords
