Document Type: Review Article

Authors

Abstract

The speech spectrum is very sensitive to the linear predictive coding (LPC) parameters, so small quantization errors may yield an unstable synthesis filter. Line spectral pairs (LSPs) are a more efficient representation than LPC parameters. Artificial neural networks (ANNs), in turn, have been used successfully to improve the quality and reduce the computational complexity of speech coders. This work proposes an efficient technique to reduce the bit rate of the FS-1015 speech coder while improving its performance. To this end, LSP parameters are used instead of LPC parameters. In addition, neural vector quantizers based on the Kohonen self-organizing feature map (KSOFM), trained with a modified supervised algorithm, and on fuzzy ARTMAP are employed to reduce the bit rate. With the KSOFM-based and fuzzy ARTMAP-based vector quantizers, the quality of the synthesized speech, in terms of mean opinion score (MOS), improves by 0.13 and 0.26, respectively. The execution time of the proposed models, compared to the FS-1015 standard, is reduced by 27% and 43%, respectively.
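The replacement of LPC parameters by LSP parameters can be illustrated with a minimal sketch. It assumes the textbook construction, in which the line spectral frequencies are the angles of the unit-circle roots of the sum and difference polynomials P(z) = A(z) + z^-(p+1) A(1/z) and Q(z) = A(z) - z^-(p+1) A(1/z). The function name `lpc_to_lsf`, the tolerance `eps`, and the second-order test filter are hypothetical choices made for illustration; this is not necessarily the conversion routine used in the work summarized above.

```python
import numpy as np

def lpc_to_lsf(a, eps=1e-6):
    """Convert LPC coefficients [1, a1, ..., ap] to line spectral frequencies.

    Assumes A(z) = 1 + a1 z^-1 + ... + ap z^-p is minimum phase, so that
    all roots of P(z) and Q(z) lie on the unit circle.
    Returns the p frequencies in radians, sorted in ascending order.
    """
    a = np.asarray(a, dtype=float)
    a_ext = np.append(a, 0.0)        # [1, a1, ..., ap, 0]
    p_poly = a_ext + a_ext[::-1]     # symmetric polynomial P(z)
    q_poly = a_ext - a_ext[::-1]     # antisymmetric polynomial Q(z)
    roots = np.concatenate([np.roots(p_poly), np.roots(q_poly)])
    angles = np.angle(roots)
    # Keep one root of each conjugate pair and drop the trivial roots at z = +1 and z = -1.
    keep = (angles > eps) & (angles < np.pi - eps)
    return np.sort(angles[keep])

# Hypothetical stable second-order filter with poles at 0.5 and 0.4.
lsf = lpc_to_lsf([1.0, -0.9, 0.2])
print(lsf)
```

Because the resulting frequencies are ordered and confined to (0, π), a quantizer can check and restore that ordering after quantization, which is one common reason LSPs are preferred over raw LPC coefficients. For a tenth-order predictor such as the one in FS-1015, the same function would return ten frequencies.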

Keywords
