Robust Parameterizations for Automatic Speech Recognition (ASR) over Communications Networks

Diego Ferney Gómez Cajas; Franklin Alexander Sepúlveda Sepúlveda; Mario Augusto Pinto Serrano

Authors

Diego Ferney Gómez Cajas, Universidad Antonio Nariño - Ingeniería Biomédica
Franklin Alexander Sepúlveda Sepúlveda, Universidad Industrial de Santander
Mario Augusto Pinto Serrano, Universidad Nacional de Colombia

Keywords:

ASR, speech coding, CELP coders, packet networks, VoIP, transmission errors, packet loss, noise, mobile networks, UMTS, LTE

Abstract

In this paper we address the problem of Automatic Speech Recognition (ASR) when the speech signal has been transmitted over a communications network. Under these conditions, the main sources of distortion for an ASR system are ambient noise, transmission errors, and the encoding-decoding process [32]. The literature offers multiple solutions to this problem from different points of view; in this paper, however, we focus on solutions based on parameterizations that are robust to these distortions.

References

Atal, B. S. (1974). Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification. The Journal of the Acoustical Society of America, 55, 1304.

Atal, B. S., Cox, R. V., & Kroon, P. (1989, May). Spectral quantization and interpolation for CELP coders. In Acoustics, Speech, and Signal Processing, 1989. ICASSP-89., 1989 International Conference on (pp. 69-72). IEEE.

Bessette, B., Salami, R., Lefebvre, R., Jelinek, M., Rotola-Pukkila, J., Vainio, J., ... & Jarvinen, K. (2002). The adaptive multirate wideband speech codec (AMR-WB). Speech and Audio Processing, IEEE Transactions on, 10(8), 620-636.

Carlson, A. B., & Contreras, J. R. S. (1980). Sistemas de comunicación. McGraw-Hill.

Chen, C.-P., Bilmes, J., & Ellis, D. P. W. (2005, March). Speech feature smoothing for robust ASR. In Acoustics, Speech, and Signal Processing, 2005. Proceedings. (ICASSP '05). IEEE International Conference on (Vol. 1, pp. 525-528). IEEE.

De Vicente Peña, J. (2007). Contribuciones al reconocimiento robusto de habla.

Euler, S., & Zinke, J. (1994, April). The influence of speech coding algorithms on automatic speech recognition. In Acoustics, Speech, and Signal Processing, 1994. ICASSP-94., 1994 IEEE International Conference on (Vol. 1, pp. I-621). IEEE.

Fant, G. (1982). The voice source-acoustic modeling. STL-QPSR, 4, 28-48.

Flynn, R., & Jones, E. (2010). Robust distributed speech recognition in noise and packet loss conditions. Digital Signal Processing, 20(6), 1559-1571.

Furui, S. (1986). Speaker-independent isolated word recognition using dynamic features of speech spectrum. Acoustics, Speech and Signal Processing, IEEE Transactions on, 34(1), 52-59.

Gallardo-Antolín, A., Peláez-Moreno, C., & Díaz-de-María, F. (2005). Recognizing GSM digital speech. Speech and Audio Processing, IEEE Transactions on, 13(6), 1186-1205.

Gómez, A. M., Peinado, A. M., Sánchez, V., & Rubio, A. J. (2006). Recognition of coded speech transmitted over wireless channels. Wireless Communications, IEEE Transactions on, 5(9), 2555-2562.

Gómez-Cajas, D. F. (2011). Contribuciones al reconocimiento robusto de habla en redes de comunicaciones mediante transparametrización: tesis doctoral (Doctoral dissertation, Universidad Carlos III de Madrid).

Gómez-Cajas, D. F., Peláez-Moreno, C., & Díaz-de-María, F. (2003). Reconocimiento robusto de habla en redes IP. In Actas de las XIII Jornadas de I+D en Telecomunicaciones (TELECOM I+D 03), Madrid, España.

Gómez-Cajas, D. F., Peláez-Moreno, C., & Díaz-de-María, F. (2003). Reconocimiento robusto de habla en entornos IP. In Proceedings of the International Conference on Internet Technologies, Popayán, Colombia.

Gómez-Cajas, D. F., Peláez-Moreno, C., & Díaz-de-María, F. (2012, November). UEP-driven extended feature extraction for ASR over 3G speech channels. In Circuits and Systems (CWCAS), 2012 IEEE 4th Colombian Workshop on (pp. 1-5). IEEE.

Goode, B. (2002). Voice over internet protocol (VoIP). Proceedings of the IEEE, 90(9), 1495-1517.

Toskala, A., & Holma, H. (Eds.). (2001). WCDMA for UMTS: Radio Access for Third Generation Mobile Communications. Wiley.

Huerta, J. M., & Stern, R. M. (1998, November). Speech recognition from GSM codec parameters. In ICSLP.

ITU - Rec. G.711 (1993). Pulse code modulation (PCM) of voice frequencies. International Telecommunications Union, February.

ITU - Rec. G.729 (1996). Coding of speech at 8 kbit/s using conjugate structure algebraic-code-excited linear-prediction (CS-ACELP).

ITU - Rec. G.723 (1996). Dual rate speech coder for multimedia communications transmitting at 5.3 and 6.3 kbit/s.

Kim, H. K., & Cox, R. V. (2001). A bitstream-based front-end for wireless speech recognition on IS-136 communications system. Speech and Audio Processing, IEEE Transactions on, 9(5), 558-568.

Kim, H. K., Kim, K. C., & Lee, H. S. (1993). Enhanced distance measure for LSP-based speech recognition. Electronics Letters, 29(16), 1463-1465.

Kondoz, A. M. (2005). Digital speech: coding for low bit rate communication systems. Wiley.

Lieberman, P. (1988). Speech physiology, speech perception, and acoustic phonetics. Cambridge University Press.

Lilly, B. T., & Paliwal, K. K. (1996, October). Effect of speech coders on speech recognition performance. In Spoken Language, 1996. ICSLP 96. Proceedings., Fourth International Conference on (Vol. 4, pp. 2344-2347). IEEE.

Makhoul, J. (1975). Linear prediction: A tutorial review. Proceedings of the IEEE, 63(4), 561-580.

Milner, B., & Semnani, S. (2000). Robust speech recognition over IP networks. In Acoustics, Speech, and Signal Processing, 2000. ICASSP'00. Proceedings. 2000 IEEE International Conference on (Vol. 3, pp. 1791-1794). IEEE.

Müller, J., & Baly, W. (1848). The Physiology of the Senses, Voice, and Muscular Motion, with the Mental Faculties... Taylor, Walton & Maberly.

Nishimura, Y., Shinozaki, T., Iwano, K., & Furui, S. (2004). Noise-robust speech recognition using multi-band spectral features. The Journal of the Acoustical Society of America, 116, 2480.

Peláez-Moreno, C. (2002). Reconocimiento de habla mediante transparametrización: una alternativa robusta para entornos móviles e IP (Doctoral dissertation, Universidad Carlos III de Madrid).

Peláez-Moreno, C., Gallardo-Antolín, A., & Díaz-de-María, F. (2001). Recognizing voice over IP: A robust front-end for speech recognition on the World Wide Web. Multimedia, IEEE Transactions on, 3(2), 209-218.

Peláez-Moreno, C., Gallardo-Antolín, A., Gómez-Cajas, D. F., & Díaz-de-María, F. (2006). A comparison of front-ends for bitstream-based ASR over IP. Signal Processing, 86(7), 1502-1508.

Peláez-Moreno, C., Gallardo-Antolín, A., & Díaz-de-María, F. (2001, June). Recognizing voice over IP: A robust front-end for speech recognition on the World Wide Web. Multimedia, IEEE Transactions on, 3(2), 209-218.

Perkins, C., Hodson, O., & Hardman, V. (1998). A survey of packet loss recovery techniques for streaming audio. Network, IEEE, 12(5), 40-48.

Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257-286.

Rabiner, L., & Juang, B. H. (1993). Fundamentals of speech recognition.

Rose, R. C., Parthasarathy, S., Gajic, B., Rosenberg, A. E., & Narayanan, S. (2001). On the implementation of ASR algorithms for hand-held wireless mobile devices. In Acoustics, Speech, and Signal Processing, 2001. Proceedings. (ICASSP’01). 2001 IEEE International Conference on (Vol. 1, pp. 17-20). IEEE.

Schroeder, M., & Atal, B. S. (1985, April). Code-excited linear prediction (CELP): High-quality speech at very low bit rates. In Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP’85. (Vol. 10, pp. 937-940). IEEE.

Smith, J. O. (2008). Spectral Audio Signal Processing, Draft. Online: http://ccrma.stanford.edu/~jos/sasp/.

Soong, F., & Juang, B. (1984, March). Line spectrum pair (LSP) and speech data compression. In Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP’84. (Vol. 9, pp. 37-40). IEEE.

Stevens, S. S., Volkmann, J., & Newman, E. B. (1937). A scale for the measurement of the psychological magnitude pitch. The Journal of the Acoustical Society of America, 8, 185.

Sugamura, N., & Itakura, F. (1986). Speech analysis and synthesis methods developed at ECL in NTT—From LPC to LSP—. Speech Communication, 5(2), 199-215.

Sun, L., Wade, G., Lines, B., & Ifeachor, E. (2001, April). Impact of packet loss location on perceived speech quality. In 2nd IP-Telephony Workshop (pp. 114-122).

Toga, J., & Ott, J. (1999). ITU-T standardization activities for interactive multimedia communications on packet-based networks: H.323 and related recommendations. Computer Networks, 31(3), 205-223.

Tyagi, V., McCowan, I., Misra, H., & Bourlard, H. (2003, November). Mel-cepstrum modulation spectrum (MCMS) features for robust ASR. In Automatic Speech Recognition and Understanding, 2003. ASRU '03. 2003 IEEE Workshop on (pp. 399-404). IEEE.

Vicente-Peña, J., Gallardo-Antolín, A., Peláez-Moreno, C., & Díaz-de-María, F. (2006). Band-pass filtering of the time sequences of spectral parameters for robust wireless speech recognition. Speech Communication, 48(10), 1379-1398.

Zheng, F., Zhang, G., & Song, Z. (2001). Comparison of different implementations of MFCC. Journal of Computer Science and Technology, 16(6), 582-589.

Zhou, J., Wu, T., & Leng, J. (2010, June). Research on voice codec algorithms of SIP phone based on embedded system. In Wireless Communications, Networking and Information Security (WCNIS), 2010 IEEE International Conference on (pp. 183-187). IEEE.

Ziemer, R. E., Tranter, W. H., Buehrer, R. M., & Rappaport, T. S. (2000). Mobile Radio Communications. John Wiley & Sons, Inc.

3GPP TSG-RAN, «3GPP TR 25.814, Physical Layer Aspects for Evolved UTRA (Release 7)», v1.3.1 (2006-05).

3GPP TS 25.212, «Multiplexing and channel coding (FDD)». V6.2.0. 2004-06.

3GPP TS 25.211, «Physical channels and mapping of transport channels onto physical channels (FDD)». V6.1.0. 2004-06.
