SPEECH RECOGNITION, SYNTHESIS AND COMPRESSION

V.N.Sorokin
Institute for Information Transmission Problems,
Russian Academy of Sciences
Bolshoy Karetny 19, Moscow 101447, GSP-4, RUSSIA
E-mail: vns@iitp.ru, tel. 7-095-299-5096, fax 7-095-209-0579

1. Introduction
Speech technology has achieved a measure of success, especially in speech compression. However, existing methods of speech processing do not actually use the fundamental properties of the speech signal, which is specifically designed for transmitting information from one human to another. As a result, the areas of application of existing speech systems are much narrower than the potential ones. In fact, the reserves for further improvement in speech recognition, synthesis and even speech compression are exhausted. In particular, low-bit-rate vocoders do not transmit voice individuality, and speech intelligibility drops when vocoders are connected in sequence. The intelligibility of synthetic speech drops rapidly in the presence of even slight noise, or when the listener's attention is divided between listening and some other task. Speech recognition systems are very unstable with respect to noise and distortion of the speech signal. A minimally acceptable recognition rate can usually be achieved only after long mutual adaptation between human and machine.

Speech research has been conducted at the Institute for Information Transmission Problems since 1962. At present, the speech group consists of 9 researchers, including two professors in physics and mathematics, two Ph.D. researchers, three programmers and two linguists. The main direction of study is the application of mathematical models of speech production and of the code structure of speech to speech recognition, synthesis and compression. The results of the physiological study of speech, together with mathematical models of the mechanics, aerodynamics and acoustics of speech, were presented in the monograph "Theory of Speech Production" by V.N.Sorokin, 1985, Radio and Telecommunication, Moscow (in Russian).

2. Speech synthesis
The mathematical models of speech production were implemented in an articulatory speech synthesizer. Auditory tests of the synthesizer showed a syllable intelligibility of about 100% and a syllable naturalness of about 85%. The number of parameters in the synthesizer provides a wide variety of individual voices. This kind of synthesis may be used both in text-to-speech synthesizers and in an articulatory vocoder. The results of the speech synthesis study are presented in the monograph "Speech Synthesis" by V.N.Sorokin, 1992, Science, Moscow (in Russian).

3. Speech recognition
Over the last decade, a fundamentally new concept of speech recognition has been developed, based on the properties of speech production and speech perception. The information transmitted by the speech signal is protected against channel noise, distortion and individual variability of articulation by frequency, amplitude and temporal modulation of acoustic parameters, and also by the complex code structure of speech. Developing an effective speech recognition system therefore amounts to finding adequate demodulation and decoding procedures. The study identified primary (non-specific) detectors of frequency-temporal heterogeneity in the speech signal. These detectors are used to construct specific detectors of articulatory events, i.e., detectors of the transition from one articulatory state to another. The analysis of speech dynamics can be supplemented with the solution of the inverse problem with respect to vocal tract shape. The speech signal is encoded in terms of articulatory states, and speech recognition is the result of decoding by methods developed in the theory of error-correcting codes. It was shown theoretically that the potential speech recognition rate in a noisy environment may be even better than the intelligibility demonstrated by humans.

A preliminary evaluation of the approach was carried out on a voice dialing task, using the Texas Instruments TI-46 database with 16 speakers and a vocabulary of 20 separately spoken words. An average recognition rate of about 95% was obtained in the speaker-independent mode. The system is being developed further for a voice dialing task in Russian. A total of 52 words, spoken many times in isolated, separated and continuous modes, were recorded from 48 speakers using 5 types of microphones and telephone handsets. Signals distorted by a telephone channel are also used. A theoretical estimate of the potential recognition rate for the system under study in the speaker-independent, channel-immune mode is about 99%. The main principles of the system under development are described in Russian Patent N2047912, with a priority date of 20 April 1994.
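
The idea of recognition as decoding can be illustrated with a minimal sketch. It assumes that each vocabulary word is a codeword over an alphabet of articulatory states and that the detectors output a possibly corrupted state sequence, which is then mapped to the nearest codeword. The state labels, the toy codebook and the use of edit distance as the metric are all assumptions made for illustration; they are not taken from the system described above.

```python
# Hypothetical illustration of minimum-distance decoding over articulatory
# state sequences. Labels and codebook are invented for this example.

def edit_distance(a, b):
    """Levenshtein distance between two sequences of state labels."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # drop a state from a
                            curr[j - 1] + 1,      # insert a state from b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def decode(observed, codebook):
    """Return the word whose state sequence is closest to `observed`."""
    return min(codebook, key=lambda word: edit_distance(observed, codebook[word]))


# Toy codebook: placeholder state sequences, not a real articulatory inventory.
CODEBOOK = {
    "one": ["glide", "vowel", "nasal"],
    "two": ["stop", "burst", "vowel"],
    "six": ["fric", "vowel", "stop", "burst", "fric"],
}

# Detector output in which the "burst" event was missed.
observed = ["fric", "vowel", "stop", "fric"]
print(decode(observed, CODEBOOK))
```

In this toy case a single missed event still leaves the observed sequence closer to one codeword than to the others, which is the property an error-correcting code relies on.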

4. Speech compression
Theoretically, the ultimate compression of the speech signal that preserves speaker individuality can be achieved by solving the so-called inverse problem for the vocal tract. The problem is ill-posed since, in general, there is no guarantee of a stable and unique solution. Constraints and optimality criteria were found that provide stable solutions for steady-state segments of the speech signal and for inversion from articulatory movements to articulatory control commands. It was also found that solving the general inverse problem (from acoustics to controls) requires some preliminary recognition of the type of speech segment. A preliminary estimate of the potential bit rate without statistical reduction is about 2500 bit/s.
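
Why an ill-posed problem needs constraints can be seen in a small numerical sketch. It uses Tikhonov regularization, a standard stabilizing technique for ill-posed problems, on a nearly singular 2x2 linear system. Both the toy system and the choice of this particular technique are assumptions for illustration; the constraints and optimality criteria actually used for the vocal tract are not specified here.

```python
# Toy ill-posed problem: a nearly singular 2x2 system A x = y.
# This is not a vocal tract model, only an illustration of instability.

def solve_2x2(m, rhs):
    """Solve a 2x2 linear system by Cramer's rule."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [(rhs[0] * d - b * rhs[1]) / det,
            (a * rhs[1] - rhs[0] * c) / det]


def tikhonov_2x2(A, y, lam):
    """Minimise ||A x - y||^2 + lam * ||x||^2 via the normal equations."""
    (a, b), (c, d) = A
    # A^T A + lam * I
    m = [[a * a + c * c + lam, a * b + c * d],
         [a * b + c * d, b * b + d * d + lam]]
    # A^T y
    rhs = [a * y[0] + c * y[1], b * y[0] + d * y[1]]
    return solve_2x2(m, rhs)


A = [[1.0, 1.0], [1.0, 1.0001]]
y_clean = [2.0, 2.0001]   # exact solution is close to x = [1, 1]
y_noisy = [2.0, 2.0002]   # the same data with a 1e-4 perturbation

# Direct inversion: a tiny change in the data moves the solution by about 1.
direct_clean = solve_2x2(A, y_clean)
direct_noisy = solve_2x2(A, y_noisy)

# Regularized inversion: both solutions stay close to each other.
lam = 1e-3  # regularization weight, chosen by hand for this example
reg_clean = tikhonov_2x2(A, y_clean, lam)
reg_noisy = tikhonov_2x2(A, y_noisy, lam)

print("direct:     ", direct_clean, direct_noisy)
print("regularized:", reg_clean, reg_noisy)
```

The penalty term plays the role of a constraint: it trades a small bias in the solution for stability with respect to errors in the data.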

The results of the inverse problem study are published in 12 papers. The most recent ones are:
V.N.Sorokin, A.V.Trushkin, and A.S.Leonov (2000), "Estimation of Stability and Accuracy of Inverse Problem Solution for the Vocal Tract", Speech Communication, v. 30, N1, pp. 55-74.
A.S.Leonov and V.N.Sorokin (2000), "Inverse problem for the vocal tract: Identification of control forces from the articulatory movements", Pattern Recognition and Image Processing, v. 10, N1, pp. 110-126.