1. Introduction
Speech technology has achieved some success, especially in speech compression.
However, existing speech processing methods make little use of the fundamental
properties of the speech signal, which is specially designed for transmitting
information from one human to another. As a result, the areas of application of
existing speech systems are much narrower than the potential ones. In fact, the
resources for further improvement in speech recognition, synthesis and even speech
compression are exhausted. In particular, low-bit-rate vocoders do not transmit
voice individuality, and speech intelligibility drops when vocoders are connected
in series. The intelligibility of synthetic speech drops rapidly in the presence of
even slight noise, or when the listener's attention is divided between listening
and some other task. Speech recognition systems are very unstable with respect to
noise and speech signal distortion. A minimally acceptable speech recognition rate
can usually be achieved only after long mutual adaptation between human and
machine.
Speech research has been conducted at the Institute for Information Transmission Problems since 1962. At present, the speech group consists of nine researchers: two professors in physics and mathematics, two Ph.D. researchers, three programmers and two linguists. The main direction of study is the application of mathematical models of speech production and of the code structure of speech to speech recognition, synthesis and compression. The results of physiological studies of speech, together with mathematical models of the mechanics, aerodynamics and acoustics of speech, were presented in the monograph "Theory of Speech Production" by V.N.Sorokin, 1985, Radio and Telecommunication, Moscow (in Russian).
2. Speech synthesis
The mathematical models of speech production were implemented in an articulatory
speech synthesizer. Auditory tests of the synthesizer showed a syllable
intelligibility of about 100% and a syllable naturalness of about 85%. The number of
parameters in the synthesizer provides a wide variety of individual voices. This kind
of synthesis may be used both in text-to-speech synthesizers and in an
articulatory vocoder. The results of the speech synthesis study are presented in the
monograph "Speech Synthesis" by V.N.Sorokin, 1992, Science, Moscow (in Russian).
3. Speech recognition
Over the last decade, a fundamentally new concept of speech recognition has been
developed, based on the properties of speech production and speech perception.
Information transmitted by the speech signal is protected from channel noise,
distortion and individual variability of articulation by frequency, amplitude
and temporal modulation of acoustical parameters, and also by the complex code
structure of speech. Developing an effective speech recognition system therefore
amounts to finding adequate demodulation and decoding procedures. The study
identified primary (non-specific) detectors of frequency-temporal heterogeneity
in the speech signal. These detectors are used to construct specific detectors of
articulatory events, i.e., detectors of the transition from one articulatory state
to another. The analysis of speech dynamics can be supplemented with the solution
of the inverse problem for the vocal tract shape. The speech signal is encoded in
terms of articulatory states, and speech recognition is the result of decoding by
methods developed in the theory of error-correcting codes. It was shown
theoretically that the potential speech recognition rate in a noisy environment
may be even better than the intelligibility demonstrated by humans.
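The decoding step described above can be illustrated with a minimal sketch of
minimum-distance decoding, the basic idea behind error-correcting codes. This is
not the system described here: the codebook, the state symbols, the word names and
the use of a plain Hamming distance over fixed-length sequences are all assumptions
made for illustration.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical codebook: each vocabulary word maps to a fixed-length sequence of
# articulatory-state symbols. Words and symbols are invented for illustration.
CODEBOOK: Dict[str, Tuple[str, ...]] = {
    "alpha": ("A", "B", "A", "C", "D"),
    "beta": ("B", "B", "C", "C", "A"),
    "gamma": ("C", "A", "D", "A", "B"),
}


def hamming_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def decode(observed: Sequence[str],
           codebook: Dict[str, Tuple[str, ...]] = CODEBOOK) -> Tuple[str, int]:
    """Return (word, distance) for the codeword nearest to the observed sequence."""
    best = min(codebook, key=lambda w: hamming_distance(observed, codebook[w]))
    return best, hamming_distance(observed, codebook[best])


# One detector error: the last state of "alpha" was observed as "A" instead of "D".
word, errors = decode(("A", "B", "A", "C", "A"))
```

In this toy codebook any two codewords differ in at least three positions, so a
single wrongly detected state is always corrected. A practical decoder would also
have to handle inserted and deleted states, which a Hamming distance cannot do.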
A preliminary evaluation of the approach was carried out on a voice dialing task.
The Texas Instruments TI-46 database, with 16 speakers and a vocabulary of 20
words spoken in isolation, was used. An average recognition rate of about 95% was
obtained in the speaker-independent mode. The system is being developed further
for a voice dialing task in Russian. A set of 52 words, spoken many times in the
isolated, separated and continuous modes, was recorded from 48 speakers using 5
types of microphones and telephone handsets. Signals distorted by a telephone
channel are also used. The theoretical estimate of the potential recognition rate
of the system in the speaker-independent, channel-immune mode is about 99%.
The main principles of the system under development are described in Russian
Patent N2047912, with a priority date of 20 April 1994.
4. Speech compression
Theoretically, the ultimate compression of the speech signal that preserves speaker
individuality can be achieved by solving the so-called inverse problem for
the vocal tract. The problem is ill-posed since, in general, there is no guarantee
of a stable and unique solution. Constraints and optimality criteria were found
that provide stable solutions for steady-state segments of the speech signal and
for the inversion from articulatory movements to articulatory control commands. It
was also found that solving the general inverse problem (from acoustics to
controls) requires some preliminary recognition of the type of speech segment.
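To show why an ill-posed inverse problem needs additional constraints, the sketch
below applies Tikhonov regularization, a standard generic technique, to a toy
two-parameter linear system. It is not the set of constraints and criteria found in
the study: the matrix `A`, the data vectors, the weight `lam` and the function name
`tikhonov_2d` are all hypothetical.

```python
from typing import Sequence, Tuple


def tikhonov_2d(A: Sequence[Sequence[float]], b: Sequence[float], lam: float,
                x0: Tuple[float, float] = (0.0, 0.0)) -> Tuple[float, float]:
    """Minimise ||A x - b||^2 + lam * ||x - x0||^2 for a two-parameter x."""
    # Normal-equation matrix M = A^T A + lam * I (symmetric 2x2).
    m11 = sum(r[0] * r[0] for r in A) + lam
    m12 = sum(r[0] * r[1] for r in A)
    m22 = sum(r[1] * r[1] for r in A) + lam
    # Right-hand side v = A^T b + lam * x0.
    v1 = sum(r[0] * bi for r, bi in zip(A, b)) + lam * x0[0]
    v2 = sum(r[1] * bi for r, bi in zip(A, b)) + lam * x0[1]
    det = m11 * m22 - m12 * m12
    if det == 0.0:
        raise ValueError("singular system; increase lam")
    # Solve the 2x2 system M x = v by Cramer's rule.
    return (v1 * m22 - m12 * v2) / det, (m11 * v2 - m12 * v1) / det


# Toy, nearly singular forward model (an assumption, not a vocal tract model).
A = [[1.0, 1.0], [1.0, 1.0001]]
b_noisy = [2.0, 2.0002]  # exact data for x = (1, 1) would be [2.0, 2.0001]

x_raw = tikhonov_2d(A, b_noisy, lam=0.0)   # unconstrained least squares
x_reg = tikhonov_2d(A, b_noisy, lam=1e-3)  # regularised solution
```

In this example a data error of 0.0001 moves the unconstrained solution from
(1, 1) to roughly (0, 2), while the regularised solution stays close to (1, 1).
The price is a small bias toward the reference point `x0`.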
A preliminary estimate of the potential bit rate without statistical reduction is
about 2500 bit/s.
The results of the inverse problem study have been published in 12 papers. The most recent are:
V.N.Sorokin, A.V.Trushkin, and A.S.Leonov (2000), "Estimation of Stability and
Accuracy of Inverse Problem Solution for the Vocal Tract", Speech Communication,
v. 30, N1, pp. 55-74.
A.S.Leonov and V.N.Sorokin (2000), "Inverse problem for the vocal tract:
Identification of control forces from the articulatory movements", Pattern
Recognition and Image Processing, v. 10, N1, pp. 110-126.