FIELD OF INVENTION
[0001] This invention relates to speech recognition and, more particularly, to a speech recognition method using speech model parameters that depend on the acoustic environment.
BACKGROUND OF INVENTION
[0002] Speech recognition in different environments using Hidden Markov Models (HMMs) requires modeling the speech distribution in the given environment. It has often been observed that mismatched training and testing environments can lead to severe degradation in recognition performance. See the article by Yifan Gong entitled "Speech Recognition in Noisy Environments: A Survey" in Speech Communication, 16(3): pages 261-291, 1992. To achieve robust speech recognition in noise, different approaches have been proposed to deal with the mismatch. One of these is to use noisy speech during the training phase, which generalizes to multi-condition training, where available speech data collected in a variety of environments is used in model training. See the following references for more description.
[0003] Dautrich, B. A., Rabiner, L. R., and Martin, T. B. "On the Effect of Varying Filter Bank Parameters on Isolated Word Recognition", IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-31: 793-806, 1983.
[0004] Morii, S. T., Morii, T., and Hoshimmi, M. "Noise Robustness in Speaker Independent Speech Recognition", International Conference on Spoken Language Processing, pp. 1145-1148, 1990.
[0005] Furui, S. "Toward Robust Speech Recognition Under Adverse Conditions", ESCA Workshop Proceedings of Speech Processing in Adverse Conditions, pp. 31-41, 1992.
[0006] Vaseghi, S. V., Milner, B. P., and Humphries, J. J. "Noisy Speech Recognition Using Cepstral-Time Features and Spectral-Time Filters", ICASSP, pp. 925-928, 1994.
[0007] Mokbel, C. and Chollet, G. "Speech Recognition in Adverse Environments: Speech Enhancement and Spectral Transformations", ICASSP, pp. 925-928, 1991.
[0008] Lippman, R. P., Martin, E. A. and Paul, D. B. "Multi-style Training for Robust Isolated-Word Speech Recognition", ICASSP, pp. 705-708, 1987.
[0009] Blanchet, M., Boudy, J. and Lockwood, P. "Environment Adaptation for Speech Recognition in Noise", EUSIPCO, vol. VI, pp. 391-394, 1992.
[0010] Published Gaussian mixture hidden Markov modeling of speech uses multiple Gaussian distributions to cover the spread of the speech distribution caused by the noise. This approach has two problems.
[0011] First, since no noise model is incorporated and since the recognition accuracy is optimized only for the intensity characteristics of the training noise, recognition performance can be sensitive to the noise level.
[0012] Second, at recognition time, a speech signal can only be produced in one particular environment. However, for a given noisy environment, the distributions of all conditions, not only those corresponding to the given environment, are open to the search. The variety of the noisy speech distributions decreases the discrimination ability of the model. Therefore, the improvement in noisy speech recognition is obtained at the cost of the recognition rate for clean speech.
[0013] Because of these two problems, the modeling of speech events is degraded by the inefficient use of parameters, resulting in a loss of discrimination ability.
SUMMARY OF THE INVENTION
[0014] In accordance with one embodiment of the present invention, the modeling of speech signals uses a variable parameter Gaussian mixture HMM. The existing HMM is extended by allowing the HMM parameters to change as a function of a continuous variable that depends on the environment. At recognition time, a set of HMMs is instantiated corresponding to the given environment.
DESCRIPTION OF DRAWING
[0015] FIG. 1 is a variable parameter GMHMM training block diagram.
[0016] FIG. 2 is a variable parameter GMHMM recognition block diagram.
[0017] FIG. 3 is a variable parameter GMHMM regression function initialization block diagram.
[0018] FIG. 4 is a variable parameter GMHMM re-estimation block diagram.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
[0019] FIG. 1 is a block diagram showing the variable parameter GMHMM training module 11. The input signal is first converted to a sequence of feature vectors by the feature extraction block 13. The environment estimation block 15 estimates an environment variable based on the input speech signal. Using the estimated environment information, the variable parameter training algorithm in block 17 generates a variable parameter (VP) Gaussian Mixture Hidden Markov Model (GMHMM) from the speech feature vector sequence. This model is stored in a database 19.
[0020] FIG. 2 is a block diagram showing the variable parameter GMHMM recognition module 21. The input signal is applied to feature extraction block 22 and environment estimation block 23. At recognition time, environment estimation block 23 estimates the environment variable of the speech to be recognized and instantiates a set of GMHMMs 25 based on that variable. This set is used to conduct the recognition process at recognition 27.
[0021] The training algorithm of the variable parameter GMHMM contains two parts: one is the initialization of the GMHMM parameter functions, and the other is the re-estimation procedure based on the Expectation-Maximization (EM) algorithm. Referring to FIG. 3, in the function initialization step, a set of environment variable values is chosen that covers an adequate number of different environment conditions. This set of environment variable values is representative of a wide range of environments.
[0022] In particular, signal-to-noise ratio (SNR) can be adopted as the variable that models the environment. In that case, the set of values could be different SNR levels. For each value in this set, a conventional GMHMM is trained. The resulting models are then regressed against the environment variable values by the parameter functions. The regression functions serve as the initial GMHMM parameter functions for the variable parameter GMHMM. The process in FIG. 3 starts with step 1, choosing a specific environment. Step 2 is performing conventional GMHMM training, and step 3 is storing the result in a database. Step 4 repeats these steps until enough environments have been stored. Step 5 is performing function regression on the GMHMM parameters with respect to the environment variables.
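The function regression of step 5 can be illustrated, for a single scalar mean component, by an ordinary least-squares polynomial fit. The following is only a minimal sketch under stated assumptions: the function names `solve_linear` and `fit_mean_polynomial`, the SNR levels, and the mean values are all hypothetical, and the text above allows any regression method, not this one in particular.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_mean_polynomial(env_values, means, order):
    """Least-squares fit of one mean component against environment values.

    Returns c[0..order] such that mean(v) ~= sum_j c[j] * v**j.
    """
    n = order + 1
    # Normal equations of the least-squares problem.
    a = [[sum(v ** (j + p) for v in env_values) for p in range(n)]
         for j in range(n)]
    b = [sum((v ** j) * mu for v, mu in zip(env_values, means))
         for j in range(n)]
    return solve_linear(a, b)


# Hypothetical data: one mean component of conventionally trained models
# at four SNR levels (dB). The values follow 1 + 0.01 * v**2 by construction.
snr_levels = [0.0, 10.0, 20.0, 30.0]
trained_means = [1.0, 2.0, 5.0, 10.0]
coeffs = fit_mean_polynomial(snr_levels, trained_means, order=2)
```

In a full system the same fit would be repeated for every dimension of every mixture component, and the environment variable would typically be rescaled to keep the normal equations well conditioned.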
[0023] The variable parameter re-estimation procedure is an Expectation-Maximization (EM) algorithm based on the maximum likelihood criterion. It is illustrated in FIG. 4 for the special case where a polynomial function is chosen to model the Gaussian mean function and SNR is chosen as the environment variable. For the input speech feature vector sequence, the SNR is estimated for each frame, and a specific set of GMHMM parameters is generated by substituting the current SNR value into the mean vector polynomial. The likelihoods of the feature vectors are computed using the newly generated models, followed by calculation of the forward and backward variables.
[0024] In a conventional HMM based recognizer, at state i, the emission probability density function is a multivariate Gaussian mixture distribution, which can be expressed as
$$p(o_t \mid s_t = i) = \sum_k \alpha_{i,k}\, b_{i,k}(o_t) = \sum_k \alpha_{i,k}\, N(o_t; \mu_{i,k}, \Sigma_{i,k}) \qquad (1)$$
[0025] where:
[0026] $o_t$ is the input vector at time t, in D-dimensional feature space.
[0027] $\mu_{i,k}$ is the mean vector of the kth mixture component at state i.
[0028] $\Sigma_{i,k}$ is the covariance matrix of the kth mixture component at state i.
[0029] $\alpha_{i,k} = \Pr(\xi_t = k \mid s_t = i)$ is the a priori probability of the kth mixture component at state i.
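Equation (1) can be evaluated directly once the mixture parameters of a state are known. The sketch below assumes diagonal covariance matrices for brevity, which the text does not require; the function names and the two-component example state are hypothetical.

```python
import math


def gaussian_diag(o, mean, var):
    """Density of a D-dimensional Gaussian with diagonal covariance."""
    log_p = 0.0
    for x, m, v in zip(o, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return math.exp(log_p)


def emission_density(o, weights, means, variances):
    """Equation (1) for one state: sum over k of alpha_k * N(o; mu_k, Sigma_k)."""
    return sum(a * gaussian_diag(o, m, v)
               for a, m, v in zip(weights, means, variances))


# Hypothetical state with two mixture components in a 2-dimensional space.
p = emission_density(
    [0.0, 0.0],                     # observation o_t
    [0.5, 0.5],                     # mixture weights alpha_{i,k}
    [[0.0, 0.0], [1.0, 1.0]],       # mean vectors mu_{i,k}
    [[1.0, 1.0], [1.0, 1.0]],       # diagonal variances of Sigma_{i,k}
)
```

A practical recognizer would work with log densities throughout to avoid underflow for high-dimensional feature vectors.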
[0030] In the VP-GMHMM, the observation mean vector is modeled as a polynomial function of the environment variable $\upsilon$:
$$\mu_{ik}(\upsilon) = \sum_{j=0}^{P_{ik}} c_{ikj}\, \upsilon^{j} \qquad (2)$$
[0031] where $P_{ik}$ is the order of the polynomial for the kth mixture component at state i.
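Instantiating a mean vector for a measured environment value amounts to evaluating the polynomial of equation (2) in each feature dimension. A minimal sketch follows; the function name `mean_at` and the coefficient values are hypothetical, and Horner's rule is simply one convenient way to evaluate the polynomial.

```python
def mean_at(coeffs, env):
    """Evaluate equation (2): mu(v) = sum_j c_j * v**j, per dimension.

    coeffs[j] is the D-dimensional coefficient vector for power j.
    """
    mean = [0.0] * len(coeffs[0])
    for c_j in reversed(coeffs):  # Horner's rule, highest power first
        mean = [m * env + c for m, c in zip(mean, c_j)]
    return mean


# Hypothetical second-order polynomial for a 2-dimensional mean vector.
coeffs = [[1.0, -2.0],   # c_0
          [0.5, 0.0],    # c_1
          [0.0, 0.25]]   # c_2
mu = mean_at(coeffs, 2.0)  # environment value v = 2.0
```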
[0032] Let $c_{ik}$ be the vector composed of the coefficients $c_{ikj}$, $j = 0, \ldots, P_{ik}$. The polynomial coefficients of the mean vector can be solved through the linear system of equations:
$$A_{ik}\, c_{ik} = b_{ik} \qquad (3)$$
[0033] where $A_{ik}$ is a $(P_{ik}+1) \times (P_{ik}+1)$ dimensional matrix:
$$A_{ik} = \begin{bmatrix} u_{ik}(0,0) & \cdots & u_{ik}(0,P_{ik}) \\ \vdots & u_{ik}(j,p) & \vdots \\ u_{ik}(P_{ik},0) & \cdots & u_{ik}(P_{ik},P_{ik}) \end{bmatrix}$$
[0034] where $u_{ik}(j,p)$ itself is a D by D matrix:
$$u_{ik}(j,p) = I_{ik}(\upsilon_r, \upsilon_r, j, p)$$
[0035] $b_{ik}$ is a $(P_{ik}+1)$ dimensional vector in D-dimensional space:
$$b_{ik} = [v_{ik}(0), \ldots, v_{ik}(j), \ldots, v_{ik}(P_{ik})]^{T}$$
[0036] where $v_{ik}(j)$ itself is a D dimensional vector:
$$v_{ik}(j) = I_{ik}(\upsilon_r, o_t^r, j, 1)$$
[0037] and $c_{ik}$ is a $(P_{ik}+1)$ dimensional vector in D-dimensional space:
$$c_{ik} = [c_{ik}(0), \ldots, c_{ik}(j), \ldots, c_{ik}(P_{ik})]^{T}$$
[0038] The components of the linear system equation have the form:
$$I_{ik}(x, y, m, n) = \sum_{r=1}^{R} \sum_{t=1}^{T_r} p(s_t^r = i, \xi_t^r = k \mid O^r, \bar{\lambda})\, \Sigma_{ik}^{-1}\, x^{m}\, y^{n},$$
[0039] where
[0040] $A_{ik}$ is composed of the powers of the environment variable, weighted by the count for state i and the kth Gaussian component and by the inverse of the covariance matrix;
[0041] $b_{ik}$ is composed of the product of powers of the observation and the environment variable, weighted by the count for state i and Gaussian mixture component k and by the inverse of the covariance matrix. The covariance matrix is estimated as the ratio of the expected covariance value, under the model parameters for the current environment variable in state i and the kth Gaussian, to the expected number of times of staying in state i and the kth Gaussian:
$$\Sigma_{ik} = \frac{\displaystyle \sum_{r=1}^{R} \sum_{t=1}^{T_r} p(s_t^r = i, \xi_t^r = k \mid O^r, \bar{\lambda}) \Big( o_t^r - \sum_{j=0}^{P_{ik}} c_{ikj} (\upsilon_r)^{j} \Big) \Big( o_t^r - \sum_{j=0}^{P_{ik}} c_{ikj} (\upsilon_r)^{j} \Big)^{T}}{\displaystyle \sum_{r=1}^{R} \sum_{t=1}^{T_r} p(s_t^r = i, \xi_t^r = k \mid O^r, \bar{\lambda})} \qquad (4)$$
[0042] In the above equations,
[0043] $R$ is the number of speech segments.
[0044] $T_r$ is the number of vectors of the rth segment.
[0045] $o_t^r$ is the tth vector of segment r.
[0046] $\upsilon_r$ is the environment measurement for the rth segment.
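For intuition, the mean and covariance updates can be written out for the simplest case of a scalar feature (D = 1) and a single state and mixture component. The sketch below is illustrative only: the function names, the segment data, and the posterior values are hypothetical, the posteriors are taken as given rather than computed by the forward and backward recursions, and the variance update is assumed to use the freshly solved coefficients.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def reestimate(segments, order, var):
    """One update of the mean polynomial and variance for one (i, k) pair.

    segments: list of (env, observations, posteriors), one entry per
    speech segment r; posteriors[t] stands for p(s_t = i, xi_t = k | O, model).
    """
    n = order + 1
    a = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for env, obs, post in segments:
        for o, g in zip(obs, post):
            for j in range(n):
                b[j] += g * env ** j * o / var            # v_ik(j)
                for p in range(n):
                    a[j][p] += g * env ** (j + p) / var   # u_ik(j, p)
    c = solve_linear(a, b)                                # equation (3)
    num = den = 0.0
    for env, obs, post in segments:
        mu = sum(cj * env ** j for j, cj in enumerate(c))
        for o, g in zip(obs, post):
            num += g * (o - mu) ** 2
            den += g
    return c, num / den                                   # equation (4)


# Hypothetical segments whose observations scatter around 1 + 2 * env.
segments = [
    (0.0, [0.5, 1.5], [1.0, 1.0]),
    (1.0, [2.5, 3.5], [0.5, 0.5]),
    (2.0, [4.5, 5.5], [1.0, 1.0]),
]
c, new_var = reestimate(segments, order=1, var=1.0)
```

With vector-valued features, each scalar accumulator above becomes a D by D matrix or a D dimensional vector, as described for $u_{ik}(j,p)$ and $v_{ik}(j)$.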
[0047] In the steps for speech recognition, the model parameters are permitted to change as a function of environment variables. In the training process, the environment dependent model parameters are estimated by the EM algorithm. In the signal-to-noise case, the effect of noise on speech modeling is determined, and this change is modeled as a function of signal-to-noise ratio (SNR). The function is taken to be a polynomial. All of the algorithms obtain model values from that polynomial. In the recognition process, a set of HMMs is instantiated according to the given environment. In the SNR case, for example, the SNR is measured and the polynomial is evaluated at that SNR. The resulting value of the polynomial is used in the recognition model.
[0048] In short, the Gaussian mean of the model is not fixed, as in previous HMMs, but is a function of the signal-to-noise ratio (SNR). This method of representing a parameter as a function of the environment can be applied to the mean vector, the covariance, the transition probabilities, or any other model parameter.
[0049] The model parameters may be any HMM parameters, such as mean, covariance, state transition probability, etc. The environment variable can be any quantity that gives some measurement of the environment; in particular, it can be the signal-to-noise ratio, the noise power, etc. Further, rather than a scalar variable, it could be an environment variable vector. The environment variable could be based on the whole utterance, on each phoneme, or even on each frame. The parameter functions could be any continuous function; in particular, a polynomial function, an exponential function, etc.
[0050] The training can be done in two steps: parameter function initialization and parameter re-estimation based on the EM algorithm. The parameter function initialization could use any regression method on the model parameters with respect to the environment variables.
[0051] In accordance with one embodiment of the present invention, when a polynomial function is used to describe the change of the mean vector, each of the following quantities is re-estimated based on the model instantiated by the parameter function and the corresponding environment variables: the initial state probability is re-estimated as the expected number of times in state i at time 1; the state transition probability is re-estimated as the ratio of the expected number of transitions from state i to state j to the expected number of transitions from state i; the mixture weight is estimated as the ratio of the expected number of times of staying in the kth Gaussian to the expected number of transitions from state i; the mean vector polynomial is estimated by solving a linear system of equations whose matrix components are the product of powers of two quantities, weighted by the count for state i and Gaussian mixture component k and by the inverse of the covariance; and the covariance is estimated as the ratio of the expected covariance in state i and the kth Gaussian mixture component to the expected number of times of staying in state i and the kth Gaussian.
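The mixture weight and transition probability updates described above are both ratios of expected counts. A minimal sketch follows, assuming the per-frame posteriors have already been computed from the model instantiated for the relevant environment values; the function name `ratio_of_expected_counts` and all numbers are hypothetical.

```python
def ratio_of_expected_counts(posteriors):
    """Normalize expected counts, accumulated over frames, into probabilities.

    posteriors[t][n] is the posterior of outcome n at frame t, where an
    outcome is either a mixture component k of state i or a destination
    state j reached from state i.
    """
    n_outcomes = len(posteriors[0])
    counts = [sum(frame[n] for frame in posteriors) for n in range(n_outcomes)]
    total = sum(counts)
    return [c / total for c in counts]


# Hypothetical component posteriors for state i over three frames.
component_post = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]]
mixture_weights = ratio_of_expected_counts(component_post)

# Hypothetical transition posteriors from state i to states 0, 1 and 2.
transition_post = [[0.5, 0.5, 0.0], [0.25, 0.25, 0.5]]
transition_probs = ratio_of_expected_counts(transition_post)
```

The only difference from conventional re-estimation is where the posteriors come from: here they depend on the environment variable through the instantiated model parameters.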
[0052] The method may be carried out in specific ways other than those set forth here without departing from the spirit and essential characteristics of the invention. Therefore, the presented embodiments should be considered in all respects as illustrative and not restrictive, and all modifications falling within the meaning and equivalency range of the appended claims are intended to be embraced therein.