A maximum likelihood (ML) linear regression (LR) solution to environment normalization is provided, where the environment is modeled as a hidden (non-observable) variable. By application of an expectation-maximization algorithm and an extension of the Baum-Welch forward and backward variables (Steps 23a-23d), a source normalization is achieved such that it is not necessary to label a database in terms of environment, such as speaker identity, channel, microphone, and noise type.
This invention relates to training for Hidden Markov Model (HMM) modeling of speech and more particularly to removing environmental factors from speech signal during the training procedure.
In the present application we refer to the speaker, handset or microphone, transmission channel, background noise conditions, or a combination of these as the environment. A speech signal can only be measured in a particular environment. Speech recognizers suffer from environment variability for two reasons: trained model distributions may be biased from testing signal distributions because of environment mismatch, and trained model distributions are flat because they are averaged over different environments.
The first problem, the environmental mismatch, can be reduced through model adaptation, based on some utterances collected in the testing environment. To solve the second problem, the environmental factors should be removed from the speech signal during the training procedure, mainly by source normalization.
In the direction of source normalization, speaker adaptive training uses linear regression (LR) solutions to decrease inter-speaker variability. See, for example, T. Anastasakos, et al., "A compact model for speaker-adaptive training," International Conference on Spoken Language Processing, Vol. 2, October 1996. Another technique models mean vectors as the sum of a speaker-independent bias and a speaker-dependent vector. This is found in A. Acero, et al., "Speaker and Gender Normalization for Continuous-Density Hidden Markov Models," in Proc. of IEEE International Conference on Acoustics, Speech and Signal Processing, pages 342-345, Atlanta, 1996. Both of these techniques require explicit labeling of the classes, for example, the speaker or gender of each utterance during training. Therefore, they cannot be used to train clusters of classes that represent acoustically close speakers, handsets or microphones, or background noises. Such inability to discover clusters may be a disadvantage in an application.
An illustrative embodiment of the present invention seeks to provide a method for source normalization training for HMM modeling of speech that avoids or minimizes the above-mentioned problems.
Aspects of the invention are specified in the claims. In carrying out principles of the present invention, a method provides a maximum likelihood (ML) linear regression (LR) solution to the environment normalization problem, where the environment is modeled as a hidden (non-observable) variable. An EM-based training algorithm can generate optimal clusters of environments, and therefore it is not necessary to label a database in terms of environment. For special cases, the technique is compared to the utterance-by-utterance cepstral mean normalization (CMN) technique and shows performance improvement on a noisy-speech telephone database.
In accordance with another feature of the present invention, under the maximum-likelihood (ML) criterion, by application of the EM algorithm and an extension of the Baum-Welch forward and backward variables and algorithm, a joint solution to the parameters for source normalization is obtained, i.e., the canonical distributions, the transformations, and the biases.
For a better understanding of the present invention, reference will now be made, by way of example, to the accompanying drawings, in which:
The training is done on a computer workstation having a monitor 11, a computer workstation 13, a keyboard 15, and a mouse or other interactive device 15a, as shown in Fig. 1. The system may be connected to a separate database, represented by database 17 in Fig. 1, for storage and retrieval of models.
By the term "training" we mean herein fixing the parameters of the speech models according to an optimum criterion. In this particular case, we use HMM (Hidden Markov Model) models. These models are represented in Fig. 2 with states A, B, and C and transitions E, F, G, H, I, and J between states. Each of these states has a mixture of Gaussian distributions 18, represented by Fig. 3. We are training these models to account for different environments. By environment we mean different speaker, handset, transmission-channel, and background-noise conditions. Speech recognizers suffer from environment variability for two reasons: trained model distributions may be biased from testing signal distributions because of environment mismatch, and trained model distributions are flat because they are averaged over different environments. The first problem, the environmental mismatch, can be reduced through model adaptation based on utterances collected in the testing environment. Applicant's teaching herein solves the second problem by removing the environmental factors from the speech signal during the training procedure. This is source normalization training according to the present invention. A maximum likelihood (ML) linear regression (LR) solution to the environment normalization problem is provided herein, where the environment is modeled as a hidden (non-observable) variable.
A clean speech pattern distribution 40 will undergo complex distortion under different environments, as shown in Fig. 4. The two axes represent two parameters, which may be, for example, frequency, energy, formant, spectral, or cepstral components. Fig. 4 illustrates a change at 41 in the distribution due to background noise or a change in speakers. The purpose of the present application is to model this distortion.
The present model assumes the following: 1) the speech signal x is generated by a Continuous Density Hidden Markov Model (CDHMM), called the source distributions; 2) before being observed, the signal has undergone an environmental transformation, drawn from a set of transformations, where Wje is the transformation at HMM state j for environment e; 3) such a transformation is linear and is independent of the mixture components of the source; and 4) there is a bias vector bke at the k-th mixture component due to environment e.
What we observe at time t is: ot = Wjext + bke, where xt is the clean source vector generated by the CDHMM.
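The distortion model in assumptions 1-4 can be sketched numerically. The dimensions and values below are illustrative only, not taken from the patent; the sketch simply shows a clean source vector passing through a linear environment transform plus a mixture-dependent bias, and that the source is recoverable when the transform is known and invertible.

```python
import numpy as np

# Toy illustration of assumptions 1-4: a clean source vector x_t emitted at
# HMM state j, mixture k, is observed through environment e as
#   o_t = W_je @ x_t + b_ke
rng = np.random.default_rng(0)
D = 3                                                  # feature dimension

x_t = rng.standard_normal(D)                           # clean CDHMM emission
W_je = np.eye(D) + 0.1 * rng.standard_normal((D, D))   # linear environment transform
b_ke = 0.5 * np.ones(D)                                # environment bias at mixture k

o_t = W_je @ x_t + b_ke                                # the observed vector

# With W_je known and invertible, the source vector can be recovered:
x_rec = np.linalg.solve(W_je, o_t - b_ke)
assert np.allclose(x_rec, x_t)
```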
Our problem now is to find, in the maximum likelihood (ML) sense, the optimal source distributions, the transformation and the bias set.
In the prior art (A. Acero, et al. cited above and T. Anastasakos, et al. cited above), the environment e must be explicit, e.g.: speaker identity, male/female. An aspect of the present invention overcomes this limitation by allowing an arbitrary number of environments which are optimally trained.
Let N be the number of HMM states, M be the number of mixture components, L be the number of environments, Ωs = {1, 2, ..., N} be the set of states, Ωm = {1, 2, ..., M} be the set of mixture indicators, and Ωe = {1, 2, ..., L} be the set of environment indicators.
For an observed speech sequence of T vectors: O = (o1, o2, ..., oT).
Referring to Fig. 1, the workstation 13, which includes a processor, contains a program as illustrated that starts with an initial standard HMM model 21, which is refined by estimation procedures using Baum-Welch or Estimation-Maximization procedures 23 to get new models 25. The program gets training data from database 19 collected under different environments, and this is used in an iterative process to obtain optimal parameters. From this model we get another model 25 that takes environment changes into account. The quantities are defined by probabilities of observing a particular input vector at some particular state for a particular environment, given the model.
The model parameters can be determined by applying a generalized EM procedure with three types of hidden variables: state sequence, mixture component indicators, and environment indicators. (A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm," Journal of the Royal Statistical Society, 39 (1): 1-38, 1977.) For this purpose, Applicant teaches the CDHMM formulation from B. Juang, "Maximum-Likelihood Estimation for Mixture Multivariate Stochastic Observation of Markov Chains" (The Bell System Technical Journal, pages 1235-1248, July-August 1985) to be extended as in the following paragraphs. Denote:
The speech is observed as a sequence of frames (vectors). Equations 7, 8, and 9 are estimations of intermediate quantities. For example, the quantity in equation 7 is the joint probability of observing the frames from times 1 to t, being at state j at time t, and being in environment e, given the model λ.
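The environment-extended forward variable described above can be sketched as an ordinary Baum-Welch forward recursion carried out per environment and weighted by an environment prior. This is a toy sketch, not the patent's exact equations: the probabilities, emission likelihoods, and dimensions below are made up for illustration.

```python
import numpy as np

# alpha[t, j, e] ~ P(o_1..o_t, state_t = j, environment = e | model):
# the usual forward recursion, run separately for each environment e,
# initialized with an environment prior P(e). All numbers are illustrative.
N, L, T = 2, 2, 4                      # states, environments, frames
pi = np.array([0.6, 0.4])              # initial state probabilities
A = np.array([[0.7, 0.3],              # state transition matrix a_ij
              [0.4, 0.6]])
Pe = np.array([0.5, 0.5])              # environment prior P(e)
B = np.full((T, N, L), 0.5)            # toy emission likelihoods b_j^e(o_t)

alpha = np.zeros((T, N, L))
alpha[0] = pi[:, None] * Pe[None, :] * B[0]
for t in range(1, T):
    for e in range(L):
        alpha[t, :, e] = (alpha[t - 1, :, e] @ A) * B[t, :, e]

# Summing over states and environments at the final frame gives P(O | model).
likelihood = alpha[-1].sum()
assert likelihood > 0
```

The backward variable extends symmetrically; summing alpha over e recovers the standard forward variable, which is why no environment labels are needed during training.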
The following re-estimation equations can be derived from equations 2, 7, 8, and 9.
For the EM procedure 23, equations 10-21 are solutions for the quantities in the model.
Therefore µjk and bke can be simultaneously obtained by solving the linear system of N+L variables.
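The joint solution for the means and biases amounts to solving one linear system in which both sets of unknowns appear. The sketch below is a toy scalar instance with a hypothetical coefficient matrix, not the patent's accumulators; it only shows the "stack the unknowns and solve once" structure.

```python
import numpy as np

# Toy coupled system M z = c with z = (mu_1, ..., mu_N, b_1, ..., b_L),
# solved jointly as in the text. Here N = 2 means and L = 1 bias, as scalars;
# the patent's system has this structure per feature dimension.
N, L = 2, 1
M = np.array([[3.0, 0.0, 3.0],   # hypothetical accumulator-derived coefficients
              [0.0, 2.0, 2.0],   # coupling mu_1, mu_2, and b_1
              [3.0, 2.0, 6.0]])
c = np.array([9.0, 6.0, 16.0])   # hypothetical right-hand side

z = np.linalg.solve(M, c)        # one solve yields means and biases together
mu, b = z[:N], z[N:]

# The solution satisfies the original coupled equations.
assert np.allclose(M @ z, c)
```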
The model λ is specified by the current parameters; the new model is specified by the re-estimated parameters.
As illustrated in Figs. 1 and 5, we start with an initial standard model 21, such as the CDHMM model with initial values. The next step is the Estimation-Maximization procedure 23, starting with (Step 23a) equations 7-9 and re-estimation (Step 23b) equations 10-13 for the initial state probability, transition probability, mixture component probability, and environment probability.
The next step (23c) is to derive the mean vectors and bias vectors by introducing two additional equations, 14 and 15, and equations 16-20. The next step, 23d, is to apply linear equations 21 and 22, solving them jointly for the mean vectors and bias vectors while calculating the variance using equation 23. Equation 24, which is a system of linear equations, is then solved for the transformation parameters using the quantities given by equations 25 and 26. At this point all the model parameters have been solved for. The old model parameters are then replaced by the newly calculated ones (Step 24). The process is repeated for all the frames. When this is done for all the frames of the database, a new model is formed, and the new models are re-evaluated using the same equations until no parameter changes beyond a predetermined threshold (Step 27).
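The iterative structure of Steps 23a-23d, 24, and 27 can be sketched as a loop skeleton. The `e_step`/`m_step` functions below are hypothetical placeholders standing in for the patent's equations; here they implement a simple contraction toward a fixed point so the loop demonstrably converges.

```python
# Skeleton of the training loop: estimate, re-estimate, replace, and repeat
# until no parameter changes beyond a predetermined threshold.
def train(params, threshold=1e-6, max_iters=100):
    """Repeat E/M updates until the parameter change is below threshold."""
    for _ in range(max_iters):
        stats = e_step(params)          # Step 23a: intermediate quantities (eqs. 7-9)
        new_params = m_step(stats)      # Steps 23b-23d: re-estimation (eqs. 10-26)
        if abs(new_params - params) < threshold:   # Step 27: convergence test
            break
        params = new_params             # Step 24: replace the old model
    return params

# Hypothetical stand-ins: a contraction with fixed point 2.0.
def e_step(p):
    return p

def m_step(stats):
    return 0.5 * stats + 1.0

final = train(10.0)
assert abs(final - 2.0) < 1e-4
```

In the real procedure `params` is the full model (initial, transition, mixture, and environment probabilities, means, biases, variances, and transformations) and the E-step runs the extended forward-backward pass over every utterance.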
After the source normalization training model is formed, this model is used in a recognizer as shown in Fig. 6, where input speech is applied to a recognizer 60 that uses the source-normalized HMM model 61 created by the above training to produce the response.
The recognition task has 53 commands of 1-4 words ("call return", "cancel call return", "selective call forwarding", etc.). Utterances are recorded over telephone lines with a diversity of microphones, including carbon, electret, and cordless microphones and hands-free speakerphones. Some of the training utterances do not correspond to their transcriptions, for example: "call screen" (cancel call screen), "matic call back" (automatic call back), "call tra" (call tracking).
The speech is sampled at 8 kHz with a 20 ms frame rate. The observation vectors are composed of 13 LPCC (Linear Prediction Coding Coefficients)-derived MFCC (Mel-Scale Cepstral Coefficients) plus regression-based delta MFCC. CMN is performed at the utterance level. There are 3505 utterances for training and 720 for speaker-independent testing. The number of utterances per call ranges between 5 and 30.
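Utterance-level CMN, the baseline the patent compares against, can be sketched in a few lines: subtract the per-utterance mean from every frame's cepstral vector, which removes a fixed convolutional channel bias in the cepstral domain. The frame values below are toy data.

```python
import numpy as np

def cmn(frames):
    """Utterance-level cepstral mean normalization.

    frames: (T, D) array of cepstral vectors for one utterance.
    Returns the frames with the utterance mean subtracted per dimension.
    """
    return frames - frames.mean(axis=0, keepdims=True)

utterance = np.array([[1.0, 2.0],
                      [3.0, 4.0],
                      [5.0, 6.0]])
normalized = cmn(utterance)

# After CMN the utterance mean is zero in every cepstral dimension.
assert np.allclose(normalized.mean(axis=0), 0.0)
```

Unlike the source normalization training above, CMN removes only a per-utterance additive cepstral bias and cannot model state-dependent transformations or cluster environments.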
Because of data sparseness, besides transformation sharing among states and mixtures, the transformations need to be shared by a group of phonetically similar phones. The grouping, based on a hierarchical clustering of phones, depends on the amount of training (SN) or adaptation (AD) data: the larger the number of tokens, the larger the number of transformations. Recognition experiments are run on several system configurations:
Based on the results summarized in Table 1, we point out:
Although the present invention and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.