Structured Discriminative Models For Speech Recognition


Structured Discriminative Models for Speech Recognition
Mark Gales (work with Anton Ragni, Austin Zhang and Rogier van Dalen)
April 2012
Cambridge University Engineering Department
NTT Visit

Overview

• Acoustic models for speech recognition
  – dependency modelling
  – generative and discriminative models
• Sequence (dynamic) kernels
  – discrete and continuous observation forms
• Combining generative and discriminative models
  – generative score-spaces and log-linear models
• Training criteria
  – large-margin-based training
• Initial evaluation
  – AURORA-2 and AURORA-4 experimental results

Acoustic Models

Dependency Modelling for Speech Recognition

• Sequence kernels for text-independent speaker verification used GMMs
  – for ASR we are interested in modelling inter-frame dependencies
• Dependency modelling is an essential part of modelling sequence data:

    p(o_1, \ldots, o_T; \lambda) = p(o_1; \lambda)\, p(o_2 | o_1; \lambda) \cdots p(o_T | o_1, \ldots, o_{T-1}; \lambda)

  – impractical to model directly in this form
• Two possible forms of conditional independence are used:
  – observed variables
  – latent (unobserved) variables
• Even given the dependencies (the form of the Bayesian network):
  – need to determine how the dependencies interact

Hidden Markov Model - A Dynamic Bayesian Network

[Figure: (a) standard HMM phone topology with emitting states 2-4, transition probabilities a_{ij} and output distributions b_2(), b_3(), b_4(); (b) the HMM drawn as a Dynamic Bayesian Network over states q_t and observations o_t.]

• Notation for DBNs [1]: circles denote continuous variables, squares discrete variables; shaded nodes are observed, non-shaded nodes unobserved.
• Observations are conditionally independent of other observations given the state.
• States are conditionally independent of other states given the previous state.
• This yields the likelihood (see the forward-algorithm sketch below):

    p(O; \lambda) = \sum_{q} \prod_{t=1}^{T} P(q_t | q_{t-1})\, p(o_t | q_t; \lambda)
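The sum over state sequences above is computed efficiently by the forward recursion. A minimal numpy/scipy sketch, assuming Gaussian emissions; all names and shapes here are illustrative, not from the talk:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def hmm_log_likelihood(O, log_pi, log_A, means, covs):
    """log p(O; lambda) via the forward recursion.

    O:      (T, d) observation sequence
    log_pi: (S,)   log initial-state probabilities
    log_A:  (S, S) log transitions, log_A[i, j] = log P(q_t = j | q_{t-1} = i)
    """
    T, S = len(O), len(log_pi)
    # per-frame emission log-likelihoods log p(o_t | q_t = s; lambda)
    log_b = np.array([[multivariate_normal.logpdf(O[t], means[s], covs[s])
                       for s in range(S)] for t in range(T)])
    log_alpha = log_pi + log_b[0]
    for t in range(1, T):
        # only the previous state matters: the Markov assumption
        log_alpha = log_b[t] + logsumexp(log_alpha[:, None] + log_A, axis=0)
    return logsumexp(log_alpha)
```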

HMM Trajectory Modelling

[Figure: MFCC c1 plotted against frame index for frames from the phrase "SHOW THE GRIDLEY'S" (phone sequence sh-ow-dh-ax-g-r-ih-d-l-iy-z), comparing the true trajectory with the HMM.]

Dependency Modelling using Observed Variables

[Figure: DBN in which each observation o_t depends on the state q_t and on the preceding observations.]

• Commonly use a member (or mixture) of the exponential family:

    p(O; \alpha) = \prod_{t=1}^{T} \frac{1}{Z_t} \exp\left( \alpha^T \phi(o_{t-n}, \ldots, o_t, q_t) \right)

  – \phi(o_{t-n}, \ldots, o_t) are the sufficient statistics from a window of n frames
  – \alpha are the natural parameters, Z_t the (local) normalisation term:

    Z_t = \int \exp\left( \alpha^T \phi(o_{t-n}, \ldots, o_t) \right) do_{t-n} \cdots do_t

• What is the appropriate form of the statistics \phi(O)? This requires the DBN to be known.

Discriminative Models

• Classification requires the class posterior P(w | O)
  – Generative model, e.g. the HMM previously discussed:

    P(w | O; \lambda) = \frac{p(O | w; \lambda) P(w)}{\sum_{\tilde{w}} p(O | \tilde{w}; \lambda) P(\tilde{w})}

  – Discriminative model: directly model the posterior
• The log-linear model is the discriminative form of interest here (sketched below):

    P(w | O; \alpha) = \frac{1}{Z} \exp\left( \alpha^T \phi(O, w) \right)

  – normalisation term Z (simpler to compute than for the generative model):

    Z = \sum_{\tilde{w}} \exp\left( \alpha^T \phi(O, \tilde{w}) \right)

• BUT we still need to decide the form of the features \phi(O, w)
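A minimal sketch of this log-linear posterior, assuming Z is computed over an explicit candidate hypothesis list; `phi` is a placeholder for whatever feature function is chosen:

```python
import numpy as np
from scipy.special import logsumexp

def log_posterior(alpha, phi, O, w, candidates):
    """log P(w | O; alpha) for a log-linear model over a hypothesis list."""
    scores = np.array([alpha @ phi(O, v) for v in candidates])
    return alpha @ phi(O, w) - logsumexp(scores)   # subtract log Z
```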

Sequence Discriminative Models

• Applying discriminative models to speech data is non-trivial:
  1. The number of possible classes is vast
     – motivates the use of structured discriminative models
  2. The length of the observation sequence O varies from utterance to utterance
     – motivates the use of sequence kernels to obtain features
  3. The number of labels (words) and observations (frames) differ
     – addressed by combining solutions to (1) and (2)
• To handle these, a segmentation a is often required
• A range of features are then possible, based on:
  – word sequences \phi(w): "language-model"-like
  – segmentation-word sequences \phi(a, w): "pronunciation-model"-like
  – segmentation-observation sequences \phi(O_{\{a_i\}}, a_i): "acoustic-model"-like

Code-Breaking Style

• Rather than handling the complete sequence, split it into segments
  – perform a simpler classification for each segment
  – complexity determined by the segment (simplest: word)
  1. Use an HMM-based hypothesis to obtain a word-level segmentation
  2. For each segment of a, binary SVMs vote for the word:

    \hat{\omega} = \arg\max_{\omega \in \{\text{ONE}, \ldots, \text{SIL}\}} \alpha^T \phi(O_{\{a_i\}}, \omega)

• Limitations of the code-breaking approach [2] (see the sketch below):
  – each segment is treated independently
  – restricted to one segmentation, generated by the HMMs
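A sketch of the code-breaking decoding loop under these assumptions: a fixed HMM segmentation, and hypothetical pre-trained pairwise binary SVMs exposed through an `svm_score` interface; all names are illustrative:

```python
from itertools import combinations

def decode_segments(segments, classes, svm_score, phi):
    """svm_score((u, v), f) > 0 votes for u, else for v."""
    hyp = []
    for seg in segments:                  # each segment handled independently
        votes = dict.fromkeys(classes, 0)
        for u, v in combinations(classes, 2):
            f = phi(seg, (u, v))          # pair-specific score-space features
            votes[u if svm_score((u, v), f) > 0 else v] += 1
        hyp.append(max(votes, key=votes.get))   # majority vote per segment
    return hyp
```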

Example Standard Sequence Models

[Figure: DBNs for the HMM, the MEMM and the (H)CRF over states q_t and observations o_t.]

• The segmentation a determines the state sequence q
  – maximum entropy Markov model [3]:

    P(q | O) = \prod_{t=1}^{T} \frac{1}{Z_t} \exp\left( \alpha^T \phi(q_t, q_{t-1}, o_t) \right)

  – hidden conditional random field (simplified linear form only) [4]:

    P(q | O) = \frac{1}{Z} \prod_{t=1}^{T} \exp\left( \alpha^T \phi(q_t, q_{t-1}, o_t) \right)

Features

• Discriminative sequence models have simple sufficient statistics
  – simple models with second-order statistics are (almost) a discriminative HMM
  – simplest approach: extend the frame features (for each state s_i):

    \phi(q_t, q_{t-1}, o_t) = \begin{bmatrix} \delta(q_t, s_i) \\ \delta(q_t, s_i)\,\delta(q_{t-1}, s_j) \\ \delta(q_t, s_i)\, o_t \\ \delta(q_t, s_i)\, o_t \otimes o_t \\ \delta(q_t, s_i)\, o_t \otimes o_t \otimes o_t \end{bmatrix}

  – still the same conditional independence assumptions as the HMM (a sketch of these statistics follows)

How to extend the range of features?

• Consider features of a particular segment of speech
  – the size of each segment may vary from segment to segment
  – need to map to a fixed dimensionality, independent of the number of frames
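A sketch of the frame-feature extraction listed above, up to the second-order term (the third-order term would follow the same pattern); `states` and the shapes are illustrative:

```python
import numpy as np

def frame_features(q_t, q_prev, o_t, states):
    """First- and second-order frame statistics for each state s_i."""
    feats = []
    for si in states:
        d = 1.0 if q_t == si else 0.0
        feats.append(np.array([d]))                          # delta(q_t, s_i)
        feats.append(np.array([d * (q_prev == sj) for sj in states]))
        feats.append(d * o_t)                                # delta * o_t
        feats.append(d * np.outer(o_t, o_t).ravel())         # delta * o_t o_t^T
    return np.concatenate(feats)
```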

Flat Direct Models

[Figure: the whole observation sequence o_1, ..., o_T mapped directly to the sentence "<s> the dog chased the cat </s>".]

• Remove the conditional independence assumptions:

    P(w | O) = \frac{1}{Z} \exp\left( \alpha^T \phi(O, w) \right)

• A simple model, but the lack of structure causes problems:
  – the extracted feature-space becomes vast (the number of possible sentences)
  – the associated parameter vector is vast
  – large number of unseen examples

Structured Discriminative Models

[Figure: observation sequence segmented into word-labelled chunks, e.g. "dog" spanning o_{i+1}, o_{i+2}, ... and "chased" spanning o_j, ..., o_\tau.]

• Introduce structure into the observation sequence via a segmentation a
  – each segment comprises an identity a_i and a set of observations O_{\{a_i\}}:

    P(w | O) = \frac{1}{Z} \sum_{a} \exp\left( \alpha^T \sum_{\tau=1}^{|a|} \phi(O_{\{a_\tau\}}, a_\tau) \right)

  – segmentation may be at the word, (context-dependent) phone, etc. level
• What form should \phi(O_{\{a_\tau\}}, a_\tau) have?
  – it must be able to handle variable-length O_{\{a_\tau\}}

Sequence Kernels

Sequence Kernels

• Sequence kernels are a class of kernel that handles sequence data
  – also applied in a range of biological applications, text processing and speech
  – in this talk these kernels are partitioned into three classes
• Discrete-observation kernels
  – appropriate for text data
  – string kernels are the simplest form
• Distributional kernels
  – distances between distributions trained on sequences
• Generative kernels
  – parametric form: use the parameters of the generative model
  – derivative form: use the derivatives with respect to the model parameters

String Kernel

• For speech and text processing the input space has variable dimension:
  – use a kernel to map from variable to fixed length;
  – string kernels are an example for text [5].
• Consider the words cat, cart, bar and a character string kernel (gap penalty λ):

    Feature   φ(cat)   φ(cart)   φ(bar)
    c-a       1        1         0
    c-t       λ        λ²        0
    c-r       0        λ         0
    a-r       0        1         1
    r-t       0        1         0
    b-a       0        0         1
    b-r       0        0         λ

    K(cat, cart) = 1 + λ³,   K(cat, bar) = 0,   K(cart, bar) = 1

• Successfully applied to various text classification tasks:
  – how can the process be made efficient (and more general)? See the sketch below.
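A brute-force sketch of a gappy character-bigram feature map consistent with the table, weighting each ordered pair by λ raised to the number of skipped characters; a dynamic-programming (or transducer) implementation would be used for efficiency. Note the code also counts pairs such as a-t that the table above omits:

```python
from collections import defaultdict

def gappy_bigram_phi(s, lam=0.5):
    """Feature map: weight lambda^(gap length) for each ordered char pair."""
    phi = defaultdict(float)
    for i in range(len(s)):
        for j in range(i + 1, len(s)):
            phi[s[i] + '-' + s[j]] += lam ** (j - i - 1)
    return phi

def string_kernel(s1, s2, lam=0.5):
    """Dot product of the two sparse feature maps."""
    p1, p2 = gappy_bigram_phi(s1, lam), gappy_bigram_phi(s2, lam)
    return sum(v * p2[k] for k, v in p1.items() if k in p2)

# string_kernel('cat', 'bar') == 0.0, as in the table
```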

Rational Kernels

• Rational kernels [6] encompass various standard feature-spaces and kernels:
  – bag-of-words and N-gram counts, gappy N-grams (string kernel), ...
• A transducer T for the string kernel (gappy bigram), with vocabulary {a, b}:

    [Figure: three-state weighted transducer; deletion self-loops a:ε/1 and b:ε/1 on states 1 and 3, gap self-loops a:ε/λ and b:ε/λ on state 2, and matching arcs a:a/1 and b:b/1 from state 1 to 2 and from state 2 to the final state 3/1.]

  The kernel is K(O_i, O_j) = w\left[ O_i \circ (T \circ T^{-1}) \circ O_j \right]

• This form can also handle uncertainty in decoding:
  – lattices can be used rather than the 1-best output O_i.
• Can also be applied to continuous-data kernels [7].

Generative Score-Spaces

• Generative kernels use scores of the following form [8]:

    \phi(O; \lambda) = [\log p(O; \lambda)]

  – the simplest form maps a sequence to a 1-dimensional score-space
• Parametric score-spaces increase the score-space size:

    \phi(O; \lambda) = \begin{bmatrix} \hat{\lambda}^{(1)} \\ \vdots \\ \hat{\lambda}^{(K)} \end{bmatrix}

  – parameters estimated on O: related to the mean-supervector kernel
• Derivative score-spaces take the following form:

    \phi(O; \lambda) = [\nabla_\lambda \log p(O; \lambda)]

  – with the appropriate metric this is the Fisher kernel [9]

Generative Kernels

• The kernel associated with a generative score-space is:

    K(O_i, O_j; \lambda) = \phi(O_i; \lambda)^T G^{-1} \phi(O_j; \lambda)

  – \phi(O; \lambda) is the score-space for O using parameters \lambda
  – G is the appropriate metric for the score-space
• The exact form of the metric is important
  – the standard form is a maximally non-committal metric:

    \mu_g = E\{\phi(O; \lambda)\}; \quad G = \Sigma_g = E\left\{ (\phi(O; \lambda) - \mu_g)(\phi(O; \lambda) - \mu_g)^T \right\}

  – an empirical approximation based on the training data is often used (sketched below)
  – equal "weight" is given to all dimensions
  – for a Fisher kernel with ML-trained models, G is the Fisher information matrix
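A numpy sketch of the empirical maximally non-committal metric: estimate the score-space covariance on training sequences and whiten before taking the dot product; the pseudo-inverse is a safeguard for rank-deficient score-spaces:

```python
import numpy as np

def generative_kernel(scores_train, phi_i, phi_j):
    """K(O_i, O_j) = phi_i^T G^{-1} phi_j, G = empirical score covariance.

    scores_train: (N, K) matrix of phi(O; lambda) over training sequences.
    """
    mu_g = scores_train.mean(axis=0)
    centred = scores_train - mu_g
    G = centred.T @ centred / len(scores_train)   # empirical Sigma_g
    return phi_i @ np.linalg.pinv(G) @ phi_j
```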

Combining Generative & Discriminative Models

Combining Discriminative and Generative Models

[Figure: block diagram; test data O and a canonical HMM λ pass through adaptation/compensation, recognition produces hypotheses, the generative model maps (O, λ) into a score-space φ(O, λ), and a discriminative classifier produces the final hypotheses.]

• Use the generative model to extract features [9, 8] (we do like HMMs!)
  – adapt the generative model: a speaker/noise-independent discriminative model
• Use your favourite form of discriminative classifier, for example:
  – log-linear model/logistic regression
  – binary/multi-class support vector machines

Score-Space Sufficient Statistics

• A systematic approach to extracting sufficient statistics is needed
  – what about using the sequence-kernel score-spaces?

    \phi(O) = \phi(O; \lambda)

  – does this help with the dependencies?
• For an HMM the mean derivative elements become (see the sketch below):

    \nabla_{\mu^{(jm)}} \log p(O; \lambda) = \sum_{t=1}^{T} P(q_t = \{\theta_j, m\} | O; \lambda)\, \Sigma^{(jm)-1} (o_t - \mu^{(jm)})

  – the state/component posterior is a function of the complete sequence O
  – this introduces longer-term dependencies
  – different conditional-independence assumptions than the generative model
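A sketch of these mean-derivative elements, assuming the state/component posteriors `gamma` have already been computed (e.g. by forward-backward); shapes are illustrative:

```python
import numpy as np

def mean_derivative_scores(O, gamma, means, inv_covs):
    """Derivative score-space w.r.t. every component mean.

    O:        (T, d) observations
    gamma:    (T, J, M) posteriors P(q_t = {theta_j, m} | O; lambda)
    """
    T, J, M = gamma.shape
    scores = []
    for j in range(J):
        for m in range(M):
            resid = (O - means[j][m]) @ inv_covs[j][m].T   # Sigma^{-1}(o_t - mu)
            # the posterior weights couple every frame to the whole sequence O
            scores.append((gamma[:, j, m, None] * resid).sum(axis=0))
    return np.concatenate(scores)
```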

Score-Space Dependencies

• Consider a simple 2-class, 2-symbol {A, B} problem:
  – Class ω1: AAAA, BBBB
  – Class ω2: AABB, BBAA

[Figure: 2-emitting-state HMM with P(A) = P(B) = 0.5 in both states and transition probabilities 0.5.]

    Feature               Class ω1            Class ω2
                          AAAA    BBBB        AABB    BBAA
    Log-Lik               -1.11   -1.11       -1.11   -1.11
    ∇A (state 2)           0.50   -0.50        0.33   -0.33
    ∇²AA (within state)   -3.83    0.17       -3.28   -0.61
    ∇²AA (across states)  -0.17   -0.17       -0.06   -0.06

• ML-trained HMMs are the same for both classes
• With the first derivative the classes are separable, but not linearly separable
  – this is also true of the second derivative within a state
• The second derivative across states is linearly separable

Score-Spaces for ASR

• Forms of score-space used in the experiments:
  – appended log-likelihoods:

    \phi_{a0}(O; \lambda) = \begin{bmatrix} \log p(O; \lambda^{(1)}) \\ \vdots \\ \log p(O; \lambda^{(K)}) \end{bmatrix}

  – log-likelihood (for class ω_i): \phi_{b0}(O; \lambda) = [\log p(O; \lambda^{(i)})]
  – derivative (means only, for class ω_i): \phi_{b1\mu}(O; \lambda) = [\nabla_{\mu^{(i)}} \log p(O; \lambda^{(i)})]
• In common with most discriminative models, joint feature-spaces are used (see the sketch below):

    \phi(O, a; \lambda) = \begin{bmatrix} \sum_{\tau=1}^{|a|} \delta(a_\tau, w^{(1)})\, \phi(O_{\{a_\tau\}}; \lambda) \\ \vdots \\ \sum_{\tau=1}^{|a|} \delta(a_\tau, w^{(P)})\, \phi(O_{\{a_\tau\}}; \lambda) \end{bmatrix}

  for α tied over "units" {w^{(1)}, ..., w^{(P)}}, with underlying score-space \phi(O; \lambda).
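A sketch of the joint feature-space construction, assuming a fixed segmentation with one unit label per segment; `phi` stands for the underlying K-dimensional generative score-space:

```python
import numpy as np

def joint_features(segments, labels, units, phi, K):
    """Stack per-unit sums of segment score vectors into one long vector."""
    Phi = np.zeros((len(units), K))
    for seg, lab in zip(segments, labels):
        Phi[units.index(lab)] += phi(seg)   # delta(a_tau, w^(p)) selects a block
    return Phi.ravel()                      # dimension = K * P
```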

Joint Feature-Space Example

[Figure: joint feature-space for the units "ONE" and "Three": the "ONE" block holds the generative features [log P(o; λ^(1)); ...; log P(o; λ^(K))] for a segment labelled "ONE", while the "Three" block is zero.]

• The size of the joint feature-space is the product of:
  1. the feature-space size (K), determined by the generative model
  2. the number of α classes (P), determined by the discriminative model
• The segmentation of the sentence will alter the scores

Segmentation

[Figure: observation sequence segmented at the word level ("dog", "chased") and at the phone level (/d/, /ao/, /g/, /ch/, ...).]

• Segmentation can be viewed at multiple levels:
  – sentence: yields the flat direct model, with its standard problems
  – word: easy implementation for small vocabularies, sparsity issues
  – phone: may be context-dependent
  – state: very flexible, but a large number of segments
• Multiple levels of segmentation can be used/combined:
  – multiple segmentations can be used to derive features
  – different segmentations can be used for the generative and discriminative models

Parameter Tying

• Parameter tying in the combined classifier [10]:
  – two sets of parameters: discriminative α, generative λ

[Figure: the generative model λ uses a state-clustered phonetic decision tree and the discriminative model α a model-clustered tree (questions such as "Left V?", "Left C?", "Left F?", "Right F?", "Left N?"); the combined model operates on the intersection of the two trees.]

  – the tree intersection can cause generalisation problems

Handling Latent Variables

• Two forms of model can be used (contrasted in the sketch below):
  1. marginalise over all possible segmentations:

    P(w | O) = \frac{1}{Z} \sum_{a} \exp\left( \alpha^T \sum_{\tau=1}^{|a|} \phi(O_{\{a_\tau\}}, a_\tau) \right)

  2. use the "best" segmentation \hat{a}:

    P(w | O, \hat{a}) = \frac{1}{Z} \exp\left( \alpha^T \sum_{\tau=1}^{|\hat{a}|} \phi(O_{\{\hat{a}_\tau\}}, \hat{a}_\tau) \right)

    \hat{a} = \arg\max_{a} \exp\left( \alpha^T \sum_{\tau=1}^{|a|} \phi(O_{\{a_\tau\}}, a_\tau) \right)
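A sketch contrasting the two treatments, assuming a small explicit set of candidate segmentations (in practice a lattice would be used rather than enumeration); `Phi(a)` stands for the summed joint feature vector of segmentation a:

```python
from scipy.special import logsumexp

def score_marginal(alpha, Phi, segmentations):
    """Form 1: log-sum-exp of the scores over all candidate segmentations."""
    return logsumexp([alpha @ Phi(a) for a in segmentations])

def score_best(alpha, Phi, segmentations):
    """Form 2: keep only the highest-scoring segmentation."""
    return max(alpha @ Phi(a) for a in segmentations)
```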

Approximate Training/Inference Schemes

• If HMMs are being used anyway, use them for segmentation (O(T)):
  – simplest approach: use the Viterbi (1-best) segmentation from the HMM, \hat{a}^{\text{hmm}}
  – use this fixed segmentation in training and test: highly efficient

    P(w | O) = \frac{1}{Z} \exp\left( \alpha^T \sum_{\tau=1}^{|\hat{a}^{\text{hmm}}|} \phi(O_{\{\hat{a}^{\text{hmm}}_\tau\}}, \hat{a}^{\text{hmm}}_\tau) \right)

    \hat{a}^{\text{hmm}} = \arg\max_{a} \{ p(O | a; \lambda) P(a) \}

• Assumption: the segmentation does not depend on the discriminative model parameters
  – it is unclear how accurate/appropriate this is!
• Schemes for efficient inference and feature extraction are possible [11]

Handling Speaker/Noise Differences

• A standard problem with kernel-based approaches is adaptation/robustness:
  – not a problem with generative kernels
  – adapt the generative models using model-based adaptation
• Standard approaches for speaker/environment adaptation:
  – (Constrained) Maximum Likelihood Linear Regression [12]:

    x_t = A o_t + b; \quad \mu^{(m)} = A \mu_x^{(m)} + b

  – Vector Taylor Series compensation [13] (used in this work, sketched below):

    \mu^{(m)} = C \log\left( \exp\left(C^{-1}(\mu_x^{(m)} + \mu_h)\right) + \exp\left(C^{-1} \mu_n\right) \right)

• Adapting the generative model will alter the score-space
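A sketch of the VTS mean compensation formula for the static parameters, assuming cepstral means with DCT matrix C; the variable names (clean speech mu_x, channel mu_h, additive noise mu_n) follow the equation above:

```python
import numpy as np

def vts_compensate_mean(mu_x, mu_h, mu_n, C):
    """mu = C log( exp(C^{-1}(mu_x + mu_h)) + exp(C^{-1} mu_n) )."""
    C_inv = np.linalg.pinv(C)   # pseudo-inverse: the truncated DCT is not square
    return C @ np.log(np.exp(C_inv @ (mu_x + mu_h)) + np.exp(C_inv @ mu_n))
```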

Training Criteria

Simple MMIE Example

• HMMs are not the correct model, so discriminative criteria are a possibility

[Figure: 2-class toy data modelled with diagonal-covariance Gaussians; the MLE solution is contrasted with the MMIE solution.]

• Discriminative criteria are a function of the posteriors P(w | O; λ)
  – use them to train the discriminative model parameters α

Discriminative Training Criteria

• Apply discriminative criteria to train the discriminative model parameters α (sketched below):
  – Conditional Maximum Likelihood (CML) [14, 15]: maximise

    F_{\text{cml}}(\alpha) = \frac{1}{R} \sum_{r=1}^{R} \log P(w_{\text{ref}}^{(r)} | O^{(r)}; \alpha)

  – Minimum Classification Error (MCE) [16]: minimise

    F_{\text{mce}}(\alpha) = \frac{1}{R} \sum_{r=1}^{R} \left( 1 + \left[ \frac{P(w_{\text{ref}}^{(r)} | O^{(r)}; \alpha)}{\sum_{w \neq w_{\text{ref}}^{(r)}} P(w | O^{(r)}; \alpha)} \right]^{\varrho} \right)^{-1}

  – Minimum Bayes' Risk (MBR) [17, 18]: minimise

    F_{\text{mbr}}(\alpha) = \frac{1}{R} \sum_{r=1}^{R} \sum_{w} P(w | O^{(r)}; \alpha)\, L(w, w_{\text{ref}}^{(r)})
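A sketch of the three criteria for a single utterance, assuming `logp` maps each hypothesis in an explicit list to its log-posterior log P(w | O; α):

```python
import numpy as np
from scipy.special import logsumexp

def cml_term(logp, w_ref):
    return logp[w_ref]                                    # maximise

def mce_term(logp, w_ref, rho=1.0):
    # (1 + [P_ref / sum_others]^rho)^{-1}, computed in the log domain
    log_others = logsumexp([lp for w, lp in logp.items() if w != w_ref])
    return 1.0 / (1.0 + np.exp(rho * (logp[w_ref] - log_others)))  # minimise

def mbr_term(logp, w_ref, loss):
    return sum(np.exp(lp) * loss(w, w_ref) for w, lp in logp.items())  # minimise
```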

MBR Loss Functions for ASR

• Sentence (1/0 loss):

    L(w, w_{\text{ref}}^{(r)}) = \begin{cases} 1; & w \neq w_{\text{ref}}^{(r)} \\ 0; & w = w_{\text{ref}}^{(r)} \end{cases}

  – when \varrho = 1, F_{\text{mce}}(\alpha) = F_{\text{mbr}}(\alpha)
• Word: directly related to minimising the expected Word Error Rate (WER)
  – normally computed by minimising the Levenshtein edit distance (sketched below)
• Phone: consider phone rather than word loss
  – improved generalisation, as more "errors" are observed
  – this is known as Minimum Phone Error (MPE) training [19, 20]
• Hamming (MPFE): the number of erroneous frames, measured at the phone level
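A standard Levenshtein edit-distance sketch for the word-level loss L(w, w_ref), operating on lists of words:

```python
def levenshtein(w, w_ref):
    """Minimum number of substitutions, insertions and deletions."""
    prev = list(range(len(w_ref) + 1))
    for i, a in enumerate(w, 1):
        cur = [i]
        for j, b in enumerate(w_ref, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (a != b)))   # substitution / match
        prev = cur
    return prev[-1]
```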

Large Margin Based Criteria

• The standard criterion for SVMs
  – improves generalisation

[Figure: log posterior ratio plotted for the training samples; correctly classified points must lie beyond the margin.]

• Require the log posterior ratio

    \min_{w \neq w_{\text{ref}}} \log\left( \frac{P(w_{\text{ref}} | O; \alpha)}{P(w | O; \alpha)} \right)

  to be beyond the margin
• As sequences are being used, the margin can be made a function of the "loss"; minimise

    F_{\text{lm}}(\alpha) = \frac{1}{R} \sum_{r=1}^{R} \left[ \max_{w \neq w_{\text{ref}}^{(r)}} \left\{ L(w, w_{\text{ref}}^{(r)}) - \log\left( \frac{P(w_{\text{ref}}^{(r)} | O^{(r)}; \alpha)}{P(w | O^{(r)}; \alpha)} \right) \right\} \right]_{+}

  using the hinge loss [f(x)]_+. Many variants are possible [21, 22, 23, 24].

Relationship to (Structured) SVM

• Commonly add a Gaussian prior for regularisation:

    F(\alpha) = \log N(\alpha; \mu_\alpha, \Sigma_\alpha) + F_{\text{lm}}(\alpha)

• Make the posteriors a log-linear model (α) with a generative score-space (λ) [25]
  – restrict the parameters of the prior: N(\alpha; \mu_\alpha, \Sigma_\alpha) = N(\alpha; 0, C I)

    F(\alpha) = \frac{C}{2} \|\alpha\|^2 + \frac{1}{R} \sum_{r=1}^{R} \left[ \max_{w \neq w_{\text{ref}}^{(r)}} \left\{ L(w, w_{\text{ref}}^{(r)}) - \left( \alpha^T \phi(O^{(r)}, w_{\text{ref}}^{(r)}; \lambda) - \alpha^T \phi(O^{(r)}, w; \lambda) \right) \right\} \right]_{+}

• Standard result: this is a structured SVM [26, 25]

Structured SVM Training

• Training α so that α^T φ(O, w) is maximal for the correct reference w_ref:

[Figure: training samples O^{(1)}, ..., O^{(R)} with references such as "1 2 3" and competing hypotheses such as "9 9 9" and "4 5 6"; each reference score must beat every competitor's loss-augmented score.]

• General unconstrained convex form, solved with the cutting-plane algorithm [27, 28] (one step is sketched below):

    F(\alpha) = \frac{C}{2} \|\alpha\|^2 + \frac{1}{R} \sum_{r=1}^{R} \left[ \max_{w \neq w_{\text{ref}}^{(r)}} \left\{ L(w, w_{\text{ref}}^{(r)}) + \alpha^T \phi(O^{(r)}, w) \right\} - \alpha^T \phi(O^{(r)}, w_{\text{ref}}^{(r)}) \right]_{+}
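A sketch of one loss-augmented step, assuming an explicit candidate list in place of lattice-based decoding; real cutting-plane training collects the violated constraints and solves a QP, so the simple subgradient update here is only a stand-in:

```python
def margin_step(alpha, O, w_ref, candidates, phi, loss, C, lr=0.1):
    """One loss-augmented subgradient step on the structured hinge."""
    ref_score = alpha @ phi(O, w_ref)
    # search for the most-violated competitor (loss-augmented decoding)
    w_star = max((w for w in candidates if w != w_ref),
                 key=lambda w: loss(w, w_ref) + alpha @ phi(O, w))
    violation = loss(w_star, w_ref) + alpha @ phi(O, w_star) - ref_score
    grad = C * alpha                      # gradient of (C/2) * ||alpha||^2
    if violation > 0:                     # hinge active: add its subgradient
        grad += phi(O, w_star) - phi(O, w_ref)
    return alpha - lr * grad
```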
