Dynamic Bayesian Networks For Audio-visual Speech Recognition


EURASIP Journal on Applied Signal Processing 2002:11, 1-15
(c) 2002 Hindawi Publishing Corporation

Dynamic Bayesian Networks for Audio-Visual Speech Recognition

Ara V. Nefian
Intel Corporation, Microcomputer Research Labs, 2200 Mission College Blvd., Santa Clara, CA 95052-8119, USA
Email: ara.nefian@intel.com

Luhong Liang
Intel Corporation, Microcomputer Research Labs, Guanghua Road, 100020 Chaoyang District, Beijing, China
Email: luhong.liang@intel.com

Xiaobo Pi
Intel Corporation, Microcomputer Research Labs, Guanghua Road, 100020 Chaoyang District, Beijing, China
Email: xiaobo.pi@intel.com

Xiaoxing Liu
Intel Corporation, Microcomputer Research Labs, Guanghua Road, 100020 Chaoyang District, Beijing, China
Email: xiaoxing.liu@intel.com

Kevin Murphy
Computer Science Division, University of California, Berkeley, Berkeley, CA 94720-1776, USA
Email: murphyk@cs.berkeley.edu

Received 30 November 2001 and in revised form 6 August 2002

The use of visual features in audio-visual speech recognition (AVSR) is justified by both the speech generation mechanism, which is essentially bimodal in audio and visual representation, and by the need for features that are invariant to acoustic noise perturbation. As a result, current AVSR systems demonstrate significant accuracy improvements in environments affected by acoustic noise. In this paper, we describe the use of two statistical models for audio-visual integration, the coupled HMM (CHMM) and the factorial HMM (FHMM), and compare the performance of these models with the existing models used in speaker-dependent audio-visual isolated word recognition. The statistical properties of both the CHMM and the FHMM allow them to model the state asynchrony of the audio and visual observation sequences while preserving their natural correlation over time. In our experiments, the CHMM performs best overall, outperforming all the existing models and the FHMM.

Keywords and phrases: audio-visual speech recognition, hidden Markov models, coupled hidden Markov models, factorial hidden Markov models, dynamic Bayesian networks.

1. INTRODUCTION

The variety of applications of automatic speech recognition (ASR) systems for human computer interfaces, telephony, and robotics has driven the research of a large scientific community in recent decades. However, the success of the currently available ASR systems is restricted to relatively controlled environments and well-defined applications such as dictation or small to medium vocabulary voice-based control commands (e.g., hands-free dialing). Often, robust ASR systems require special positioning of the microphone with respect to the speaker, resulting in a rather unnatural human-machine interface. In recent years, together with the investigation of several acoustic noise reduction techniques, the study of visual features has emerged as an attractive solution to speech recognition under less constrained environments. The use of visual features in audio-visual speech recognition (AVSR) is motivated by the speech formation mechanism and the natural ability of humans to reduce audio ambiguity using visual cues [1]. In addition, the visual information provides complementary features that cannot be corrupted by the acoustic noise of the environment. The importance of visual features for speech recognition, especially under noisy environments, has been demonstrated by the success of recent AVSR systems [2].

However, problems such as the selection of the optimal set of visual features or the optimal model for audio-visual integration remain challenging research topics.

Figure 1: The audio-visual speech recognition system.

Figure 2: The state transition diagram of a left-to-right HMM.

In this paper, we describe a set of improvements to the existing methods for visual feature selection, and we focus on two models for isolated word audio-visual speech recognition: the coupled hidden Markov model (CHMM) [3] and the factorial hidden Markov model (FHMM) [4], which are special cases of dynamic Bayesian networks [5]. The structure of both models investigated in this paper describes the state asynchrony of the audio and visual components of speech while maintaining their natural correlation over time. The isolated word AVSR system illustrated in Figure 1 is used to analyze the performance of the audio-visual models introduced in this paper. First, the audio and visual features (Section 3) are extracted from each frame of the audio-visual sequence. The sequence of visual features, which describes the mouth deformation over consecutive frames, is upsampled to match the frequency of the audio observation vectors. Finally, both the factorial and the coupled HMM (Section 4) are used for audio-visual integration, and their performance for AVSR in terms of parameter complexity, computational efficiency (Section 5), and recognition accuracy (Section 6) is compared to existing models used in current AVSR systems.

2. RELATED WORK

Audio-visual speech recognition has emerged in recent years as an active field, gathering researchers in computer vision, signal and speech processing, and pattern recognition [2]. With the selection of acoustic features for speech recognition well understood [6], robust visual feature extraction and the selection of the audio-visual integration model are the leading research areas in audio-visual speech recognition.

Visual features are often derived from the shape of the mouth [7, 8, 9, 10]. Although very popular, these methods rely exclusively on the accurate detection of the lip contours, which is often a challenging task under varying illumination conditions and rotations of the face. An alternative approach is to obtain visual features from the transformed gray-scale intensity image of the lip region. Several intensity or appearance modeling techniques have been studied, including principal component analysis [9], linear discriminant analysis (LDA), the discrete cosine transform (DCT), and the maximum likelihood linear transform [2]. Methods that combine shape and appearance modeling were presented in [2, 11].

Existing techniques for audio-visual (AV) integration [2, 10, 12] consist of feature fusion and decision fusion methods. In the feature fusion method, the observation vectors are obtained by the concatenation of the audio and visual features, which can be followed by a dimensionality reduction transform [13]. The resulting observation sequences are modeled using a left-to-right hidden Markov model (HMM) [6], as described in Figure 2. In decision fusion systems, the class conditional likelihood of each modality is combined at different levels (state, phone, or word) to generate an overall conditional likelihood used in recognition.
Some of the most successful decision fusion models include the multi-stream HMM, the product HMM, and the independent HMM. The multi-stream HMM [14] assumes that the audio and video sequences are state synchronous but, unlike the HMM for feature fusion, allows the likelihoods of the audio and visual observation sequences to be computed independently. This makes it possible to weigh the relative contribution of the audio and visual likelihoods to the overall likelihood based on the reliability of the corresponding stream at different levels of acoustic noise. Although more flexible than the HMM, the multi-stream HMM cannot accurately describe the natural state asynchrony of audio-visual speech. The audio-visual multi-stream product HMM [11, 14, 15, 16, 17], illustrated in Figure 3, can be seen as an extension of the previous model obtained by representing each hidden state of the multi-stream HMM as a pair of one audio and one visual state. Due to its structure, the multi-stream product HMM allows for audio-video state asynchrony, controlled through the state transition matrix of the model, and forces the audio and video streams to be in synchrony at the model boundaries (phone level in continuous speech recognition systems or word level in isolated word recognition systems). The audio-visual sequences can also be modeled using two independent HMMs [2], one for the audio and one for the visual features. This model extends the level of asynchrony between the audio and visual states beyond the previous models, but fails to preserve the natural dependency over time of the acoustic and visual features of speech.
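To make the decision fusion idea concrete, the sketch below combines word-level audio and visual log-likelihoods with reliability exponents. It is a minimal Python illustration, not the implementation of any of the systems cited above; the function name, the dictionaries, and the exponent values are assumptions of the sketch.

# Word-level decision fusion sketch (illustrative only).
# audio_loglik and video_loglik map each word model to the log-likelihood
# of the corresponding audio or visual observation sequence.

def fuse_and_decode(audio_loglik, video_loglik, lambda_a=0.5):
    """Combine stream log-likelihoods with exponents lambda_a + lambda_v = 1."""
    lambda_v = 1.0 - lambda_a
    scores = {word: lambda_a * audio_loglik[word] + lambda_v * video_loglik[word]
              for word in audio_loglik}
    return max(scores, key=scores.get)

# At low acoustic SNR, a small lambda_a shifts the decision toward the video stream.
print(fuse_and_decode({"one": -120.0, "two": -95.0},
                      {"one": -60.0, "two": -80.0}, lambda_a=0.2))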

Figure 3: The state transition diagram of a product HMM.

3. VISUAL FEATURE EXTRACTION

Robust location of the facial features, especially the mouth region, and the extraction of a discriminant set of visual observation vectors are the two key elements of the AVSR system. The cascade algorithm for visual feature extraction used in our AVSR system consists of the following steps: face detection, mouth region detection, lip contour extraction, mouth region normalization and windowing, and 2D-DCT and LDA coefficient extraction. Next, we describe the steps of the cascade algorithm in more detail.

The extraction of the visual features starts with the detection of the speaker's face in the video sequence. The face detector used in our system is described in [18]. The lower half of the detected face (Figure 4a) is a natural choice for the initial estimate of the mouth region.

Next, LDA is used to assign the pixels in the mouth region to the lip and face classes. LDA transforms the pixel values from the RGB chromatic space into a one-dimensional space that best separates the two classes. The optimal linear discriminant space [19] is computed off-line using a set of manually segmented images of the lip and face regions. Figure 4b shows a binary image of the lip segmentation from the lower region of the face in Figure 4a.

The contour of the lips (Figure 4c) is obtained through the binary chain encoding method [20] followed by a smoothing operation. Figures 5a, 5b, 5c, 5d, 5e, 5f, and 5h show several successful results of the lip contour extraction. Due to the wide variety of skin and lip tones, the mouth segmentation, and therefore the lip contour extraction, may produce inaccurate results (Figures 5i and 5j).

Figure 5: Examples of the mouth contour extraction.

The lip contour is used to estimate the size and the rotation of the mouth in the image plane. Using an affine transform, a rotation and size normalized grayscale region of the mouth (64 × 64 pixels) is obtained from each frame of the video sequence (Figure 4d). However, not all the pixels in the mouth region have the same relevance for visual speech recognition. In our experiments we found that, as expected, the most significant information for speech recognition is contained in the pixels inside the lip contour. Therefore, we use an exponential window $w[x, y] = \exp(-((x - x_0)^2 + (y - y_0)^2)/\sigma^2)$, with $\sigma = 12$, to multiply the pixel values in the grayscale normalized mouth region. The window of size 64 × 64 is centered at the center of the mouth region $(x_0, y_0)$. Figure 4e illustrates the result of the mouth region windowing.

Figure 4: (a) The lower region of the face used as an initial estimate for the mouth location, (b) binary image representing the mouth segmentation results, (c) the result of the lip contour extraction, (d) the scale and rotation normalized mouth region, (e) the result of the normalized mouth region windowing.

Next, the normalized and windowed mouth region is decomposed into eight blocks of height 32 and width 16, and the 2D-DCT transform is applied to each of these blocks. A set of four 2D-DCT coefficients, taken from a window of size 2 × 2 at the lowest frequencies of the 2D-DCT domain, is extracted from each block. The resulting coefficients are arranged in a vector of size 32.
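As an illustration of the windowing and block-DCT stage described above, the following Python sketch assumes a 64 × 64 normalized grayscale mouth image and uses scipy's DCT routine. It is a simplified reconstruction, not the authors' code; the helper name and the exact handling of the window center and sigma are assumptions.

import numpy as np
from scipy.fftpack import dct

def mouth_dct_features(mouth, sigma=12.0):
    """Window a 64x64 normalized mouth image and extract 4 low-frequency
    2D-DCT coefficients from each of the eight 32x16 blocks (32 values total)."""
    h, w = mouth.shape                       # expected 64 x 64
    y, x = np.mgrid[0:h, 0:w]
    y0, x0 = (h - 1) / 2.0, (w - 1) / 2.0
    window = np.exp(-((x - x0) ** 2 + (y - y0) ** 2) / sigma ** 2)
    windowed = mouth * window
    feats = []
    for by in range(0, h, 32):               # blocks of height 32
        for bx in range(0, w, 16):           # and width 16
            block = windowed[by:by + 32, bx:bx + 16]
            coeffs = dct(dct(block, norm='ortho', axis=0), norm='ortho', axis=1)
            feats.extend(coeffs[:2, :2].ravel())   # 2x2 lowest-frequency coefficients
    return np.asarray(feats)                 # vector of size 32

# Example with a random image standing in for a normalized mouth region.
print(mouth_dct_features(np.random.rand(64, 64)).shape)   # (32,)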
In the final stage of the video feature extraction cascade, the multi-class LDA [19] is applied to the vectors of 2D-DCT coefficients. For our isolated word speech recognition system, the classes of the LDA are associated with the words available in the database. A set of 15 coefficients, corresponding to the most significant generalized eigenvalues of the LDA decomposition, is used as the visual observation vector.
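A minimal sketch of this projection step is shown below, using scikit-learn's LinearDiscriminantAnalysis as a stand-in for the multi-class LDA of [19], with word identities as class labels. The data here are random placeholders, and the exact generalized eigen-decomposition used by the authors may differ.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Placeholder training data: 32-dimensional block-DCT vectors labeled by word identity.
rng = np.random.default_rng(0)
dct_vectors = rng.normal(size=(500, 32))
word_labels = rng.integers(0, 20, size=500)        # e.g., a 20-word vocabulary

# Project onto the 15 most discriminative directions to obtain the visual observations.
lda = LinearDiscriminantAnalysis(n_components=15)
visual_obs = lda.fit_transform(dct_vectors, word_labels)
print(visual_obs.shape)                             # (500, 15)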

4. THE AUDIO-VISUAL MODEL

The audio-visual models used in existing AVSR systems, as well as the audio-visual models discussed in this paper, are special cases of dynamic Bayesian networks (DBN) [5, 21, 22]. DBNs are directed graphical models of stochastic processes in which the hidden state is represented in terms of individual variables or factors. A DBN is specified by a directed acyclic graph, which represents the conditional independence assumptions and the conditional probability distributions of each node [23, 24]. With the DBN representation, the classification of the decision fusion models can be seen in terms of the independence assumptions of the transition probabilities and of the conditional likelihood of the observed and hidden nodes. Figure 6 represents an HMM as a DBN. The transparent squares represent the hidden discrete nodes (variables), while the shaded circles represent the observed continuous nodes. Throughout this paper, we will refer to the hidden nodes conditioned over time as coupled or backbone nodes and to the remaining hidden nodes as mixture nodes. The variables associated with the backbone nodes represent the states of the HMM, while the values of the mixture nodes represent the mixture component associated with each of the states of the backbone nodes.

Figure 6: The audio-visual HMM (backbone nodes, mixture nodes, and observation nodes at times t = 0, 1, ..., T-1).

The parameters of the HMM [6] are

$\pi(i) = P(q_1 = i)$,
$b_t(i) = P(O_t \mid q_t = i)$,   (1)
$a(i \mid j) = P(q_t = i \mid q_{t-1} = j)$,

where $q_t$ is the state of the backbone node at time $t$, $\pi(i)$ is the initial state distribution for state $i$, $a(i \mid j)$ is the state transition probability from state $j$ to state $i$, and $b_t(i)$ represents the probability of the observation $O_t$ given the $i$th state of the backbone nodes. The observation probability is generally modeled using a mixture of Gaussian components.

Introduced for audio-only speech recognition, the multi-stream HMM (MSHMM) became a popular model for multimodal sequences such as audio-visual speech. In a MSHMM the observation likelihood is obtained as

$b_t(i) = \prod_{s=1}^{S} \left[ \sum_{m=1}^{M_i^s} w_{i,m}^s \, N\left(O_t^s; \mu_{i,m}^s, U_{i,m}^s\right) \right]^{\lambda_s}$,   (2)

where $S$ represents the total number of streams, $\lambda_s$ ($\sum_s \lambda_s = 1$, $\lambda_s \geq 0$) are the stream exponents, $O_t^s$ is the observation vector of the $s$th stream at time $t$, $M_i^s$ is the number of mixture components in stream $s$ and state $i$, and $\mu_{i,m}^s$, $U_{i,m}^s$, $w_{i,m}^s$ are the mean, covariance matrix, and mixture weight for the $s$th stream, $i$th state, and $m$th Gaussian mixture component, respectively. The two streams ($S = 2$) of the audio-visual MSHMM (AV MSHMM) model the audio and the video sequence. For the AV MSHMM, as well as for the HMM used in video-only or audio-only speech recognition, all covariance matrices are assumed diagonal, and the transition probability matrix reflects the left-to-right state evolution

$a(i \mid j) = 0$, if $i \notin \{j, j+1\}$.   (3)
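The state observation likelihood in (2) can be sketched in the log domain as follows. Diagonal covariances are assumed, as in the AV MSHMM, and all parameter values below are placeholders rather than trained model parameters.

import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def log_b_mshmm(obs, weights, means, covs, exponents):
    """log b_t(i) in (2): sum_s lambda_s * log( sum_m w^s_{i,m} N(O^s_t; mu^s_{i,m}, U^s_{i,m}) )."""
    log_b = 0.0
    for o, w, mu, cov, lam in zip(obs, weights, means, covs, exponents):
        log_mix = logsumexp([np.log(w[m]) + multivariate_normal.logpdf(o, mean=mu[m], cov=np.diag(cov[m]))
                             for m in range(len(w))])
        log_b += lam * log_mix
    return log_b

# Two streams (audio, video), each with a 2-component diagonal-covariance mixture.
obs = [np.zeros(13), np.zeros(15)]                 # audio and visual observation vectors
weights = [np.array([0.6, 0.4]), np.array([0.5, 0.5])]
means = [np.zeros((2, 13)), np.zeros((2, 15))]
covs = [np.ones((2, 13)), np.ones((2, 15))]
print(log_b_mshmm(obs, weights, means, covs, exponents=[0.7, 0.3]))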

The audio and visual state synchrony imposed by the AV MSHMM can be relaxed using models that allow one hidden backbone node per stream at each time $t$. Figure 7 illustrates a two-stream independent HMM (IHMM) represented as a DBN. Let $i = \{i^1, \ldots, i^S\}$ be some set of states of the backbone nodes, $N_s$ the number of states of the backbone nodes in stream $s$, $q_t^s$ the state of the backbone node in stream $s$ at time $t$, and $q_t = \{q_t^1, \ldots, q_t^S\}$. Formally, the parameters of an IHMM are

$\pi(i) = \prod_s \pi^s(i^s) = \prod_s P(q_1^s = i^s)$,   (4)
$b_t(i) = \prod_s b_t^s(i^s) = \prod_s P(O_t^s \mid q_t^s = i^s)$,   (5)
$a(i \mid j) = \prod_s a^s(i^s \mid j^s) = \prod_s P(q_t^s = i^s \mid q_{t-1}^s = j^s)$,   (6)

where $\pi^s(i^s)$ and $b_t^s(i^s)$ are the initial state distribution and the observation probability of state $i^s$ in stream $s$, respectively, and $a^s(i^s \mid j^s)$ is the state transition probability from state $j^s$ to state $i^s$ in stream $s$. For the audio-visual IHMM (AV IHMM), each of the two HMMs, describing the audio or the video sequence, is constrained to a left-to-right structure, and the observation likelihood $b_t^s(i^s)$ is computed using a mixture of Gaussian density functions with diagonal covariance matrices. The AV IHMM allows for more flexibility than the AV MSHMM in modeling the state asynchrony but fails to model the natural correlation in time between the audio and visual components of speech. This is a result of the independent modeling of the transition probabilities (see (6)) and of the observation likelihood (see (5)).

Figure 7: A two-stream independent HMM.

A product HMM (PHMM) can be seen as a standard HMM, where each backbone state is represented by a set of states, one for each stream [17]. The parameters of a PHMM are

$\pi(i) = P(q_1 = i)$,   (7)
$b_t(i) = P(O_t \mid q_t = i)$,   (8)
$a(i \mid j) = P(q_t = i \mid q_{t-1} = j)$,   (9)

where $O_t$ can be obtained through the concatenation of the observation vectors in each stream,

$O_t = \left[ O_t^{1T}, \ldots, O_t^{ST} \right]^T$.   (10)

The observation likelihood can be computed using a Gaussian density or a mixture with Gaussian components. The use of the PHMM in AVSR is justified primarily because it allows for state asynchrony, since each of the coupled nodes can be in any combination of audio and visual states. In addition, unlike the IHMM, the PHMM preserves the natural correlation of the audio and visual features due to the joint probability modeling of both the observation likelihood (see (8)) and the transition probabilities (see (9)). For the PHMM used in AVSR, denoted in this paper as the audio-visual PHMM (AV PHMM), the audio and visual state asynchrony is limited to a maximum of one state. Formally, the transition probability from state $j = [j^a, j^v]$ to state $i = [i^a, i^v]$ is given by

$a(i \mid j) = 0$, if $i^s \notin \{j^s, j^s + 1\}$, $s \in \{a, v\}$, or $|i^a - i^v| \geq 2$,   (11)

where the indices $a$ and $v$ denote the audio and video stream, respectively.

Figure 8: The audio-visual product HMM.

In the AV PHMM described in this paper (Figure 8) the observation likelihood is computed using

$b_t(i) = \sum_{m=1}^{M_i} w_{i,m} \prod_s N\left(O_t^s; \mu_{i,m}^s, U_{i,m}^s\right)^{\lambda_s}$,   (12)

where $M_i$ represents the number of mixture components associated with state $i$, $\mu_{i,m}^s$ and $U_{i,m}^s$ are the mean and the diagonal covariance matrix corresponding to stream $s$ given the state $i$ and mixture component $m$, and $w_{i,m}$ are the mixture weights corresponding to the state $i$. Unlike the MSHMM (see (2)), the likelihood representation used for the PHMM in (12) models the stream observations jointly through their dependency on the same mixture node. In this paper, the model parameters are trained with the stream exponents fixed to $\lambda_s = 1$. For testing, the stream exponents are chosen to maximize the average recognition rate at different acoustic signal-to-noise ratio (SNR) levels. Since in the PHMM both the transition and observation likelihoods are jointly computed, and in the IHMM both the transition and observation likelihoods in each stream are independent, these models can be considered extreme cases of a range of models that combine the joint and independent modeling of the transition probabilities and observation likelihoods.
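The contrast between (2) and (12) can be made concrete with a short sketch: in (12) a single mixture node is shared by the streams, so the sum over mixture components is taken outside the product of the stream Gaussians. The values below are placeholders, and diagonal covariances are assumed as in the text.

import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def log_b_phmm(obs, weights, means, covs, exponents):
    """log b_t(i) for (12): one mixture node shared by all streams,
    log sum_m [ w_{i,m} * prod_s N(O^s_t; mu^s_{i,m}, U^s_{i,m})^{lambda_s} ]."""
    terms = []
    for m in range(len(weights)):
        log_term = np.log(weights[m])
        for o, mu, cov, lam in zip(obs, means, covs, exponents):
            log_term += lam * multivariate_normal.logpdf(o, mean=mu[m], cov=np.diag(cov[m]))
        terms.append(log_term)
    return logsumexp(terms)

# Two streams sharing a 2-component mixture; per-stream means/covariances are indexed by m.
obs = [np.zeros(13), np.zeros(15)]
weights = np.array([0.6, 0.4])
means = [np.zeros((2, 13)), np.zeros((2, 15))]      # means[s][m]
covs = [np.ones((2, 13)), np.ones((2, 15))]
print(log_b_phmm(obs, weights, means, covs, exponents=[0.7, 0.3]))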
Two of these models, namely the factorial HMM and the coupled HMM, and their application in audio-visual integration will be discussed next.

Figure 9: The audio-visual factorial HMM.

4.1. The audio-visual factorial hidden Markov model

The factorial HMM (FHMM) [4] is a generalization of the HMM suitable for a large range of multimedia applications that integrate two or more streams of data. The FHMM generalizes an HMM by representing the hidden state by a set of variables or factors. In other words, it uses a distributed representation of the hidden state. These factors are assumed to be independent of each other, but they all contribute to the observations, and hence become coupled indirectly due to the "explaining away" effect [23]. The elements of a factorial HMM are described as

$\pi(i) = P(q_1 = i)$,   (13)
$b_t(i) = P(O_t \mid q_t = i)$,   (14)
$a(i \mid j) = \prod_s a^s(i^s \mid j^s) = \prod_s P(q_t^s = i^s \mid q_{t-1}^s = j^s)$.   (15)

It can be seen that, as with the IHMM, the transition probabilities of the FHMM are computed using the independence assumption between the hidden states or factors in each of the HMMs (see (15)). However, as with the PHMM, the observation likelihood is jointly computed from all the hidden states (see (14)). The observation likelihood can be computed using a continuous mixture with Gaussian components. The FHMM used in AVSR, denoted in this paper as the audio-visual FHMM (AV FHMM), has a set of modifications from the general model. In the AV FHMM used in this paper (Figure 9), the observation likelihoods are obtained from the multi-stream representation as described in (12). To model the causality in speech generation, the following constraint is imposed on the transition probability matrices of the AV FHMM:

$a^s(i^s \mid j^s) = 0$, if $i^s \notin \{j^s, j^s + 1\}$,   (16)

where $s \in \{a, v\}$.

4.1.1. Training factorial HMMs

As is well known, DBNs can be trained using the expectation maximization (EM) algorithm (see, e.g., [22]). The EM algorithm for the FHMM is described in Appendix A. However, this only converges to a local optimum, making the choice of the initial parameters of the model a critical issue. In this paper, we present an efficient method for initialization using a Viterbi algorithm derived for the FHMM. The Viterbi algorithm for FHMMs is described below for an utterance $O_1, \ldots, O_T$ of length $T$.

(i) Initialization:
$\delta_1(i) = \pi(i) b_1(i)$,   (17)
$\psi_1(i) = 0$.   (18)

(ii) Recursion:
$\delta_t(i) = \max_j \left[ \delta_{t-1}(j) a(i \mid j) \right] b_t(i)$,
$\psi_t(i) = \arg\max_j \left[ \delta_{t-1}(j) a(i \mid j) \right]$.   (19)

(iii) Termination:
$P^* = \max_i \delta_T(i)$,
$q_T^* = \arg\max_i \delta_T(i)$.   (20)

(iv) Backtracking:
$q_t^* = \psi_{t+1}\left(q_{t+1}^*\right)$,   (21)

where $P^* = \max_{q_1, \ldots, q_T} P(O_1, \ldots, O_T, q_1, \ldots, q_T)$, and $a(i \mid j)$ is obtained using (15). Note that, as with the HMM, the Viterbi algorithm can be computed using the logarithms of the model parameters and additions instead of multiplications.
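The recursion (17)-(21) can be sketched directly over the composite states $i = (i^a, i^v)$, with $a(i \mid j)$ factored as in (15). The log-domain implementation below is a simplified illustration; the array layout and the placeholder parameters in the example are assumptions of the sketch, not the parameters of the trained system.

import numpy as np

def fhmm_viterbi(log_b, log_pi_s, log_a_s):
    """Viterbi decoding for a two-stream FHMM over composite states (i_a, i_v).
    log_b[t, ia, iv] : log b_t(i), jointly computed from both factors (14)
    log_pi_s         : [log pi^a, log pi^v], per-stream initial distributions
    log_a_s          : [log a^a, log a^v], per-stream transitions with a^s[j, i] (15)."""
    T, Na, Nv = log_b.shape
    # log a(i|j) = log a^a(ia|ja) + log a^v(iv|jv), laid out over (ja, jv, ia, iv)
    log_a = log_a_s[0][:, None, :, None] + log_a_s[1][None, :, None, :]
    log_pi = log_pi_s[0][:, None] + log_pi_s[1][None, :]

    delta = log_pi + log_b[0]                                   # (17)
    psi = np.zeros((T, Na, Nv, 2), dtype=int)                   # (18)
    for t in range(1, T):
        scores = delta[:, :, None, None] + log_a                # over (ja, jv, ia, iv)
        flat = scores.reshape(Na * Nv, Na, Nv)
        best = flat.argmax(axis=0)                              # best previous composite state
        psi[t, :, :, 0], psi[t, :, :, 1] = np.unravel_index(best, (Na, Nv))   # (19)
        delta = flat.max(axis=0) + log_b[t]
    ia, iv = np.unravel_index(delta.argmax(), (Na, Nv))          # (20)
    path = [(int(ia), int(iv))]
    for t in range(T - 1, 0, -1):                                # (21)
        ia, iv = psi[t, ia, iv]
        path.append((int(ia), int(iv)))
    return delta.max(), path[::-1]

# Example with random placeholder parameters: 3-state audio and 3-state video factors.
rng = np.random.default_rng(0)
T, Na, Nv = 10, 3, 3
log_b = rng.normal(size=(T, Na, Nv))
log_pi = [np.log(np.full(Na, 1.0 / Na)), np.log(np.full(Nv, 1.0 / Nv))]
log_a = [np.log(np.full((Na, Na), 1.0 / Na)), np.log(np.full((Nv, Nv), 1.0 / Nv))]
print(fhmm_viterbi(log_b, log_pi, log_a)[1])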
The initialization of the training algorithm iteratively updates the initial parameters of the model from the optimal segmentation of the hidden states. The state segmentation algorithm described in this paper reduces the complexity of the search for the optimal sequence of backbone and mixture nodes using the following steps. First, we use the Viterbi algorithm, as described above, to determine the optimal sequence of states for the backbone nodes. Second, we obtain the most likely assignment to the mixture nodes. Given these optimal assignments to the hidden nodes, the appropriate sets of parameters are updated. For the FHMM with $\lambda_s = 1$ and general covariance matrices, the initialization of the training algorithm is described below.

Step 1. Let $R$ be the number of training examples and let $O_{r,1}^s, \ldots, O_{r,T_r}^s$ be the observation sequence of length $T_r$ corresponding to the $s$th stream of the $r$th ($1 \leq r \leq R$) training example. First, the observation sequences $O_{r,1}^s, \ldots, O_{r,T_r}^s$ are uniformly segmented according to the number of states of the backbone nodes $N_s$. Then, a new sequence of observation vectors is obtained by concatenating the observation vectors assigned to each state $i^s$, $s = 1, \ldots, S$. For each state set $i$ of the backbone nodes, the mixture parameters are initialized using the K-means algorithm [19] with $M_i$ clusters.

Step 2. The new parameters of the model are estimated from the segmented data:

$\mu_{i,m}^s = \frac{\sum_{r,t} \gamma_{r,t}(i, m)\, O_{r,t}^s}{\sum_{r,t} \gamma_{r,t}(i, m)}$,

$U_{i,m}^s = \frac{\sum_{r,t} \gamma_{r,t}(i, m) \left(O_{r,t}^s - \mu_{i,m}^s\right)\left(O_{r,t}^s - \mu_{i,m}^s\right)^T}{\sum_{r,t} \gamma_{r,t}(i, m)}$,

$w_{i,m} = \frac{\sum_{r,t} \gamma_{r,t}(i, m)}{\sum_{r,t} \sum_{m'} \gamma_{r,t}(i, m')}$,

$a^s(i \mid j) = \frac{\sum_{r,t} \epsilon_{r,t}^s(i, j)}{\sum_{r,t} \sum_{l} \epsilon_{r,t}^s(i, l)}$,   (22)

where

$\gamma_{r,t}(i, m) = 1$ if $q_{r,t} = i$ and $c_{r,t} = m$, and $0$ otherwise,
$\epsilon_{r,t}^s(i, j) = 1$ if $q_{r,t}^s = i$ and $q_{r,t-1}^s = j$, and $0$ otherwise,   (23)

and where $q_{r,t}^s$ represents the state of the $t$th backbone node in the $s$th stream of the $r$th observation sequence, and $c_{r,t}$ is the mixture component of the $r$th observation sequence at time $t$.

Step 3. An optimal state sequence $q_{r,1}, \ldots, q_{r,T_r}$ of the backbone nodes is obtained for the $r$th observation sequence using the Viterbi algorithm described above. The mixture component $c_{r,t}$ is obtained as

$c_{r,t} = \arg\max_{m=1,\ldots,M_i} P\left(O_{r,t} \mid q_{r,t} = i, c_{r,t} = m\right)$.   (24)

Step 4. The iterations in Steps 2 and 3 are repeated until the difference between the observation probabilities of the training sequences at consecutive iterations falls below a convergence threshold.

4.1.2. Recognition using the factorial HMM

To classify a word, the log likelihood of each model is computed using the Viterbi algorithm described in the previous section. The parameters of the FHMM corresponding to each word in the database are obtained in the training stage using clean audio signals (SNR = 30 db). In the recognition stage, the audio tracks of the testing sequences are altered by white noise at different SNR levels. The influence of the audio and visual observation streams is weighted based on the relative reliability of the audio and visual features at different levels of acoustic SNR. Formally, the observation likelihoods are computed using the multi-stream representation in (12). The values of the audio and visual exponents $\lambda_s$, $s \in \{a, v\}$, corresponding to a specific acoustic SNR level are obtained experimentally to maximize the average recognition rate. Figure 10 illustrates the variation of the audio-visual speech recognition rate for different values of the audio exponent $\lambda_a$ and different values of SNR. Note that each of the AVSR curves at all SNR levels reaches a smooth maximum. This is particularly important in designing robust AVSR systems and allows the exponents to be chosen from a relatively large range of values. Table 1 describes the audio exponents $\lambda_a$ used in our system, which were derived from Figure 10. As expected, the value of the optimal audio exponent decays with the decay of the SNR level, showing the increased reliability of the video at low acoustic SNR.

Table 1: The optimal set of exponents for the audio stream $\lambda_a$ for the FHMM at different SNR values of the acoustic speech.

SNR (db)     30   28   26   24   22   20   18   16   14   12   10
$\lambda_a$  0.8  0.8  0.7  0.6  0.6  0.3  0.2  0.1  0.1  0.1  0.0

4.2. The audio-visual coupled hidden Markov model

The coupled HMM (CHMM) [3] is a DBN that allows the backbone nodes to interact, while at the same time having their own observations. In the past, CHMMs have been used to model hand gestures [3], the interaction between speech and hand gestures [25], and audio-visual speech [26, 27]. Figure 11 illustrates a continuous mixture two-stream CHMM used in our audio-visual speech recognition system. The elements of the coupled HMM are described as

$\pi(i) = \prod_s \pi^s(i^s) = \prod_s P(q_1^s = i^s)$,   (25)
$b_t(i) = \prod_s b_t^s(i^s) = \prod_s P(O_t^s \mid q_t^s = i^s)$,   (26)
$a(i \mid j) = \prod_s a^s(i^s \mid j) = \prod_s P(q_t^s = i^s \mid q_{t-1} = j)$.   (27)

Note that, in general, to decrease the complexity of the model, the dependency of a backbone node at time $t$ is restricted to its neighboring backbone nodes at time $t - 1$.
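A small sketch of the coupled transition probability in (27) for two streams is given below: each stream's transition table is conditioned on the joint backbone state at the previous time. The indexing convention a^s[j_a, j_v, i^s] and the placeholder values are assumptions of this sketch; the left-to-right constraints on these tables are discussed next.

import numpy as np

def chmm_log_transition(log_a_audio, log_a_video, j, i):
    """log a(i|j) = log a^a(i^a | j) + log a^v(i^v | j), as in (27), where
    j = (j_a, j_v) is the joint backbone state at time t-1 and i = (i_a, i_v)
    is the pair of stream states at time t."""
    ja, jv = j
    ia, iv = i
    return log_a_audio[ja, jv, ia] + log_a_video[ja, jv, iv]

# Placeholder tables: each stream's transitions are conditioned on the joint previous state.
Na, Nv = 5, 5
log_a_audio = np.log(np.full((Na, Nv, Na), 1.0 / Na))
log_a_video = np.log(np.full((Na, Nv, Nv), 1.0 / Nv))
print(chmm_log_transition(log_a_audio, log_a_video, j=(1, 1), i=(2, 1)))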
As with the IHMM, in the CHMM the computation of the observation likelihood assumes the independence of the observation likelihoods in each stream. However, the transition probability of each coupled node is computed as the joint probability of the set of states at the previous time. With the constraint $a^s(i^s \mid j) = a^s(i^s \mid j^s)$, a CHMM reduces to an IHMM.

For the audio-visual CHMM (AV CHMM), the observation likelihoods of the audio and video streams are computed using a mixture of Gaussians with diagonal covariance matrices, and the transition probability matrix is constrained to reflect the natural audio-visual speech dependencies:

$a^s(i^s \mid j) = 0$, if $i^s \notin \{j^s, j^s + 1\}$ or $|i^s - j^{s'}| \geq 2$, $s' \neq s$,   (28)

where $s, s' \in \{a, v\}$. The CHMM also relates to the Boltzmann zipper [28] used in audio-visual speech recognition. The Boltzmann zipper consists of two linear Boltzmann networks connected such that they can influence each other. Figure 12 illustrates a Boltzmann zipper where each of the Boltzmann chains is represented as an HMM. Note that although the connections between nodes within the same Boltzmann chain can be seen as transition probabilities of an HMM, the connections between nodes of different chains do not have the same significance [19]. Due to its structure, the

Figure 10: The FHMM recognition rate against SNR for different audio exponents. Each panel plots the recognition rate (%) of the audio-visual, audio-only, and video-only systems against the audio exponent, for acoustic SNR levels from 12 db to 30 db.

