Audio-Visual Automatic Speech Recognition


Audio-Visual Automatic Speech Recognition

Helge Reikeras

June 30, 2010
SciPy 2010: Python for Scientific Computing Conference

Outline: Introduction, Acoustic speech, Visual speech, Modeling, Experimental results, Conclusion

Introduction 1/2

What? Integration of the audio and visual speech modalities with the purpose of enhancing speech recognition performance.

Why?
- McGurk effect (e.g. a visual /ga/ combined with an audio /ba/ is heard as /da/)
- Performance increase in noisy environments
- Progress in speech recognition seems to be stagnating

Introduction 2/2

Example: YouTube automatic captions

Acoustic speech: MFCCs (1/2)

Mel-frequency cepstrum coefficients (MFCCs): the cosine transform of the logarithm of the short-term energy spectrum of a signal, expressed on the mel-frequency scale. The result is a set of coefficients that approximates the way the human auditory system perceives sound.
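The computation described above can be sketched in NumPy/SciPy. The talk itself used scikits.talkbox.features.mfcc; this minimal single-frame version, with hypothetical helper names, only illustrates the mel filterbank, logarithm, and cosine-transform steps:

```python
import numpy as np
from scipy.fftpack import dct

def mel(f):
    """Convert frequency in Hz to the mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    mel_points = np.linspace(mel(0), mel(sr / 2), n_filters + 2)
    hz_points = 700.0 * (10.0 ** (mel_points / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_points / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def mfcc_frame(frame, sr, n_filters=26, n_ceps=13):
    """MFCCs for one windowed frame: power spectrum -> mel energies -> log -> DCT."""
    n_fft = len(frame)
    spectrum = np.abs(np.fft.rfft(frame)) ** 2          # short-term power spectrum
    energies = mel_filterbank(n_filters, n_fft, sr) @ spectrum
    log_e = np.log(energies + 1e-10)                    # log mel energies
    return dct(log_e, type=2, norm='ortho')[:n_ceps]    # cosine transform

sr = 16000
t = np.arange(0, 0.032, 1 / sr)                         # one 32 ms frame
frame = np.hamming(len(t)) * np.sin(2 * np.pi * 440 * t)
print(mfcc_frame(frame, sr).shape)
```

In practice MFCCs are computed over overlapping frames of a longer signal, often with delta features appended.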

Acoustic speech: MFCCs (2/2)

[Figure: speech waveform (amplitude vs. time, 0-2.5 sec) and the corresponding MFCC features]

Visual speech: Active appearance models (1/3)

Visual speech information is mainly contained in the motion of visible articulators such as the lips, tongue and jaw.

Active appearance models (shape) (2/3)

    s = s_0 + \sum_{i=1}^{N} p_i s_i    (PCA)

Active appearance models (appearance) (3/3)

    A(x) = A_0(x) + \sum_{i=1}^{M} \lambda_i A_i(x),  x \in s_0    (PCA)
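The shape model on the previous slide can be sketched with NumPy. The training shapes here are random stand-ins and all names are illustrative; a real AAM would use annotated landmark coordinates:

```python
import numpy as np

# Hypothetical training set: 50 shapes, each 20 landmarks (x, y) flattened.
rng = np.random.default_rng(0)
shapes = rng.normal(size=(50, 40))

# Mean shape s_0 and PCA modes s_i via SVD of the centered data.
s0 = shapes.mean(axis=0)
U, S, Vt = np.linalg.svd(shapes - s0, full_matrices=False)
n_modes = 5
basis = Vt[:n_modes]            # rows are the shape modes s_i

# Synthesize a shape: s = s_0 + sum_i p_i * s_i
p = np.zeros(n_modes)
p[0] = 2.0
s = s0 + p @ basis
print(s.shape)
```

The appearance model A(x) is built the same way, with PCA applied to pixel intensities sampled inside the base shape instead of landmark coordinates.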

Facial feature tracking (1/2)

Minimize the difference between the AAM and the input image (warped onto the base shape s_0). The warp W(x; p) is a piecewise affine transformation (triangulated base shape). This is a nonlinear least squares problem:

    \arg\min_{\lambda, p} \sum_{x \in s_0} \left[ A_0(x) + \sum_{i=1}^{M} \lambda_i A_i(x) - I(W(x; p)) \right]^2

Solve using nonlinear numerical optimization methods.
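A toy 1-D analogue of this objective can be fit with scipy.optimize.least_squares, substituting a simple shift for the piecewise affine warp; all signals and names here are hypothetical, chosen only to show the structure of the residual:

```python
import numpy as np
from scipy.optimize import least_squares

# Template A0 plus one appearance mode A1; the "image" I is the template
# warped by a shift true_p and mixed with weight true_lam.
x = np.linspace(0, 2 * np.pi, 200, endpoint=False)
A0 = np.sin(x)                                   # base appearance
A1 = np.cos(x)                                   # one appearance mode
true_p, true_lam = 0.7, 0.3
I = np.sin(x + true_p) + true_lam * np.cos(x + true_p)

def residual(theta):
    """A0(x) + lam*A1(x) - I(W(x; p)) with W a plain shift."""
    lam, p = theta
    warped = np.interp(x + p, x, I, period=2 * np.pi)
    return A0 + lam * A1 - warped

sol = least_squares(residual, x0=[0.0, 0.0])
print(sol.x)
```

The recovered parameters are close to (0.3, -0.7): the shift that undoes the warp and the weight of the appearance mode. The real 2-D problem has the same shape, just with a piecewise affine warp and many shape/appearance parameters.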

Facial feature tracking (2/2)

[Figure: facial feature tracking results]

Modeling: Gaussian mixture models

Gaussian mixture models (GMMs) provide a powerful method for modeling data distributions: a weighted linear combination of Gaussian distributions.

    p(x) = \sum_{k=1}^{K} \pi_k N(x | \mu_k, \Sigma_k)

Data: x. Model parameters: weights \pi, means \mu, covariances \Sigma.

Expectation maximization (EM) (1/2)

The log likelihood function gives the likelihood of the data X = {x_1, x_2, ..., x_N} given the GMM model parameters:

    \ln p(X | \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left( \sum_{k=1}^{K} \pi_k N(x_n | \mu_k, \Sigma_k) \right)

EM is an iterative algorithm for maximizing the log likelihood function w.r.t. the GMM parameters.
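A minimal 1-D EM loop in NumPy makes the two alternating steps concrete (the data and initial values are synthetic; the talk's real features are higher-dimensional):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic 1-D data from two Gaussians.
X = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])

K = 2
pi = np.full(K, 1 / K)            # mixture weights
mu = np.array([-1.0, 1.0])        # means
var = np.ones(K)                  # variances

def log_lik():
    dens = pi * np.exp(-0.5 * (X[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return np.sum(np.log(dens.sum(axis=1)))

prev = -np.inf
for _ in range(100):
    # E-step: responsibilities gamma_{nk} = p(component k | x_n)
    dens = pi * np.exp(-0.5 * (X[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate pi, mu, var from responsibility-weighted data
    Nk = gamma.sum(axis=0)
    pi = Nk / len(X)
    mu = (gamma * X[:, None]).sum(axis=0) / Nk
    var = (gamma * (X[:, None] - mu) ** 2).sum(axis=0) / Nk
    cur = log_lik()
    assert cur >= prev - 1e-8     # EM never decreases the log likelihood
    prev = cur

print(np.sort(mu))
```

The recovered means end up close to the true values (-2 and 3), and the in-loop assertion checks EM's monotonicity guarantee.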

Expectation maximization (EM) (2/2)

[Figure: EM-GMM (16 mixture components) fitted to visual speech features p_3 vs. p_4]

(Note that in practice we use more than 2-dimensional feature vectors.)

Variational Bayesian (VB) inference (1/2)

How do we choose the number of Gaussian mixture components?

VB differs from EM in that the parameters are modeled as random variables. Suitable conjugate priors for the GMM parameters are:
- Weights: Dirichlet
- Means: Gaussian
- Covariances (precisions): Wishart

This avoids overfitting and singular solutions (when a Gaussian collapses onto a single data point), and leads to automatic model complexity selection.
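The talk used its own vb module; a modern way to reproduce the behavior is scikit-learn's BayesianGaussianMixture (an assumption on my part, not the code from the talk). With a Dirichlet prior on the weights, superfluous components are driven toward zero weight:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(2)
# Two true clusters, but deliberately over-provision 16 components.
X = np.concatenate([rng.normal(-2, 1, (300, 1)), rng.normal(3, 0.5, (200, 1))])

vb = BayesianGaussianMixture(
    n_components=16,
    weight_concentration_prior_type='dirichlet_distribution',  # Dirichlet prior on pi
    max_iter=500,
    random_state=0,
).fit(X)

# Unneeded components keep near-zero weight: model complexity is learned.
effective = np.sum(vb.weights_ > 0.01)
print(effective)
```

This mirrors the figure on the next slide: of 16 components, only the few that the data supports retain appreciable weight.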

Variational Bayesian (VB) inference (2/2)

[Figure: VB-GMM (16 mixture components) fitted to visual speech features p_3 vs. p_4]

The remaining components have converged to their prior distributions and been assigned zero weights.

Audio-visual fusion

Acoustic GMM: p(x_A | c)
Visual GMM: p(x_V | c)
Classification (e.g. words or phonemes) combines the streams with exponents \lambda_A, \lambda_V:

    Score(x_AV | c) = p(x_A | c)^{\lambda_A} p(x_V | c)^{\lambda_V}

    0 \le \lambda_A, \lambda_V \le 1,  \lambda_A + \lambda_V = 1

Learn the stream weights discriminatively: minimize the misclassification rate on a development set.
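In the log domain the score is just a weighted sum, which is how it is usually implemented; the per-class likelihoods below are made-up numbers for illustration:

```python
import numpy as np

def av_score(log_p_audio, log_p_visual, lam_a):
    """Log of p(x_A|c)^lam_a * p(x_V|c)^lam_v with lam_a + lam_v = 1."""
    lam_v = 1.0 - lam_a
    return lam_a * log_p_audio + lam_v * log_p_visual

# Hypothetical per-class log likelihoods for one observation, 3 classes.
log_pa = np.log(np.array([0.7, 0.2, 0.1]))   # acoustic GMM scores
log_pv = np.log(np.array([0.3, 0.6, 0.1]))   # visual GMM scores

for lam_a in (1.0, 0.5, 0.0):
    c = np.argmax(av_score(log_pa, log_pv, lam_a))
    print(lam_a, c)   # chosen class shifts as the stream weight moves
```

Sweeping lam_a over a development set and picking the value with the lowest misclassification rate is one simple way to learn the weights discriminatively.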

Summary

AUDIO → acoustic feature extraction → audio-only speech recognition
VIDEO → face detection → facial feature tracking → visual feature extraction → visual-only speech recognition
Both feature streams combined → audio-visual speech recognition

Python implementation

Implemented in Python using SciPy (the open source scientific computing Python library). Signal processing, computer vision and machine learning are active areas of development in the SciPy community.

SciPy modules used:
- scikits.talkbox.features.mfcc (MFCCs)
- scikits.image (image processing)
- scipy.optimize.fmin_ncg (facial feature tracking)
- scipy.learn.em (EM)

New modules developed as part of this research:
- vb (VB inference)
- aam (AAMs)

Experimental results (1/3)

Using the Clemson University audio-visual experiments (CUAVE) database. It contains video of 36 speakers, 19 male and 17 female, uttering isolated and connected digits in frontal and profile views and while moving.

Experimental results (2/3)

- Use separate training, development and test data sets (1/3, 1/3, 1/3).
- Add acoustic noise ranging from -5 dB to 25 dB.
- Test audio-only, visual-only and audio-visual classifiers at different levels of acoustic noise.
- Evaluate performance based on misclassification rate.
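Adding noise at a target signal-to-noise ratio can be sketched as follows (white Gaussian noise here; the talk does not specify the noise type, so treat this as an assumption):

```python
import numpy as np

def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise scaled to a target SNR in dB."""
    p_signal = np.mean(signal ** 2)
    p_noise = p_signal / (10 ** (snr_db / 10))   # SNR = 10 log10(Ps/Pn)
    return signal + rng.normal(0.0, np.sqrt(p_noise), size=signal.shape)

rng = np.random.default_rng(3)
clean = np.sin(2 * np.pi * np.arange(16000) * 440 / 16000)  # 1 s test tone
for snr in (-5, 5, 25):
    noisy = add_noise(clean, snr, rng)
    measured = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
    print(snr, round(measured, 1))   # measured SNR tracks the target
```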

Experimental results (3/3)

[Figure: misclassification rate vs. signal-to-noise ratio (-5 dB to 25 dB) for the audio-only, visual-only and audio-visual classifiers]

Conclusion

Visual speech in itself does not contain sufficient information for speech recognition, but by combining visual and audio speech features we are able to achieve better performance than is possible with audio-only ASR.

Future work

- Speech features are not i.i.d. (hidden Markov models) (sprint)
- Audio and visual speech are asynchronous (dynamic Bayesian networks) (GrMPy)
- Adaptive stream weighting

The end

Thank you! Any questions?

