Audio-Visual Automatic Speech Recognition


Audio-Visual Automatic Speech Recognition

Helge Reikeras

June 30, 2010
SciPy 2010: Python for Scientific Computing Conference

Outline: Introduction, Acoustic speech, Visual speech, Modeling, Experimental results, Conclusion

Introduction 1/2

What? Integration of the audio and visual speech modalities with the purpose of enhancing speech recognition performance.

Why?
- McGurk effect (e.g. a visual /ga/ combined with an audio /ba/ is heard as /da/)
- Performance increase in noisy environments
- Progress in speech recognition seems to be stagnating

Introduction 2/2

Example: YouTube automatic captions

Acoustic speech: MFCCs (1/2)

Mel-frequency cepstrum coefficients (MFCCs): the cosine transform of the logarithm of the short-term energy spectrum of a signal, expressed on the mel-frequency scale. The result is a set of coefficients that approximates the way the human auditory system perceives sound.
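The computation described above can be sketched in NumPy/SciPy. The talk itself used scikits.talkbox.features.mfcc; this minimal single-frame version, with hypothetical helper names, only illustrates the mel filterbank, logarithm, and cosine-transform steps:

```python
import numpy as np
from scipy.fftpack import dct

def mel(f):
    """Convert frequency in Hz to the mel scale."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    mel_points = np.linspace(mel(0), mel(sr / 2), n_filters + 2)
    hz_points = 700.0 * (10.0 ** (mel_points / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_points / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def mfcc_frame(frame, sr, n_filters=26, n_ceps=13):
    """MFCCs for one windowed frame: power spectrum -> mel energies -> log -> DCT."""
    n_fft = len(frame)
    spectrum = np.abs(np.fft.rfft(frame)) ** 2          # short-term power spectrum
    energies = mel_filterbank(n_filters, n_fft, sr) @ spectrum
    log_e = np.log(energies + 1e-10)                    # log mel energies
    return dct(log_e, type=2, norm='ortho')[:n_ceps]    # cosine transform

sr = 16000
t = np.arange(0, 0.032, 1 / sr)                         # one 32 ms frame
frame = np.hamming(len(t)) * np.sin(2 * np.pi * 440 * t)
print(mfcc_frame(frame, sr).shape)
```

In practice MFCCs are computed over overlapping frames of a longer signal, often with delta features appended.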

Acoustic speech: MFCCs (2/2)

[Figure: speech waveform (amplitude vs. time, 0-2.5 sec) and the corresponding MFCC features]

Visual speech: Active appearance models (1/3)

Visual speech information is mainly contained in the motion of visible articulators such as the lips, tongue and jaw.

Active appearance models (shape) (2/3)

    s = s_0 + \sum_{i=1}^{N} p_i s_i    (PCA)

Active appearance models (appearance) (3/3)

    A(x) = A_0(x) + \sum_{i=1}^{M} \lambda_i A_i(x),  x \in s_0    (PCA)
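The shape model on the previous slide can be sketched with NumPy. The training shapes here are random stand-ins and all names are illustrative; a real AAM would use annotated landmark coordinates:

```python
import numpy as np

# Hypothetical training set: 50 shapes, each 20 landmarks (x, y) flattened.
rng = np.random.default_rng(0)
shapes = rng.normal(size=(50, 40))

# Mean shape s_0 and PCA modes s_i via SVD of the centered data.
s0 = shapes.mean(axis=0)
U, S, Vt = np.linalg.svd(shapes - s0, full_matrices=False)
n_modes = 5
basis = Vt[:n_modes]            # rows are the shape modes s_i

# Synthesize a shape: s = s_0 + sum_i p_i * s_i
p = np.zeros(n_modes)
p[0] = 2.0
s = s0 + p @ basis
print(s.shape)
```

The appearance model A(x) is built the same way, with PCA applied to pixel intensities sampled inside the base shape instead of landmark coordinates.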

Facial feature tracking (1/2)

Minimize the difference between the AAM and the input image (warped onto the base shape s_0). The warp W(x; p) is a piecewise affine transformation (triangulated base shape). This is a nonlinear least squares problem:

    \arg\min_{\lambda, p} \sum_{x \in s_0} \left[ A_0(x) + \sum_{i=1}^{M} \lambda_i A_i(x) - I(W(x; p)) \right]^2

Solve using nonlinear numerical optimization methods.
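A toy 1-D analogue of this objective can be fit with scipy.optimize.least_squares, substituting a simple shift for the piecewise affine warp; all signals and names here are hypothetical, chosen only to show the structure of the residual:

```python
import numpy as np
from scipy.optimize import least_squares

# Template A0 plus one appearance mode A1; the "image" I is the template
# warped by a shift true_p and mixed with weight true_lam.
x = np.linspace(0, 2 * np.pi, 200, endpoint=False)
A0 = np.sin(x)                                   # base appearance
A1 = np.cos(x)                                   # one appearance mode
true_p, true_lam = 0.7, 0.3
I = np.sin(x + true_p) + true_lam * np.cos(x + true_p)

def residual(theta):
    """A0(x) + lam*A1(x) - I(W(x; p)) with W a plain shift."""
    lam, p = theta
    warped = np.interp(x + p, x, I, period=2 * np.pi)
    return A0 + lam * A1 - warped

sol = least_squares(residual, x0=[0.0, 0.0])
print(sol.x)
```

The recovered parameters are close to (0.3, -0.7): the shift that undoes the warp and the weight of the appearance mode. The real 2-D problem has the same shape, just with a piecewise affine warp and many shape/appearance parameters.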

Facial feature tracking (2/2)

[Figure: facial feature tracking results]

Modeling: Gaussian mixture models

Gaussian mixture models (GMMs) provide a powerful method for modeling data distributions: a weighted linear combination of Gaussian distributions.

    p(x) = \sum_{k=1}^{K} \pi_k N(x | \mu_k, \Sigma_k)

Data: x. Model parameters: weights \pi, means \mu, covariances \Sigma.

Expectation maximization (EM) (1/2)

The log likelihood function gives the likelihood of the data X = {x_1, x_2, ..., x_N} given the GMM model parameters:

    \ln p(X | \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left( \sum_{k=1}^{K} \pi_k N(x_n | \mu_k, \Sigma_k) \right)

EM is an iterative algorithm for maximizing the log likelihood function w.r.t. the GMM parameters.
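A minimal 1-D EM loop in NumPy makes the two alternating steps concrete (the data and initial values are synthetic; the talk's real features are higher-dimensional):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic 1-D data from two Gaussians.
X = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])

K = 2
pi = np.full(K, 1 / K)            # mixture weights
mu = np.array([-1.0, 1.0])        # means
var = np.ones(K)                  # variances

def log_lik():
    dens = pi * np.exp(-0.5 * (X[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return np.sum(np.log(dens.sum(axis=1)))

prev = -np.inf
for _ in range(100):
    # E-step: responsibilities gamma_{nk} = p(component k | x_n)
    dens = pi * np.exp(-0.5 * (X[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate pi, mu, var from responsibility-weighted data
    Nk = gamma.sum(axis=0)
    pi = Nk / len(X)
    mu = (gamma * X[:, None]).sum(axis=0) / Nk
    var = (gamma * (X[:, None] - mu) ** 2).sum(axis=0) / Nk
    cur = log_lik()
    assert cur >= prev - 1e-8     # EM never decreases the log likelihood
    prev = cur

print(np.sort(mu))
```

The recovered means end up close to the true values (-2 and 3), and the in-loop assertion checks EM's monotonicity guarantee.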

Expectation maximization (EM) (2/2)

[Figure: EM-GMM (16 mixture components) fitted to visual speech features p_3 vs. p_4]

(Note that in practice we use more than 2-dimensional feature vectors.)

Variational Bayesian (VB) inference (1/2)

How do we choose the number of Gaussian mixture components?

VB differs from EM in that the parameters are modeled as random variables. Suitable conjugate priors for the GMM parameters are:
- Weights: Dirichlet
- Means: Gaussian
- Covariances (precisions): Wishart

This avoids overfitting and singular solutions (when a Gaussian collapses onto a single data point), and leads to automatic model complexity selection.
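The talk used its own vb module; a modern way to reproduce the behavior is scikit-learn's BayesianGaussianMixture (an assumption on my part, not the code from the talk). With a Dirichlet prior on the weights, superfluous components are driven toward zero weight:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(2)
# Two true clusters, but deliberately over-provision 16 components.
X = np.concatenate([rng.normal(-2, 1, (300, 1)), rng.normal(3, 0.5, (200, 1))])

vb = BayesianGaussianMixture(
    n_components=16,
    weight_concentration_prior_type='dirichlet_distribution',  # Dirichlet prior on pi
    max_iter=500,
    random_state=0,
).fit(X)

# Unneeded components keep near-zero weight: model complexity is learned.
effective = np.sum(vb.weights_ > 0.01)
print(effective)
```

This mirrors the figure on the next slide: of 16 components, only the few that the data supports retain appreciable weight.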

Variational Bayesian (VB) inference (2/2)

[Figure: VB-GMM (16 mixture components) fitted to visual speech features p_3 vs. p_4]

The remaining components have converged to their prior distributions and been assigned zero weights.

Audio-visual fusion

Acoustic GMM: p(x_A | c)
Visual GMM: p(x_V | c)
Classification (e.g. words or phonemes) combines the streams with exponents \lambda_A, \lambda_V:

    Score(x_AV | c) = p(x_A | c)^{\lambda_A} p(x_V | c)^{\lambda_V}

    0 \le \lambda_A, \lambda_V \le 1,  \lambda_A + \lambda_V = 1

Learn the stream weights discriminatively: minimize the misclassification rate on a development set.
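In the log domain the score is just a weighted sum, which is how it is usually implemented; the per-class likelihoods below are made-up numbers for illustration:

```python
import numpy as np

def av_score(log_p_audio, log_p_visual, lam_a):
    """Log of p(x_A|c)^lam_a * p(x_V|c)^lam_v with lam_a + lam_v = 1."""
    lam_v = 1.0 - lam_a
    return lam_a * log_p_audio + lam_v * log_p_visual

# Hypothetical per-class log likelihoods for one observation, 3 classes.
log_pa = np.log(np.array([0.7, 0.2, 0.1]))   # acoustic GMM scores
log_pv = np.log(np.array([0.3, 0.6, 0.1]))   # visual GMM scores

for lam_a in (1.0, 0.5, 0.0):
    c = np.argmax(av_score(log_pa, log_pv, lam_a))
    print(lam_a, c)   # chosen class shifts as the stream weight moves
```

Sweeping lam_a over a development set and picking the value with the lowest misclassification rate is one simple way to learn the weights discriminatively.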

Summary

AUDIO → acoustic feature extraction → audio-only speech recognition
VIDEO → face detection → facial feature tracking → visual feature extraction → visual-only speech recognition
Both feature streams combined → audio-visual speech recognition

Python implementation

Implemented in Python using SciPy (the open source scientific computing Python library). Signal processing, computer vision and machine learning are active areas of development in the SciPy community.

SciPy modules used:
- scikits.talkbox.features.mfcc (MFCCs)
- scikits.image (image processing)
- scipy.optimize.fmin_ncg (facial feature tracking)
- scipy.learn.em (EM)

New modules developed as part of this research:
- vb (VB inference)
- aam (AAMs)

Experimental results (1/3)

Using the Clemson University audio-visual experiments (CUAVE) database. It contains video of 36 speakers, 19 male and 17 female, uttering isolated and connected digits in frontal and profile views and while moving.

Experimental results (2/3)

- Use separate training, development and test data sets (1/3, 1/3, 1/3).
- Add acoustic noise ranging from -5 dB to 25 dB.
- Test audio-only, visual-only and audio-visual classifiers at different levels of acoustic noise.
- Evaluate performance based on misclassification rate.
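Adding noise at a target signal-to-noise ratio can be sketched as follows (white Gaussian noise here; the talk does not specify the noise type, so treat this as an assumption):

```python
import numpy as np

def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise scaled to a target SNR in dB."""
    p_signal = np.mean(signal ** 2)
    p_noise = p_signal / (10 ** (snr_db / 10))   # SNR = 10 log10(Ps/Pn)
    return signal + rng.normal(0.0, np.sqrt(p_noise), size=signal.shape)

rng = np.random.default_rng(3)
clean = np.sin(2 * np.pi * np.arange(16000) * 440 / 16000)  # 1 s test tone
for snr in (-5, 5, 25):
    noisy = add_noise(clean, snr, rng)
    measured = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
    print(snr, round(measured, 1))   # measured SNR tracks the target
```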

Experimental results (3/3)

[Figure: misclassification rate vs. signal-to-noise ratio (-5 dB to 25 dB) for the audio-only, visual-only and audio-visual classifiers]

Conclusion

Visual speech in itself does not contain sufficient information for speech recognition, but by combining visual and audio speech features we are able to achieve better performance than is possible with audio-only ASR.

Future work

- Speech features are not i.i.d. (hidden Markov models) (sprint)
- Audio and visual speech are asynchronous (dynamic Bayesian networks) (GrMPy)
- Adaptive stream weighting

The end

Thank you! Any questions?

