Lecture 1 - Introduction/Signal Processing, Part I


Lecture 1: Introduction/Signal Processing, Part I
Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen, Markus Nussbaum-Thom
Watson Group, IBM T.J. Watson Research Center, Yorktown Heights, New York
January 2016

Part I: Introduction

Three Questions
Why are you taking this course?
What do you think you might learn?
How do you think this may help you in the future?

What Is Speech Recognition?
Converting speech to text (STT), a.k.a. automatic speech recognition (ASR).
What it’s not:
Natural language understanding — e.g., Siri.
Speech synthesis — converting text to speech (TTS), e.g., Watson.
Speaker recognition — identifying who is speaking.

Why Is Speech Recognition Important?

Because It’s Fast

modality  method                                rate (words/min)
sound     speech                                150–200
sight     sign language; gestures               100–150
touch     typing; mousing                       60
taste     dipping self in different flavorings  1
smell     spraying self with perfumes           1

Because It’s Easier to Process Text than Audio

Because It’s Hands-Free

Because It’s a Natural Form of Communication

Key Applications
Transcription: archiving/indexing audio.
Legal; medical; television and movies.
Call centers.
Whenever you interact with a computer . . . without sitting in front of one.
e.g., smart or dumb phone; car; home entertainment.
Accessibility.
People who can’t type, or type slowly.
The hard of hearing.

Why Study Speech Recognition?
Learn a lot about many popular machine learning techniques.
They all originated in speech.
Be exposed to a real problem with real data — no artificial ingredients.
Learn how to build a complex end-to-end system.
Toto, we aren’t in Kansas anymore!
Not solved yet, so maybe you will be inspired to make it your life’s work — like we have!

Where Are We?
1. Course Overview
2. Speech Recognition from 10,000 Feet Up
3. A Brief History of Speech Recognition
4. Speech Production and Perception

Who Are We?
Stanley F. Chen: Productive Researcher
Markus Nussbaum-Thom: Productive Researcher
Bhuvana Ramabhadran: Useless Manager
Michael Picheny: Even More Useless Senior Manager
We are all from the Watson Multimodal Group located at the IBM T.J. Watson Research Center in Yorktown Heights, NY.

What is the Watson Group?

Why Four Professors?
Too much knowledge to fit in one brain:
Signal processing.
Probability and statistics.
Phonetics; linguistics.
Natural language processing.
Machine learning; artificial intelligence.
Automata theory.
Optimization.

How To Contact Us
In E-mail, prefix subject line with “EECS E6870:”!!!
Michael Picheny — picheny@us.ibm.com
Bhuvana Ramabhadran — bhuvana@us.ibm.com
Stanley F. Chen — stanchen@us.ibm.com
Markus Nussbaum-Thom — nussbaum@us.ibm.com
Office hours: right after class; before class by appointment.
TA: TBD.
Courseworks: for posting questions about labs.

Course Outline
Topics, in order, over 14 weeks (plus recess and study days):
Introduction
Signal processing; DTW
Gaussian mixture models
Hidden Markov models
Language modeling 101
Pronunciation modeling
Training speech recognition systems
The search problem
The search problem, continued
Language modeling 201
Robustness and adaptation
Discriminative training, ROVER and consensus
Neural networks 101
Neural networks 201
Project presentations
Labs 1–5 are assigned and come due in staggered fashion across the term; the final project is due at the end.

Programming Assignments
80% of grade.
Some short written questions.
Write key parts of a basic large vocabulary continuous speech recognition system.
Only the “fun” parts.
C++ code infrastructure provided by us.
Get an account on the ILAB computer cluster (x86 Linux PC’s).
Login to the cluster using ssh.
Can’t run labs on your own PC’s/Mac’s.
If not yet signed up for the course, but going to add:
Fill out an index card with name, UNI, and E-mail address.
Or E-mail this info to stanchen@us.ibm.com.

Final Project
20% of grade.
Option 1: Reading project (individual).
Pick paper(s) from a provided list, or propose your own.
Write a 1500–2500 word paper reviewing and analyzing the paper(s).
Option 2: Programming/experimental project (group).
Pick a project from a provided list, or propose your own.
Group gives a 10–15 minute presentation summarizing the project and writes a paper.
Can count as 40% of grade (if that helps your grade).

Readings
PDF versions of readings will be available on the web site.
Recommended text:
Speech Synthesis and Recognition, Holmes, 2nd edition (paperback, 256 pp., 2001) [Holmes].
Reference texts:
Theory and Applications of Digital Speech Processing, Rabiner, Schafer (hardcover, 1056 pp., 2010) [R+S].
Speech and Language Processing, Jurafsky, Martin (2nd edition, hardcover, 1024 pp., 2008) [J+M].
Statistical Methods for Speech Recognition, Jelinek (hardcover, 305 pp., 1998) [Jelinek].
Spoken Language Processing, Huang, Acero, Hon (paperback, 1008 pp., 2001) [HAH].

Web Site
tinyurl.com/e6870s16 → www.ee.columbia.edu/~stanchen/spring16/e6870/
Syllabus.
Slides from lectures (PDF). Online after each lecture. Save trees — no hardcopies!
Lab assignments (PDF).
Reading assignments (PDF). Online by the lecture they are assigned.
Username: speech, password: pythonrules.

Prerequisites
Basic knowledge of probability and statistics.
Willingness to implement algorithms in C++.
Only basic features of C++ used; about 100 lines/lab.
Basic knowledge of Unix or Linux.
Knowledge of digital signal processing optional.
Helpful for understanding the signal processing lectures; i.e., CS majors may find the signal processing material baffling!
Not needed for the labs!

Help Us Help You
Feedback questionnaire after each lecture (2 questions).
Feedback welcome any time.
You, the student, are partially responsible for the quality of the course.
Please ask questions anytime!
EE’s may find the CS parts challenging, and vice versa.
Together, we can get through this.
Let’s go!

Where Are We?
1. Course Overview
2. Speech Recognition from 10,000 Feet Up
3. A Brief History of Speech Recognition
4. Speech Production and Perception

What is the basic goal?
Recognize as many words correctly as possible.
Use those algorithms that lower the word error rate (WER).
WER is an imperfect, but very useful and simple-to-measure, objective criterion.

Why is this difficult? (Part I)

A Thousand Times No!

Why is this difficult? (Part II)

Basic Concepts

Historical Developments

Where Are We?
1. Course Overview
2. Speech Recognition from 10,000 Feet Up
3. A Brief History of Speech Recognition
4. Speech Production and Perception

The Early Years: 1950–1960’s
Ad hoc methods.
Many key ideas introduced; not used all together.
e.g., spectral analysis; statistical training; language modeling.
Small vocabulary: digits; yes/no; vowels.
Not tested with many speakers (usually 10).

The Birth of Modern ASR: 1970–1980’s
“Every time I fire a linguist, the performance of the speech recognizer goes up.”
— Fred Jelinek, IBM
Ignore (almost) everything we know about phonetics and linguistics.
View speech recognition as finding the most probable word sequence given the audio.
Train probabilities automatically with transcribed speech.

The Birth of Modern ASR: 1970–1980’s
Many key algorithms developed/refined:
expectation-maximization algorithm; n-gram models; Gaussian mixtures; hidden Markov models; Viterbi decoding; etc.
Computing power still catching up to the algorithms.
First real-time dictation system built in 1984 (IBM).
Specialized hardware required — it had the computation power of a 60 MHz Pentium.

The Golden Years: 1990’s–now

                      1994     now
CPU speed             60 MHz   3 GHz
training data         10h      10000h
output distributions  GMM      NN/GMM hybrids
sequence modeling     HMM      HMM and/or NN
language models       n-gram   n-gram and NN

Basic algorithms have remained similar, but we are now seeing huge penetration of NN technologies.
Significant performance gains can also be attributed to the presence of more data, faster CPU’s, and more run-time memory.

Person vs. Machine (Lippmann, 1997)

task                 machine  human   ratio
Connected Digits¹    0.72%    0.009%  80
Letters²             5.0%     1.6%    3
Resource Management  3.6%     0.1%    36
WSJ                  7.2%     0.9%    8
Switchboard          43%      4.0%    11

For humans, one system fits all; for machines, not.
Today: Switchboard WER is about 8%. But that is with 2000 hours of SWB training data; can’t assume this is always available.
¹ String error rates.
² Isolated letters presented to humans; continuous speech for machine.

Commercial Speech Recognition
1995–1998 — first large vocabulary speaker-dependent dictation systems.
1996–2005 — first telephony-based customer assistance systems.
2003–2007 — first automotive interactive systems.
2008–2010 — first voice search systems.
2011–today — growth of cloud-based speech services.

What’s left?
Accents.
Noise.
Far-field microphones.
Informal speech.

Are We Awake?
Of the time you spend interacting with devices . . .
What fraction do you use ASR?
What fraction would it be if ASR were perfect?
What are the biggest problems with current ASR performance?

The First Two Lectures
A little background on speech production and perception.
Signal processing — extract features from the audio that discriminate between different words.
Normalize for volume, pitch, voice quality, noise, . . . .
Dynamic time warping — handling time/rate variation.

Where Are We?
1. Course Overview
2. Speech Recognition from 10,000 Feet Up
3. A Brief History of Speech Recognition
4. Speech Production and Perception

Data-Driven vs. Knowledge-Driven
Don’t ignore everything we know about speech and language.
Knowledge/concepts that have proved useful:
words; phonemes.
A little bit of human production/perception.
Knowledge/concepts that haven’t proved useful (yet):
nouns; vowels; syllables; voice onset time; . . .

Finding Good Features
Extract features from the audio that help determine word identity.
What are good types of features?
Instantaneous air pressure at time t?
Loudness at time t?
Energy or phase for frequency ω at time t?
Estimated position of the speaker’s lips at time t?
Look at human production and perception for insight.
Also, introduce some basic speech terminology.

Speech Production
Air comes out of the lungs.
Vocal cords are tensed (they vibrate, producing voicing) or relaxed (unvoiced).
The sound is modulated by the vocal tract (glottis to lips); it resonates.
Articulators: jaw, tongue, velum, lips, mouth.

Speech Consists Of a Few Primitive Sounds?
Phonemes: 40 to 50 for English.
Speaker/dialect differences, e.g., do MARY, MARRY, and MERRY rhyme?
Phone: acoustic realization of a phoneme. May be realized differently based on context.
Allophones: different ways a phoneme can be realized.
e.g., P in SPIN, PIN are two different allophones of P:

spelling  phonemes
SPIN      S P IH N
PIN       P IH N

e.g., T in BAT, BATTER; A in BAT, BAD.

Classes of Speech Sounds
Can categorize phonemes by how they are produced.
Voicing: e.g., F (unvoiced), V (voiced). All vowels are voiced.
Stops/plosives: oral cavity blocked (e.g., lips, velum), then opened. e.g., P, B (lips).

Classes of Speech Sounds
A spectrogram shows the energy at each frequency over time.
Voiced sounds have pitch (F0) and formants (F1, F2, F3).
Very highly trained humans can do recognition on spectrograms with high accuracy, so this is a valid representation.

Classes of Speech Sounds
What can the machine do? Here is a sample on TIMIT:

Classes of Speech Sounds
Vowels — EE, AH, etc. Differ in the locations of formants.
Diphthongs — transition between two vowels (e.g., COY, COW).
Consonants:
fricatives — F, V, S, Z, SH, J.
stops/plosives — P, T, B, D, G, K.
nasals — N, M, NG.
semivowels (liquids, glides) — W, L, R, Y.

Coarticulation
The realization of a phoneme can differ very much depending on context (allophones).
Where the articulators were for the last phone affects how they transition to the next.

Speech Production and ASR
Directly use features from acoustic phonetics?
e.g., (inferred) location of articulators; voicing; formant frequencies.
In practice, this has not been made to work.
Still, it influences how signal processing is done.
Source-filter model: separate the excitation from the modulation by the vocal tract.
e.g., the frequency of excitation can be ignored (for English).

Speech Perception and ASR
As it turns out, the features that work well are motivated more by speech perception than production.
e.g., Mel-frequency cepstral coefficients (MFCC), motivated by human perception of pitch.
Similarly for perceptual linear prediction (PLP).

Speech Perception — Physiology
Sound enters the ear and is converted to vibrations in the cochlear fluid.
In the fluid is the basilar membrane, with 30,000 little hairs.
These are sensitive to different frequencies (band-pass filters).

Speech Perception — Physiology
Human physiology is used as justification for the frequency analysis ubiquitous in speech processing.
We have limited knowledge of higher-level processing.
Can glean insight from psychophysical experiments (the relationship between physical stimuli and human responses).

Speech Perception — Psychophysics
Sound pressure level (SPL), in dB: SPL = 20 log10(P/P0).
P0 = threshold of hearing at 1 kHz (it varies!).
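
A quick worked example (ours, not from the slide): doubling the sound pressure adds about 6 dB, since

    20 log10(2P/P0) = 20 log10(P/P0) + 20 log10(2) ≈ SPL + 6.02 dB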

Speech Perception — Psychophysics
Humans have different sensitivity to different frequencies.
Equal loudness contours: subjects adjust the volume of a tone to match the volume of another tone at a different pitch.
Tells us what range of frequencies may be good to focus on.

Speech Perception — Psychophysics
Human perception of distance between frequencies.
Adjust the pitch of one tone until it is twice/half the pitch of another tone.
Mel scale — frequencies equally spaced on the Mel scale are equally spaced according to human perception:

Mel frequency = 2595 log10(1 + freq/700)
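
Below is a minimal sketch of the Mel mapping above and its inverse (our own function names; not part of the course’s lab code):

    #include <cmath>
    #include <cstdio>

    // Hz to Mel: mel = 2595 * log10(1 + f/700).
    double hz_to_mel(double f_hz) {
        return 2595.0 * std::log10(1.0 + f_hz / 700.0);
    }

    // Inverse mapping: f = 700 * (10^(mel/2595) - 1).
    double mel_to_hz(double mel) {
        return 700.0 * (std::pow(10.0, mel / 2595.0) - 1.0);
    }

    int main() {
        // 1000 Hz maps to roughly 1000 Mels by construction of the scale.
        std::printf("1000 Hz = %.1f Mel\n", hz_to_mel(1000.0));
        std::printf("4000 Hz = %.1f Mel\n", hz_to_mel(4000.0));
        return 0;
    }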

Speech Perception — Machine
Just as human physiology has its quirks, so does machine “physiology”.
Sources of distortion:
Microphone — different response based on the direction and frequency of the sound.
Sampling frequency — e.g., 8 kHz sampling for landlines throws away all frequencies above 4 kHz.
Analog/digital conversion — need to convert to digital with sufficient precision (8–16 bits).
Lossy compression — e.g., cellular telephones, VOIP.

Speech Perception — Machine
Input distortion can still be a significant problem:
Mismatched conditions between train/test.
Low bandwidth — telephone, cellular.
Cheap equipment — e.g., mikes in handheld devices.
Enough said.

Are We Awake?
Sometimes it helps to mimic nature; sometimes not (e.g., airplanes and flying).
Which way should be best for ASR in the long run?
Does it make more sense to mimic human speech production or perception?
Why do humans have two ears, and what does this mean for ASR?

Segue
Now that we have seen what humans do, let’s discuss what signal processing has been found to work well empirically.
It has been tuned over decades.
We start with some mathematical background.

Part II: Signal Processing Basics

Overview
Background material: how to mathematically model/analyze human speech production and perception.
Introduction to signals and systems.
Basic properties of linear systems.
Introduction to Fourier analysis.
Next week: discussion of the actual features used in ASR.
Recommended readings: [HAH] pp. 201–223, 242–245; [R+J] pp. 69–91. All figures are taken from these texts.

Speech Production
The sound pressure modulations can be captured by a microphone, converted to an electrical signal, and then digitized, creating a sequence of numbers we call a “signal”.

Signals and Systems
Signal: a function x[n] over time n.
e.g., the output of a microphone attached to an A/D converter.
[figure: plot of a digitized speech waveform]
A digital system (or filter) H takes an input signal x[n] and produces a signal y[n]:

y[n] = H(x[n])

What do we need to do to this signal to be useful for speech recognition?
Model the signal as being generated from a set of time-varying physiological variables (vocal tract geometry, glottal vibration, lip radiation, etc.) and extract these variables from the signal.
Or operate on the signal to mimic some of the processing done in the auditory system, for example, frequency analysis.
Either way we want to make things simple, so we will focus on linear processing.

Linear Time-Invariant Systems
Calculating the output of H for an input signal x becomes very simple if the digital system H satisfies two basic properties.
H is linear if

H(a1 x1[n] + a2 x2[n]) = a1 H(x1[n]) + a2 H(x2[n])

H is time-invariant if

y[n − n0] = H(x[n − n0])

i.e., a shift in the time axis of x produces the same output, except for a time shift.
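
As a concrete illustration (ours, not from the slides), a 3-point moving average is a simple LTI system, and linearity can be checked numerically:

    #include <cstdio>
    #include <vector>

    // 3-point moving average: y[n] = (x[n] + x[n-1] + x[n-2]) / 3.
    // Linear (scaled/summed inputs give scaled/summed outputs) and
    // time-invariant (the coefficients do not depend on n).
    std::vector<double> moving_average(const std::vector<double>& x) {
        std::vector<double> y(x.size(), 0.0);
        for (size_t n = 0; n < x.size(); ++n) {
            double sum = x[n];
            if (n >= 1) sum += x[n - 1];
            if (n >= 2) sum += x[n - 2];
            y[n] = sum / 3.0;
        }
        return y;
    }

    int main() {
        std::vector<double> x1 = {1, 2, 3, 4}, x2 = {0, 1, 0, -1};
        // Linearity: H(2*x1 + 3*x2) should equal 2*H(x1) + 3*H(x2).
        std::vector<double> mix(x1.size());
        for (size_t n = 0; n < x1.size(); ++n) mix[n] = 2 * x1[n] + 3 * x2[n];
        std::vector<double> lhs = moving_average(mix);
        std::vector<double> y1 = moving_average(x1), y2 = moving_average(x2);
        for (size_t n = 0; n < lhs.size(); ++n)
            std::printf("%g vs %g\n", lhs[n], 2 * y1[n] + 3 * y2[n]);
        return 0;
    }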

Linear Time-Invariant (LTI) Systems
Let H be an LTI system. Define

h[n] = H(δ[n]),  where δ[0] = 1 and δ[n] = 0 for n ≠ 0

h[n] is called the impulse response of the system.
Then, by the LTI properties, H(x[n]) can be written as

y[n] = Σ_k x[k] h[n − k] = Σ_k x[n − k] h[k]

The above is also known as convolution and is written as

y[n] = x[n] * h[n]

So if you know the impulse response of an LTI system, it is easy to calculate the output for any input.
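
A direct implementation of the convolution sum, as a small sketch (ours; the labs provide their own C++ infrastructure):

    #include <cstdio>
    #include <vector>

    // Linear convolution: y[n] = sum_k x[k] * h[n-k].
    // The output has length len(x) + len(h) - 1.
    std::vector<double> convolve(const std::vector<double>& x,
                                 const std::vector<double>& h) {
        std::vector<double> y(x.size() + h.size() - 1, 0.0);
        for (size_t k = 0; k < x.size(); ++k)
            for (size_t m = 0; m < h.size(); ++m)
                y[k + m] += x[k] * h[m];
        return y;
    }

    int main() {
        // Convolving with the unit impulse returns the signal unchanged.
        std::vector<double> x = {1, 2, 3}, delta = {1};
        for (double v : convolve(x, delta)) std::printf("%g ", v);  // 1 2 3
        std::printf("\n");
        return 0;
    }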

Fourier Analysis
Moving towards more meaningful features.
Time domain: x[n] = air pressure at time n.
Frequency domain: X(ω) = energy at frequency ω.
As we discussed earlier, energy as a function of frequency is what seems to be useful for speech recognition.
This is very easy to compute when dealing with LTI systems.
Can express (almost) any signal x[n] as a sum of sinusoids.
Let X(ω) be the coefficient for the sinusoid with frequency ω.
Given x[n], can compute X(ω) efficiently, and vice versa.
Time and frequency domain representations are equivalent.
The Fourier transform converts between representations.

Fourier Series Illustration

Review: Complex Exponentials
Math is simpler using complex exponentials.
Euler’s formula:

e^{jω} = cos ω + j sin ω

A sinusoid with frequency ω and phase φ:

cos(ωn + φ) = Re(e^{j(ωn + φ)})

The Fourier Transform
The discrete-time Fourier transform (DTFT) is defined as

X(ω) = Σ_{n=−∞}^{∞} x[n] e^{−jωn}

Note: this is a complex quantity.
The inverse Fourier transform is defined as

x[n] = (1/2π) ∫_{−π}^{π} X(ω) e^{jωn} dω

The DTFT exists and is invertible as long as Σ_n |x[n]| < ∞.
Can apply the DTFT to an impulse response as well: h[n] ↔ H(ω).
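
For a finite-length signal (zero outside the stored samples), the DTFT sum can be evaluated directly; a minimal sketch (our own illustration):

    #include <cmath>
    #include <complex>
    #include <cstdio>
    #include <vector>

    const double kPi = 3.141592653589793;

    // X(w) = sum_n x[n] e^{-jwn}, for a signal that is zero outside 0..N-1.
    std::complex<double> dtft(const std::vector<double>& x, double w) {
        std::complex<double> X(0.0, 0.0);
        for (size_t n = 0; n < x.size(); ++n)
            X += x[n] * std::exp(std::complex<double>(0.0, -w * double(n)));
        return X;
    }

    int main() {
        std::vector<double> x = {1, 1, 1, 1};  // length-4 rectangular pulse
        for (double w : {0.0, kPi / 2.0, kPi})
            std::printf("|X(%.2f)| = %.3f\n", w, std::abs(dtft(x, w)));
        return 0;
    }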

The Z-Transform
One can generalize the discrete-time Fourier transform to

X(z) = Σ_{n=−∞}^{∞} x[n] z^{−n}

where z is any complex variable. The Fourier transform is just the z-transform evaluated at z = e^{jω}.
The z-transform concept allows us to analyze a large range of signals, even those whose integrals are unbounded. We will primarily just use it as a notational convenience, though.

The Convolution Theorem
Apply system H to signal x to get signal y: y[n] = x[n] * h[n]. Then

Y(z) = Σ_n y[n] z^{−n} = Σ_n ( Σ_k x[k] h[n − k] ) z^{−n}
     = Σ_k x[k] ( Σ_n h[n − k] z^{−n} )
     = Σ_k x[k] ( Σ_n h[n] z^{−(n + k)} )
     = ( Σ_k x[k] z^{−k} ) H(z) = X(z) · H(z)

The Convolution Theorem (cont’d)
Duality between time and frequency domains:

DTFT(x[n] * y[n]) = DTFT(x) · DTFT(y)
DTFT(x[n] · y[n]) = DTFT(x) * DTFT(y)

i.e., convolution in the time domain is the same as multiplication in the frequency domain, and vice versa.
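
A small numeric check of the theorem (our own illustration), repeating the naive DTFT and convolution sketches so the example is self-contained: the DTFT of x convolved with h should equal X(ω) · H(ω) at any frequency.

    #include <cmath>
    #include <complex>
    #include <cstdio>
    #include <vector>

    using cd = std::complex<double>;

    cd dtft(const std::vector<double>& x, double w) {
        cd X(0.0, 0.0);
        for (size_t n = 0; n < x.size(); ++n)
            X += x[n] * std::exp(cd(0.0, -w * double(n)));
        return X;
    }

    std::vector<double> convolve(const std::vector<double>& x,
                                 const std::vector<double>& h) {
        std::vector<double> y(x.size() + h.size() - 1, 0.0);
        for (size_t k = 0; k < x.size(); ++k)
            for (size_t m = 0; m < h.size(); ++m)
                y[k + m] += x[k] * h[m];
        return y;
    }

    int main() {
        std::vector<double> x = {1, 2, 3}, h = {0.5, 0.5};
        double w = 0.7;  // any frequency works
        cd lhs = dtft(convolve(x, h), w);  // DTFT of the convolution
        cd rhs = dtft(x, w) * dtft(h, w);  // product of the DTFTs
        std::printf("lhs = %g%+gi, rhs = %g%+gi\n",
                    lhs.real(), lhs.imag(), rhs.real(), rhs.imag());
        return 0;
    }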

The Discrete Fourier Transform (DFT)
The preceding analysis assumes infinite signals: n = −∞, . . . , ∞.
In reality, we can assume signals x[n] are finite and of length N (n = 0, . . . , N − 1). Then we can define the DFT as

X[k] = Σ_{n=0}^{N−1} x[n] e^{−j2πkn/N}

where we have replaced ω in the DTFT with 2πk/N.
The DFT is just a discrete-frequency version of the DTFT and is needed for any sort of digital processing.
The DFT is equivalent to a Fourier series expansion of a periodic version of x[n].
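
To make the definition concrete, here is a naive O(N²) DFT sketch (ours, not course code):

    #include <cmath>
    #include <complex>
    #include <cstdio>
    #include <vector>

    using cd = std::complex<double>;
    const double kPi = 3.141592653589793;

    // X[k] = sum_n x[n] e^{-j 2 pi k n / N}, computed directly.
    std::vector<cd> dft(const std::vector<double>& x) {
        size_t N = x.size();
        std::vector<cd> X(N);
        for (size_t k = 0; k < N; ++k)
            for (size_t n = 0; n < N; ++n)
                X[k] += x[n] *
                    std::exp(cd(0.0, -2.0 * kPi * double(k) * double(n) / double(N)));
        return X;
    }

    int main() {
        // One cycle of a cosine: the energy lands in bins k = 1 and k = N-1.
        size_t N = 8;
        std::vector<double> x(N);
        for (size_t n = 0; n < N; ++n) x[n] = std::cos(2.0 * kPi * double(n) / double(N));
        std::vector<cd> X = dft(x);
        for (size_t k = 0; k < N; ++k)
            std::printf("|X[%zu]| = %.3f\n", k, std::abs(X[k]));
        return 0;
    }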

The Discrete Fourier Transform (cont’d)
The inverse of the DFT is

x[n] = (1/N) Σ_{k=0}^{N−1} X[k] e^{j2πkn/N}

To verify, substitute the definition of X[k]:

(1/N) Σ_{k=0}^{N−1} [ Σ_{m=0}^{N−1} x[m] e^{−j2πkm/N} ] e^{j2πkn/N}
    = (1/N) Σ_{m=0}^{N−1} x[m] Σ_{k=0}^{N−1} e^{j2πk(n−m)/N}

The last sum on the right is N for m = n and 0 otherwise, so the entire right side is just x[n].

The Fast Fourier Transform
Note that the computation of

X[k] = Σ_{n=0}^{N−1} x[n] e^{−j2πkn/N} = Σ_{n=0}^{N−1} x[n] W_N^{nk},  where W_N = e^{−j2π/N},

for k = 0, . . . , N − 1 requires O(N²) operations.
Let f[n] = x[2n] and g[n] = x[2n + 1]. Then we have

X[k] = Σ_{n=0}^{N/2−1} f[n] W_{N/2}^{nk} + W_N^k Σ_{n=0}^{N/2−1} g[n] W_{N/2}^{nk} = F[k] + W_N^k G[k]

where F[k] and G[k] are the N/2-point DFTs of f and g. Applying this even/odd split recursively reduces the total work to O(N log N).
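
A minimal recursive radix-2 FFT following the even/odd split above (our own sketch, assuming the input length is a power of 2; not the course’s lab code):

    #include <cmath>
    #include <complex>
    #include <cstdio>
    #include <vector>

    using cd = std::complex<double>;
    const double kPi = 3.141592653589793;

    // Radix-2 decimation-in-time FFT; N must be a power of 2.
    std::vector<cd> fft(const std::vector<cd>& x) {
        size_t N = x.size();
        if (N == 1) return x;
        std::vector<cd> f(N / 2), g(N / 2);
        for (size_t n = 0; n < N / 2; ++n) {
            f[n] = x[2 * n];      // even-indexed samples
            g[n] = x[2 * n + 1];  // odd-indexed samples
        }
        std::vector<cd> F = fft(f), G = fft(g);
        std::vector<cd> X(N);
        for (size_t k = 0; k < N / 2; ++k) {
            cd w = std::exp(cd(0.0, -2.0 * kPi * double(k) / double(N)));  // W_N^k
            X[k] = F[k] + w * G[k];
            X[k + N / 2] = F[k] - w * G[k];  // W_N^{k + N/2} = -W_N^k
        }
        return X;
    }

    int main() {
        std::vector<cd> x = {1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0};
        std::vector<cd> X = fft(x);
        for (size_t k = 0; k < X.size(); ++k)
            std::printf("|X[%zu]| = %.3f\n", k, std::abs(X[k]));
        return 0;
    }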

