Speech Recognition: Statistical Methods

L. R. Rabiner, Rutgers University, New Brunswick, NJ, USA and University of California, Santa Barbara, CA, USA
B.-H. Juang, Georgia Institute of Technology, Atlanta, GA, USA
© 2006 Elsevier Ltd. All rights reserved.

Introduction

The goal of getting a machine to understand fluently spoken speech and respond in a natural voice has been driving speech research for more than 50 years. Although the personification of an intelligent machine, such as HAL in the movie 2001: A Space Odyssey or R2D2 in the Star Wars series, has been around for more than 35 years, we are still not at the point where machines reliably understand fluent speech, spoken by anyone, and in any acoustic environment. In spite of the remaining technical problems that need to be solved, the fields of automatic speech recognition and understanding have made tremendous advances, and the technology is now readily available and used on a day-to-day basis in a number of applications and services – especially those conducted over the public-switched telephone network (PSTN) (Cox et al., 2000). This article aims at reviewing the technology that has made these applications possible.

Speech recognition and language understanding are two major research thrusts that have traditionally been approached as problems in linguistics and acoustic phonetics, where a range of acoustic-phonetic knowledge has been brought to bear on the problem with remarkably little success. In this article, however, we focus on statistical methods for speech and language processing, where the knowledge about a speech signal and the language that it expresses, together with practical uses of that knowledge, is developed from actual realizations of speech data through a well-defined mathematical and statistical formalism. We review how the statistical methods are used for speech recognition and language understanding, show current performance on a number of task-specific applications and services, and discuss the challenges that remain to be solved before the technology becomes ubiquitous.

The Speech Advantage

There are fundamentally three major reasons why so much research and effort has gone into the problem of trying to teach machines to recognize and understand fluent speech:

- Cost reduction. Among the earliest goals for speech recognition systems was to replace humans performing certain simple tasks with automated machines, thereby reducing labor expenses while still providing customers with a natural and convenient way to access information and services. One simple example of a cost-reduction system was the Voice Recognition Call Processing (VRCP) system introduced by AT&T in 1992 (Roe et al., 1996), which essentially automated so-called operator-assisted calls, such as person-to-person calls, reverse-billing calls, third-party billing calls, collect calls (by far the most common class of such calls), and operator-assisted calls. The resulting automation eliminated about 6600 jobs, while providing a quality of service that matched or exceeded that provided by the live attendants, saving AT&T on the order of $300 million per year.
- New revenue opportunities. Speech recognition and understanding systems enabled service providers to have a 24/7 high-quality customer care automation capability, without the need for access to information by keyboard or touch-tone button pushes.
An example of such a service was the How May I Help You (HMIHY) service introduced by AT&T late in 2000 (Gorin et al., 1996), which automated the customer care for AT&T Consumer Services. This system will be discussed further in the section on speech understanding. A second example of such a service was the NTT ANSER service for voice banking in Japan (Sugamura et al., 1994), which enabled Japanese banking customers to access bank account records from an ordinary telephone without having to go to the bank. (Of course, today we utilize the Internet for such information, but in 1981, when this system was introduced, the only way to access such records was a physical trip to the bank and a wait in line to speak to a banking clerk.)
- Customer retention. Speech recognition provides the potential for personalized services based on customer preferences, and thereby the potential to improve the customer experience. A trivial example of such a service is the voice-controlled automotive environment that recognizes the identity of the driver from voice commands and adjusts the automobile's features (seat position, radio station, mirror positions, etc.) to suit the customer's preference (which is established in an enrollment session).

The Speech Dialog Circle

When we consider the problem of communicating with a machine, we must consider the cycle of events that occurs between a spoken utterance (as part of a dialog between a person and a machine) and the response to that utterance from the machine. Figure 1 shows such a sequence of events, which is often referred to as the speech dialog circle, using an example in the telecommunications context.

Figure 1 The conventional speech dialog circle.

The customer initially makes a request by speaking an utterance that is sent to a machine, which attempts to recognize, on a word-by-word basis, the spoken speech. The process of recognizing the words in the speech is called automatic speech recognition (ASR) and its output is an orthographic representation of the recognized spoken input. The ASR process will be discussed in the next section. Next the spoken words are analyzed by a spoken language understanding (SLU) module, which attempts to attribute meaning to the spoken words. The meaning that is attributed is in the context of the task being handled by the speech dialog system. (What is described here is traditionally referred to as a limited-domain understanding system or application.) Once meaning has been determined, the dialog management (DM) module examines the state of the dialog according to a prescribed operational workflow and determines the course of action that would be most appropriate to take. The action may be as simple as a request for further information or confirmation of an action that is taken. Thus, if there were confusion as to how best to proceed, a text query would be generated by the spoken language generation module to hopefully clarify the meaning and help determine what to do next. The query text is then sent to the final module, the text-to-speech synthesis (TTS) module, and converted into intelligible and highly natural speech, which is sent to the customer, who decides what to say next based on what action was taken, or based on previous dialogs with the machine. All of the modules in the speech dialog circle can be 'data-driven' in both the learning and active use phases, as indicated by the central Data block in Figure 1.

A typical task scenario, e.g., booking an airline reservation, requires navigating the speech dialog circle many times – each time being referred to as one 'turn' – to complete a transaction. (The average number of turns a machine takes to complete a prescribed task is a measure of the effectiveness of the machine in many applications.) Hopefully, each time through the dialog circle enables the customer to get closer to the desired action, either via proper understanding of the spoken request or via a series of clarification steps. The speech dialog circle is a powerful concept in modern speech recognition and understanding systems, and is at the heart of most speech understanding systems that are in use today.

Basic ASR Formulation

The goal of an ASR system is to accurately and efficiently convert a speech signal into a text message transcription of the spoken words, independent of the device used to record the speech (i.e., the transducer or microphone), the speaker, or the environment.

A simple model of the speech generation process, as used to convey a speaker's intention, is shown in Figure 2.

Figure 2 Model of spoken speech.

It is assumed that the speaker decides what to say and then embeds the concept in a sentence, W, which is a sequence of words (possibly with pauses and other acoustic events such as uh's, um's, er's, etc.). The speech production mechanisms then produce a speech waveform, s(n), which embodies the words of W as well as the extraneous sounds and pauses in the spoken input. A conventional automatic speech recognizer attempts to decode the speech, s(n), into the best estimate of the sentence, Ŵ, using a two-step process, as shown in Figure 3.

Figure 3 ASR decoder from speech to sentence.

The first step in the process is to convert the speech signal, s(n), into a sequence of spectral feature vectors, X, where the feature vectors are measured every 10 ms (or so) throughout the duration of the speech signal. The second step in the process is to use a syntactic decoder to generate every possible valid sentence (as a sequence of orthographic representations) in the task language, and to evaluate the score (i.e., the a posteriori probability of the word string given the realized acoustic signal as measured by the feature vector) for each such string, choosing as the recognized string, Ŵ, the one with the highest score. This is the so-called maximum a posteriori probability (MAP) decision principle, originally suggested by Bayes. Additional linguistic processing can be done to try to determine side information about the speaker, such as the speaker's intention, as indicated in Figure 3.

Mathematically, we seek to find the string Ŵ that maximizes the a posteriori probability of that string, given the measured feature vector X, i.e.,

\hat{W} = \arg\max_W P(W|X)

Using Bayes' Law, we can rewrite this expression as:

\hat{W} = \arg\max_W \frac{P(X|W) P(W)}{P(X)}

Thus, calculation of the a posteriori probability is decomposed into two main components, one that defines the a priori probability of a word sequence W, P(W), and the other the likelihood of the word string W in producing the measured feature vector, P(X|W). (We disregard the denominator term, P(X), since it is independent of the unknown W.) The latter is referred to as the acoustic model, P_A(X|W), and the former the language model, P_L(W) (Rabiner et al., 1996; Gauvain and Lamel, 2003). We note that these quantities are not given directly, but instead are usually estimated or inferred from a set of training data that have been labeled by a knowledge source, i.e., a human expert. The decoding equation is then rewritten as:

\hat{W} = \arg\max_W P_A(X|W) P_L(W)

We explicitly write the sequence of feature vectors (the acoustic observations) as:

X = x_1, x_2, \ldots, x_N

where the speech signal duration is N frames (or N times 10 ms when the frame shift is 10 ms). Similarly, we explicitly write the optimally decoded word sequence as:

\hat{W} = w_1 w_2 \ldots w_M

where there are M words in the decoded string. The above decoding equation defines the fundamental statistical approach to the problem of automatic speech recognition.
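In code form, and assuming for illustration that the candidate sentences can be enumerated explicitly (a real recognizer searches this space implicitly, as discussed below), the decoding rule is the following minimal sketch, where the two scoring functions stand in for trained acoustic and language models:

import math

def map_decode(hypotheses, acoustic_logprob, language_logprob):
    # Return the word string W maximizing log P_A(X|W) + log P_L(W).
    # hypotheses, acoustic_logprob, and language_logprob are illustrative
    # stand-ins for a task grammar and trained models.
    best_W, best_score = None, -math.inf
    for W in hypotheses:
        score = acoustic_logprob(W) + language_logprob(W)  # log-domain product
        if score > best_score:
            best_W, best_score = W, score
    return best_W, best_score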

It can be seen that there are three steps to the basic ASR formulation, namely:

- Step 1: acoustic modeling for assigning probabilities to acoustic (spectral) realizations of a sequence of words. For this step we use a statistical model (called the hidden Markov model or HMM) of the acoustic signals of either individual words or subword units (e.g., phonemes) to compute the quantity P_A(X|W). We train the acoustic models from a training set of speech utterances, which have been appropriately labeled to establish the statistical relationship between X and W.
- Step 2: language modeling for assigning probabilities, P_L(W), to sequences of words that form valid sentences in the language and are consistent with the recognition task being performed. We train such language models from generic text sequences, or from transcriptions of task-specific dialogues. (Note that a deterministic grammar, as is used in many simple tasks, can be considered a degenerate form of a statistical language model. The 'coverage' of a deterministic grammar is the set of permissible word sequences, i.e., expressions that are deemed legitimate.)
- Step 3: hypothesis search, whereby we find the word sequence with the maximum a posteriori probability by searching through all possible word sequences in the language.

In step 1, acoustic modeling (Young, 1996; Rabiner et al., 1986), we train a set of acoustic models for the words or sounds of the language by learning the statistics of the acoustic features, X, for each word or sound, from a speech training set, where we compute the variability of the acoustic features during the production of the words or sounds, as represented by the models. For large vocabulary tasks, it is impractical to create a separate acoustic model for every possible word in the language since it requires far too much training data to measure the variability in every possible context. Instead, we train a set of about 50 acoustic-phonetic subword models for the approximately 50 phonemes in the English language, and construct a model for a word by concatenating (stringing together sequentially) the models for the constituent subword sounds in the word, as defined in a word lexicon or dictionary (where multiple pronunciations are allowed). Similarly, we build sentences (sequences of words) by concatenating word models. Since the actual pronunciation of a phoneme may be influenced by neighboring phonemes (those occurring before and after the phoneme), a set of so-called context-dependent phoneme models is often used as the speech models, as long as sufficient data are collected for proper training of these models.

In step 2, the language model (Jelinek, 1997; Rosenfeld, 2000) describes the probability of a sequence of words that form a valid sentence in the task language. A simple statistical method works well, based on a Markovian assumption, namely that the probability of a word in a sentence is conditioned on only the previous N-1 words, namely an N-gram language model, of the form:

P_L(W) = P_L(w_1, w_2, \ldots, w_M) = \prod_{m=1}^{M} P_L(w_m | w_{m-1}, w_{m-2}, \ldots, w_{m-N+1})

where P_L(w_m | w_{m-1}, w_{m-2}, \ldots, w_{m-N+1}) is estimated by simply counting up the relative frequencies of N-tuples in a large corpus of text.
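Under this assumption, scoring a sentence reduces to a product of conditional probabilities, as in the following minimal sketch, where ngram_prob stands in for a smoothed, trained model:

import math

def ngram_sentence_logprob(words, ngram_prob, n=3):
    # log P_L(W) = sum over m of log P_L(w_m | w_{m-1}, ..., w_{m-N+1});
    # ngram_prob(history, word) is assumed to return a smoothed (nonzero)
    # conditional probability estimated from a text corpus.
    logp = 0.0
    for m, w in enumerate(words):
        history = tuple(words[max(0, m - (n - 1)):m])  # shorter at sentence start
        logp += math.log(ngram_prob(history, w))
    return logp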
In step 3, the search problem (Ney, 1984; Paul, 2001) is one of searching the space of all valid sound sequences, conditioned on the word grammar, the language syntax, and the task constraints, to find the word sequence with the maximum likelihood. The size of the search space can be astronomically large and take inordinate amounts of computing power to solve by heuristic methods. The use of methods from the field of finite state automata theory provides finite state networks (FSNs) (Mohri, 1997), along with an associated search policy based on dynamic programming, that reduce the computational burden by orders of magnitude, thereby enabling exact solutions in computationally feasible times for large speech recognition problems.
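At the heart of that dynamic-programming search policy is the Viterbi recursion, sketched below on a toy model with assumed log-domain transition and emission scores; the weighted finite-state machinery of Mohri (1997) builds network composition and optimization on top of this basic idea:

import numpy as np

def viterbi(log_A, log_B):
    # Dynamic-programming search for the single best state path.
    #   log_A : (M, M) log transition scores between states
    #   log_B : (N, M) log emission scores, one row per 10-ms frame
    # Runs in O(N * M^2) time rather than scoring all M**N paths.
    N, M = log_B.shape
    delta = np.empty((N, M))
    psi = np.zeros((N, M), dtype=int)
    delta[0] = log_B[0]  # flat initial-state scores, up to a constant
    for t in range(1, N):
        scores = delta[t - 1][:, None] + log_A  # scores[i, j]: arrive at j from i
        psi[t] = np.argmax(scores, axis=0)      # best predecessor of each state
        delta[t] = scores[psi[t], np.arange(M)] + log_B[t]
    path = [int(np.argmax(delta[-1]))]          # best final state, then backtrack
    for t in range(N - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]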

Development of a Speech Recognition System for a Task or an Application

Before going into more detail on the various aspects of the process of automatic speech recognition by machine, we review the three steps that must occur in order to define, train, and build an ASR system (Juang et al., 1995; Kamm and Helander, 1997). These steps are the following:

- Step 1: choose the recognition task. Specify the word vocabulary for the task, the set of units that will be modeled by the acoustic models (e.g., whole words, phonemes, etc.), the word pronunciation lexicon (or dictionary) that describes the variations in word pronunciation, the task syntax (grammar), and the task semantics. By way of example, for a simple speech recognition system capable of recognizing a spoken credit card number using isolated digits (i.e., single digits spoken one at a time), the sounds to be recognized are either whole words or the set of subword units that appear in the digits /zero/ to /nine/ plus the word /oh/. The word vocabulary is the set of 11 digits. The task syntax allows any single digit to be spoken, and the task semantics specify that a sequence of isolated digits must form a valid credit card code for identifying the user.
- Step 2: train the models. Create a method for building acoustic word models (or subword models) from a labeled speech training data set of multiple occurrences of each of the vocabulary words by one or more speakers. We also must use a text training data set to create a word lexicon (dictionary) describing the ways that each word can be pronounced (assuming we are using subword units to characterize individual words), a word grammar (or language model) that describes how words are concatenated to form valid sentences (i.e., credit card numbers), and finally a task grammar that describes which valid word strings are meaningful in the task application (e.g., valid credit card numbers).
- Step 3: evaluate recognizer performance. We need to determine the word error rate and the task error rate for the recognizer on the desired task. For an isolated digit recognition task, the word error rate is just the isolated digit error rate, whereas the task error rate would be the number of credit card errors that lead to misidentification of the user. Evaluation of the recognizer performance often includes an analysis of the types of recognition errors made by the system. This analysis can lead to revision of the task in a number of ways, ranging from changing the vocabulary words or the grammar (i.e., to eliminate highly confusable words) to the use of word spotting, as opposed to word transcription. As an example, in limited vocabulary applications, if the recognizer encounters frequent confusions between words like 'freight' and 'flight,' it may be advisable to change 'freight' to 'cargo' to maximize its distinction from 'flight.' Revision of the task grammar often becomes necessary if the recognizer experiences substantial amounts of what are called 'out of grammar' (OOG) utterances, namely the use of words and phrases that are not directly included in the task vocabulary (ISCA, 2001).

The Speech Recognition Process

In this section, we provide some technical aspects of a typical speech recognition system. Figure 4 shows a block diagram of a speech recognizer that follows the Bayesian framework discussed above.

Figure 4 Framework of ASR system.

The recognizer consists of three processing steps, namely feature analysis, pattern matching, and confidence scoring, along with three trained databases: the set of acoustic models, the word lexicon, and the language model. In this section, we briefly describe each of the processing steps and each of the trained model databases.

Feature Analysis

The goal of feature analysis is to extract a set of salient features that characterize the spectral properties of the various speech sounds (the subword units) and that can be efficiently measured. The 'standard' feature set for speech recognition is a set of mel-frequency cepstral coefficients (MFCCs) (which perceptually match some of the characteristics of the spectral analysis done in the human auditory system) (Davis and Mermelstein, 1980), along with the first- and second-order derivatives of these features. Typically about 13 MFCCs and their first and second derivatives (Furui, 1981) are calculated every 10 ms, leading to a spectral vector with 39 coefficients every 10 ms.
A block diagram of a typical feature analysis process is shown in Figure 5.

Figure 5 Block diagram of feature analysis computation.

The speech signal is sampled and quantized, pre-emphasized by a first-order (highpass) digital filter with pre-emphasis factor a (to reduce the influence of glottal coupling and lip radiation on the estimated vocal tract characteristics), segmented into frames, windowed, and then a spectral analysis is performed using a fast Fourier transform (FFT) (Rabiner and Gold, 1975) or linear predictive coding (LPC) method (Atal and Hanauer, 1971; Markel and Gray, 1976). The frequency conversion from a linear frequency scale to a mel frequency scale is performed in the filtering block, followed by cepstral analysis yielding the MFCCs (Davis and Mermelstein, 1980), equalization to remove any bias and to normalize the cepstral coefficients (Rahim and Juang, 1996), and finally the computation of first- and second-order (via temporal derivative) MFCCs, completing the feature extraction process.
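This pipeline can be sketched with an off-the-shelf front end; the version below uses the open-source librosa toolkit, where the toolkit choice, file name, and parameter values are illustrative assumptions rather than part of any particular system:

import numpy as np
import librosa

# a 16-kHz recording; the file name is illustrative
signal, sr = librosa.load("utterance.wav", sr=16000)

# pre-emphasis: first-order highpass with a typical factor a = 0.97
emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

# 13 MFCCs every 10 ms (hop of 160 samples at 16 kHz), 25-ms analysis window
mfcc = librosa.feature.mfcc(y=emphasized, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)

# first- and second-order temporal derivatives ('delta' features)
d1 = librosa.feature.delta(mfcc)
d2 = librosa.feature.delta(mfcc, order=2)

# stack into 39 coefficients per frame; subtracting the per-utterance mean
# is a simple form of the equalization (bias removal) step described above
features = np.vstack([mfcc, d1, d2])
features -= features.mean(axis=1, keepdims=True)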

Acoustic Models

The goal of acoustic modeling is to characterize the statistical variability of the feature set determined above for each of the basic sounds (or words) of the language. Acoustic modeling uses probability measures to characterize sound realization using statistical models. A statistical method, known as the hidden Markov model (HMM) (Levinson et al., 1983; Ferguson, 1980; Rabiner, 1989; Rabiner and Juang, 1985), is used to model the spectral variability of each of the basic sounds of the language using a mixture density Gaussian distribution (Juang et al., 1986; Juang, 1985), which is optimally aligned with a speech training set and iteratively updated and improved (the means, variances, and mixture gains are iteratively updated) until an optimal alignment and match is achieved.

Figure 6 shows a simple three-state HMM for modeling the subword unit /s/ as spoken at the beginning of the word /six/. Each HMM state is characterized by a probability density function (usually a mixture Gaussian density) that characterizes the statistical behavior of the feature vectors at the beginning (state s1), middle (state s2), and end (state s3) of the sound /s/. In order to train the HMM for each subword unit, we use a labeled training set of words and sentences and utilize an efficient training procedure known as the Baum-Welch algorithm (Rabiner, 1989; Baum, 1972; Baum et al., 1970) to align each of the various subword units with the spoken inputs, and then estimate the appropriate means, covariances, and mixture gains for the distributions in each subword unit state. The algorithm is a hill-climbing algorithm and is iterated until a stable alignment of subword unit models and speech is obtained, enabling the creation of stable models for each subword unit.

Figure 6 Three-state HMM for the sound /s/.

Figure 7 shows how a simple two-sound word, 'is,' which consists of the sounds /ih/ and /z/, is created by concatenating the models (Lee, 1989) for the /ih/ sound with the model for the /z/ sound, thereby creating a six-state model for the word 'is.'

Figure 7 Concatenated model for the word 'is.'
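The concatenation itself is a simple operation on the transition structure, as in the sketch below, which assumes three-state left-to-right topologies with illustrative transition values (trained emission densities are omitted):

import numpy as np

def left_to_right_hmm(n_states=3):
    # A left-to-right subword HMM skeleton: each state has a self-loop
    # and a forward transition; the last state keeps 0.5 exit mass for
    # whatever follows (the next unit, or the end of the word).
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        A[i, i] = 0.5
        if i + 1 < n_states:
            A[i, i + 1] = 0.5
    return A

def concatenate(models):
    # Chain subword HMMs (e.g., /ih/ then /z/) into one word model.
    n = sum(m.shape[0] for m in models)
    A = np.zeros((n, n))
    off = 0
    for m in models:
        k = m.shape[0]
        A[off:off + k, off:off + k] = m
        if off + k < n:
            A[off + k - 1, off + k] = 0.5  # exit of one unit feeds the next
        off += k
    return A

# six-state transition matrix for the word 'is' = /ih/ + /z/
word_is = concatenate([left_to_right_hmm(), left_to_right_hmm()])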

Figure 8 shows how an HMM can be used to characterize a whole-word model (Lee et al., 1989). In this case, the word is modeled as a sequence of M = 5 HMM states, where each state is characterized by a mixture density, denoted as b_j(x_t), where the model state is the index j, the feature vector at time t is denoted as x_t, and the mixture density is of the form:

b_j(x_t) = \sum_{k=1}^{K} c_{jk} N[x_t; \mu_{jk}, U_{jk}]

where:

x_t = (x_{t1}, x_{t2}, \ldots, x_{tD}), with D = 39
K = number of mixture components in the density function
c_{jk} = weight of the kth mixture component in state j, with c_{jk} \ge 0
N = Gaussian density function
\mu_{jk} = mean vector for mixture k, state j
U_{jk} = covariance matrix for mixture k, state j

\sum_{k=1}^{K} c_{jk} = 1, \quad 1 \le j \le M

\int_{-\infty}^{\infty} b_j(x_t) \, dx_t = 1, \quad 1 \le j \le M

Figure 8 HMM for whole word model with five states.

Included in Figure 8 is an explicit set of state transitions, a_{ij}, which specify the probability of making a transition from state i to state j at each frame, thereby defining the time sequence of the feature vectors over the duration of the word. Usually the self-transitions, a_{ii}, are large (close to 1.0), and the skip-state transitions, a_{13}, a_{24}, a_{35}, are small (close to 0).

Once the set of state transitions and state probability densities are specified, we say that a model λ (which is also used to denote the set of parameters that define the probability measure) has been created for the word or subword unit. (The model λ is often written as λ(A, B, π) to explicitly denote the model parameters, namely A = {a_{ij}, 1 \le i, j \le M}, which is the state transition matrix, B = {b_j(x_t), 1 \le j \le M}, which is the state observation probability density, and π = {π_i, 1 \le i \le M}, which is the initial state distribution.) In order to optimally train the various models (for each word unit [Lee et al., 1989] or subword unit [Lee, 1989]), we need algorithms that perform the following three steps or tasks (Rabiner and Juang, 1985) using the acoustic observation sequence, X, and the model λ:

a. likelihood evaluation: compute P(X|λ)
b. decoding: choose the optimal state sequence for a given speech utterance
c. re-estimation: adjust the parameters of λ to maximize P(X|λ).

Each of these three steps is essential to defining the optimal HMM models for speech recognition based on the available training data, and each task, if approached in a brute-force manner, would be computationally costly. Fortunately, efficient algorithms have been developed to enable efficient and accurate solutions to each of the three steps that must be performed to train and utilize HMM models in a speech recognition system. These are generally referred to as the forward-backward algorithm or the Baum-Welch re-estimation method (Levinson et al., 1983). Details of the Baum-Welch procedure are beyond the scope of this article. The heart of the training procedure for re-estimating model parameters using the Baum-Welch procedure is shown in Figure 9.

Figure 9 The Baum-Welch training procedure.
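As a minimal illustration of the mixture density and of task (a), the sketch below evaluates b_j(x_t) for one state and computes P(X|λ) with the forward recursion; it works in the probability domain for clarity, whereas practical systems use scaling or log arithmetic to avoid underflow on long utterances:

import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, c, mu, U):
    # b_j(x_t) = sum over k of c_jk * N[x_t; mu_jk, U_jk] for one state j
    return sum(c_k * multivariate_normal.pdf(x, mean=mu_k, cov=U_k)
               for c_k, mu_k, U_k in zip(c, mu, U))

def forward_likelihood(A, B, pi):
    # Task (a): P(X | lambda) via the forward recursion.
    #   A  : (M, M) transition matrix a_ij
    #   B  : (N, M) emission likelihoods, B[t, j] = b_j(x_t)
    #   pi : (M,)  initial state distribution
    alpha = pi * B[0]
    for t in range(1, B.shape[0]):
        alpha = (alpha @ A) * B[t]
    return alpha.sum()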
Recently, the fundamental statistical method, while successful for a range of conditions, has been augmented with a number of techniques that attempt to further enhance the recognition accuracy and make the recognizer more robust to different talkers, background noise conditions, and channel effects. One family of such techniques focuses on transformation of the observed or measured features. The transformation is motivated by the need for vocal tract length normalization (e.g., reducing the impact of differences in vocal tract length of various speakers). Another such transformation (called the maximum likelihood linear regression method) can be embedded in the statistical model to account for a potential mismatch between the statistical characteristics of the training data and the actual unknown utterances to be recognized. Yet another family of techniques (e.g., the discriminative training method based on minimum classification error [MCE] or maximum mutual information [MMI]) aims at direct minimization of the recognition error during the parameter optimization stage.

Word Lexicon

The purpose of the word lexicon, or dictionary, is to define the range of pronunciation of words in the task vocabulary (Jurafsky and Martin, 2000; Riley et al., 1999). Such a word lexicon is necessary because the same orthography can be pronounced differently by people with different accents, or because a word has multiple meanings that change its pronunciation according to the context of its use. For example, the word 'data' can be pronounced as /d/ /ae/ /t/ /ax/ or as /d/ /ey/ /t/ /ax/, and we would need both pronunciations in the dictionary to properly train the recognizer models and to properly recognize the word when spoken by different individuals. Another example of variability in pronunciation from orthography is the word 'record,' which can be either a disk that goes on a player, or the process of capturing and storing a signal (e.g., audio or video). The different meanings have significantly different pronunciations. As in the statistical language model, the word lexicon (consisting of sequences of symbols) can be associated with probability assignments, resulting in a probabilistic word lexicon.
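In its simplest form, such a lexicon is just a mapping from orthography to admissible phone sequences, as in the sketch below; the entries are illustrative, and a probabilistic lexicon would attach a weight to each variant:

# a lexicon fragment; phone symbols follow the ARPAbet-style notation
# used above, and the 'record' transcriptions are illustrative
LEXICON = {
    "data":   [("d", "ae", "t", "ax"),
               ("d", "ey", "t", "ax")],
    "record": [("r", "eh", "k", "er", "d"),        # the noun (a disk)
               ("r", "ix", "k", "ao", "r", "d")],  # the verb (to capture)
}

def pronunciations(word):
    # all admissible phone sequences for a word (empty if out of vocabulary)
    return LEXICON.get(word.lower(), [])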

Language Model

The purpose of the language model (Rosenfeld, 2000; Jelinek et al., 1991), or grammar, is to provide a task syntax that defines acceptable spoken input sentences and enables the computation of the probability of the word string, W, given the language model, i.e., P_L(W). There are several methods of creating word grammars, including the use of rule-based systems (i.e., deterministic grammars that are knowledge driven), and statistical methods that compute an estimate of word probabilities from large training sets of textual material. We describe the way in which a statistical N-gram word grammar is constructed from a large training set of text.

Assume we have a large text training set of labeled words. Thus for every sentence in the training set, we have a text file that identifies the words in that sentence. If we consider the class of N-gram word grammars, then we can estimate the word probabilities from the labeled text training set using counting methods. Thus to estimate word trigram probabilities (that is, the probability that a word w_i was preceded by the pair of words (w_{i-1}, w_{i-2})), we compute this quantity as:

P(w_i | w_{i-1}, w_{i-2}) = \frac{C(w_{i-2}, w_{i-1}, w_i)}{C(w_{i-2}, w_{i-1})}

where C(w_{i-2}, w_{i-1}, w_i) is the frequency count of the word triplet (i.e., trigram) (w_{i-2}, w_{i-1}, w_i) that occurred in the training set, and C(w_{i-2}, w_{i-1}) is the frequency count of the word duplet (i.e., bigram) (w_{i-2}, w_{i-1}) that occurred in the training set.

Although the method of training N-gram word grammars, as described above, generally works quite well, it suffers from the problem that the counts of N-grams are often highly in error due to problems of data sparseness in the training set. Hence for a text training set of millions of words, and a word vocabulary of several thousand words, more than 50% of word trigrams are likely to occur either once or not at all in the training set. This leads to gross distortions in the computation of the probability of a word string, as required by the basic Bayesian recognition algorithm. In the cases when a word trigram does not occur at all in the training set, it is unacceptable to define the trigram probability as 0 (as would be required by the direct definition above), since this leads to effectively invalidating all strings with that particular trigram from occurring in recognition.
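The raw counting procedure is easy to sketch; the padding symbols and toy corpus below are illustrative, and the zero returned for unseen trigrams is precisely the sparseness problem just described:

from collections import Counter

def train_trigram(sentences):
    # Estimate P(w_i | w_{i-2}, w_{i-1}) = C(w_{i-2}, w_{i-1}, w_i) / C(w_{i-2}, w_{i-1})
    # by relative-frequency counting over a corpus of word-labeled sentences.
    tri, bi = Counter(), Counter()
    for words in sentences:
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            bi[tuple(padded[i - 2:i])] += 1
    def prob(w2, w1, w):
        # raw maximum-likelihood estimate; unseen trigrams get probability 0
        return tri[(w2, w1, w)] / bi[(w2, w1)] if bi[(w2, w1)] else 0.0
    return prob

# toy usage on a two-sentence corpus
p = train_trigram([["call", "home"], ["call", "the", "bank"]])
print(p("<s>", "<s>", "call"))  # 1.0: both training sentences start with 'call'

In practice, smoothing techniques redistribute some probability mass to unseen N-grams so that no valid word string is assigned zero probability.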
