Speech Recognition: Statistical Methods

L. R. Rabiner, Rutgers University, New Brunswick, NJ, USA and University of California, Santa Barbara, CA, USA
B.-H. Juang, Georgia Institute of Technology, Atlanta, GA, USA
© 2006 Elsevier Ltd. All rights reserved.

Introduction

The goal of getting a machine to understand fluently spoken speech and respond in a natural voice has been driving speech research for more than 50 years. Although the personification of an intelligent machine, such as HAL in the movie 2001: A Space Odyssey or R2D2 in the Star Wars series, has been around for more than 35 years, we are still not at the point where machines reliably understand fluent speech, spoken by anyone, and in any acoustic environment. In spite of the remaining technical problems that need to be solved, the fields of automatic speech recognition and understanding have made tremendous advances, and the technology is now readily available and used on a day-to-day basis in a number of applications and services – especially those conducted over the public-switched telephone network (PSTN) (Cox et al., 2000). This article aims at reviewing the technology that has made these applications possible.

Speech recognition and language understanding are two major research thrusts that have traditionally been approached as problems in linguistics and acoustic phonetics, where a range of acoustic-phonetic knowledge has been brought to bear on the problem with remarkably little success. In this article, however, we focus on statistical methods for speech and language processing, where the knowledge about a speech signal and the language that it expresses, together with practical uses of that knowledge, is developed from actual realizations of speech data through a well-defined mathematical and statistical formalism. We review how the statistical methods are used for speech recognition and language understanding, show current performance on a number of task-specific applications and services, and discuss the challenges that remain to be solved before the technology becomes ubiquitous.

The Speech Advantage

There are fundamentally three major reasons why so much research and effort has gone into the problem of trying to teach machines to recognize and understand fluent speech:

- Cost reduction. Among the earliest goals for speech recognition systems was to replace humans performing certain simple tasks with automated machines, thereby reducing labor expenses while still providing customers with a natural and convenient way to access information and services. One simple example of a cost-reduction system was the Voice Recognition Call Processing (VRCP) system introduced by AT&T in 1992 (Roe et al., 1996), which essentially automated so-called operator-assisted calls, such as person-to-person calls, reverse-billing calls, third-party billing calls, collect calls (by far the most common class of such calls), and operator-assisted calls. The resulting automation eliminated about 6600 jobs, while providing a quality of service that matched or exceeded that provided by the live attendants, saving AT&T on the order of $300 million per year.
- New revenue opportunities. Speech recognition and understanding systems enabled service providers to have a 24/7 high-quality customer care automation capability, without the need for access to information by keyboard or touch-tone button pushes.
An example of such a service was the How May I Help You (HMIHY) service introduced by AT&T late in 2000 (Gorin et al., 1996), which automated the customer care for AT&T Consumer Services. This system will be discussed further in the section on speech understanding. A second example of such a service was the NTT ANSER service for voice banking in Japan (Sugamura et al., 1994), which enabled Japanese banking customers to access bank account records from an ordinary telephone without having to go to the bank. (Of course, today we utilize the Internet for such information, but in 1981, when this system was introduced, the only way to access such records was a physical trip to the bank and a wait in line to speak to a banking clerk.)
- Customer retention. Speech recognition provides the potential for personalized services based on customer preferences, and thereby the potential to improve the customer experience. A trivial example of such a service is the voice-controlled automotive environment that recognizes the identity of the driver from voice commands and adjusts the automobile's features (seat position, radio station, mirror positions, etc.) to suit the customer's preference (which is established in an enrollment session).

The Speech Dialog Circle

When we consider the problem of communicating with a machine, we must consider the cycle of events that occurs between a spoken utterance (as part of a dialog between a person and a machine) and the response to that utterance from the machine. Figure 1 shows such a sequence of events, which is often referred to as the speech dialog circle, using an example in the telecommunications context.

Figure 1 The conventional speech dialog circle.

The customer initially makes a request by speaking an utterance that is sent to a machine, which attempts to recognize, on a word-by-word basis, the spoken speech. The process of recognizing the words in the speech is called automatic speech recognition (ASR) and its output is an orthographic representation of the recognized spoken input. The ASR process will be discussed in the next section. Next the spoken words are analyzed by a spoken language understanding (SLU) module, which attempts to attribute meaning to the spoken words. The meaning that is attributed is in the context of the task being handled by the speech dialog system. (What is described here is traditionally referred to as a limited-domain understanding system or application.) Once meaning has been determined, the dialog management (DM) module examines the state of the dialog according to a prescribed operational workflow and determines the course of action that would be most appropriate to take. The action may be as simple as a request for further information or confirmation of an action that is taken. Thus, if there were confusion as to how best to proceed, a text query would be generated by the spoken language generation module to hopefully clarify the meaning and help determine what to do next. The query text is then sent to the final module, the text-to-speech synthesis (TTS) module, and converted into intelligible and highly natural speech, which is sent to the customer, who decides what to say next based on what action was taken, or based on previous dialogs with the machine. All of the modules in the speech dialog circle can be 'data-driven' in both the learning and active use phases, as indicated by the central Data block in Figure 1.

A typical task scenario, e.g., booking an airline reservation, requires navigating the speech dialog circle many times – each time being referred to as one 'turn' – to complete a transaction. (The average number of turns a machine takes to complete a prescribed task is a measure of the effectiveness of the machine in many applications.) Hopefully, each time through the dialog circle enables the customer to get closer to the desired action, either via proper understanding of the spoken request or via a series of clarification steps. The speech dialog circle is a powerful concept in modern speech recognition and understanding systems, and is at the heart of most speech understanding systems that are in use today.

Basic ASR Formulation

The goal of an ASR system is to accurately and efficiently convert a speech signal into a text message transcription of the spoken words, independent of the device used to record the speech (i.e., the transducer or microphone), the speaker, or the environment.

A simple model of the speech generation process, as used to convey a speaker's intention, is shown in Figure 2.

Figure 2 Model of spoken speech.

It is assumed that the speaker decides what to say and then embeds the concept in a sentence, W, which is a sequence of words (possibly with pauses and other acoustic events such as uh's, um's, er's, etc.). The speech production mechanisms then produce a speech waveform, s(n), which embodies the words of W as well as the extraneous sounds and pauses in the spoken input. A conventional automatic speech recognizer attempts to decode the speech, s(n), into the best estimate of the sentence, Ŵ, using a two-step process, as shown in Figure 3.

Figure 3 ASR decoder from speech to sentence.

The first step in the process is to convert the speech signal, s(n), into a sequence of spectral feature vectors, X, where the feature vectors are measured every 10 ms (or so) throughout the duration of the speech signal. The second step in the process is to use a syntactic decoder to generate every possible valid sentence (as a sequence of orthographic representations) in the task language, and to evaluate the score (i.e., the a posteriori probability of the word string given the realized acoustic signal as measured by the feature vector) for each such string, choosing as the recognized string, Ŵ, the one with the highest score. This is the so-called maximum a posteriori probability (MAP) decision principle, originally suggested by Bayes. Additional linguistic processing can be done to try to determine side information about the speaker, such as the speaker's intention, as indicated in Figure 3.

Mathematically, we seek to find the string Ŵ that maximizes the a posteriori probability of that string, given the measured feature vector X, i.e.,

\hat{W} = \arg\max_W P(W|X)

Using Bayes' Law, we can rewrite this expression as:

\hat{W} = \arg\max_W \frac{P(X|W) P(W)}{P(X)}

Thus, calculation of the a posteriori probability is decomposed into two main components, one that defines the a priori probability of a word sequence W, P(W), and the other the likelihood of the word string W in producing the measured feature vector, P(X|W). (We disregard the denominator term, P(X), since it is independent of the unknown W.) The latter is referred to as the acoustic model, P_A(X|W), and the former the language model, P_L(W) (Rabiner et al., 1996; Gauvain and Lamel, 2003). We note that these quantities are not given directly, but instead are usually estimated or inferred from a set of training data that have been labeled by a knowledge source, i.e., a human expert. The decoding equation is then rewritten as:

\hat{W} = \arg\max_W P_A(X|W) P_L(W)

We explicitly write the sequence of feature vectors (the acoustic observations) as:

X = x_1, x_2, \ldots, x_N

where the speech signal duration is N frames (or N times 10 ms when the frame shift is 10 ms). Similarly, we explicitly write the optimally decoded word sequence as:

\hat{W} = w_1 w_2 \ldots w_M

where there are M words in the decoded string. The above decoding equation defines the fundamental statistical approach to the problem of automatic speech recognition.
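In code form, and assuming for illustration that the candidate sentences can be enumerated explicitly (a real recognizer searches this space implicitly, as discussed below), the decoding rule is the following minimal sketch, where the two scoring functions stand in for trained acoustic and language models:

import math

def map_decode(hypotheses, acoustic_logprob, language_logprob):
    # Return the word string W maximizing log P_A(X|W) + log P_L(W).
    # hypotheses, acoustic_logprob, and language_logprob are illustrative
    # stand-ins for a task grammar and trained models.
    best_W, best_score = None, -math.inf
    for W in hypotheses:
        score = acoustic_logprob(W) + language_logprob(W)  # log-domain product
        if score > best_score:
            best_W, best_score = W, score
    return best_W, best_score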

It can be seen that there are three steps to the basic ASR formulation, namely:

- Step 1: acoustic modeling for assigning probabilities to acoustic (spectral) realizations of a sequence of words. For this step we use a statistical model (called the hidden Markov model or HMM) of the acoustic signals of either individual words or subword units (e.g., phonemes) to compute the quantity P_A(X|W). We train the acoustic models from a training set of speech utterances, which have been appropriately labeled to establish the statistical relationship between X and W.
- Step 2: language modeling for assigning probabilities, P_L(W), to sequences of words that form valid sentences in the language and are consistent with the recognition task being performed. We train such language models from generic text sequences, or from transcriptions of task-specific dialogues. (Note that a deterministic grammar, as is used in many simple tasks, can be considered a degenerate form of a statistical language model. The 'coverage' of a deterministic grammar is the set of permissible word sequences, i.e., expressions that are deemed legitimate.)
- Step 3: hypothesis search, whereby we find the word sequence with the maximum a posteriori probability by searching through all possible word sequences in the language.

In step 1, acoustic modeling (Young, 1996; Rabiner et al., 1986), we train a set of acoustic models for the words or sounds of the language by learning the statistics of the acoustic features, X, for each word or sound, from a speech training set, where we compute the variability of the acoustic features during the production of the words or sounds, as represented by the models. For large vocabulary tasks, it is impractical to create a separate acoustic model for every possible word in the language since it requires far too much training data to measure the variability in every possible context. Instead, we train a set of about 50 acoustic-phonetic subword models for the approximately 50 phonemes in the English language, and construct a model for a word by concatenating (stringing together sequentially) the models for the constituent subword sounds in the word, as defined in a word lexicon or dictionary (where multiple pronunciations are allowed). Similarly, we build sentences (sequences of words) by concatenating word models. Since the actual pronunciation of a phoneme may be influenced by neighboring phonemes (those occurring before and after the phoneme), a set of so-called context-dependent phoneme models is often used as the speech models, as long as sufficient data are collected for proper training of these models.

In step 2, the language model (Jelinek, 1997; Rosenfeld, 2000) describes the probability of a sequence of words that form a valid sentence in the task language. A simple statistical method works well, based on a Markovian assumption, namely that the probability of a word in a sentence is conditioned on only the previous N-1 words, namely an N-gram language model, of the form:

P_L(W) = P_L(w_1, w_2, \ldots, w_M) = \prod_{m=1}^{M} P_L(w_m | w_{m-1}, w_{m-2}, \ldots, w_{m-N+1})

where P_L(w_m | w_{m-1}, w_{m-2}, \ldots, w_{m-N+1}) is estimated by simply counting up the relative frequencies of N-tuples in a large corpus of text.
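Under this assumption, scoring a sentence reduces to a product of conditional probabilities, as in the following minimal sketch, where ngram_prob stands in for a smoothed, trained model:

import math

def ngram_sentence_logprob(words, ngram_prob, n=3):
    # log P_L(W) = sum over m of log P_L(w_m | w_{m-1}, ..., w_{m-N+1});
    # ngram_prob(history, word) is assumed to return a smoothed (nonzero)
    # conditional probability estimated from a text corpus.
    logp = 0.0
    for m, w in enumerate(words):
        history = tuple(words[max(0, m - (n - 1)):m])  # shorter at sentence start
        logp += math.log(ngram_prob(history, w))
    return logp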
In step 3, the search problem (Ney, 1984; Paul, 2001) is one of searching the space of all valid sound sequences, conditioned on the word grammar, the language syntax, and the task constraints, to find the word sequence with the maximum likelihood. The size of the search space can be astronomically large and take inordinate amounts of computing power to solve by heuristic methods. The use of methods from the field of finite state automata theory provides finite state networks (FSNs) (Mohri, 1997), along with an associated search policy based on dynamic programming, that reduce the computational burden by orders of magnitude, thereby enabling exact solutions in computationally feasible times for large speech recognition problems.
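At the heart of that dynamic-programming search policy is the Viterbi recursion, sketched below on a toy model with assumed log-domain transition and emission scores; the weighted finite-state machinery of Mohri (1997) builds network composition and optimization on top of this basic idea:

import numpy as np

def viterbi(log_A, log_B):
    # Dynamic-programming search for the single best state path.
    #   log_A : (M, M) log transition scores between states
    #   log_B : (N, M) log emission scores, one row per 10-ms frame
    # Runs in O(N * M^2) time rather than scoring all M**N paths.
    N, M = log_B.shape
    delta = np.empty((N, M))
    psi = np.zeros((N, M), dtype=int)
    delta[0] = log_B[0]  # flat initial-state scores, up to a constant
    for t in range(1, N):
        scores = delta[t - 1][:, None] + log_A  # scores[i, j]: arrive at j from i
        psi[t] = np.argmax(scores, axis=0)      # best predecessor of each state
        delta[t] = scores[psi[t], np.arange(M)] + log_B[t]
    path = [int(np.argmax(delta[-1]))]          # best final state, then backtrack
    for t in range(N - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]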

Development of a Speech Recognition System for a Task or an Application

Before going into more detail on the various aspects of the process of automatic speech recognition by machine, we review the three steps that must occur in order to define, train, and build an ASR system (Juang et al., 1995; Kamm and Helander, 1997). These steps are the following:

- Step 1: choose the recognition task. Specify the word vocabulary for the task, the set of units that will be modeled by the acoustic models (e.g., whole words, phonemes, etc.), the word pronunciation lexicon (or dictionary) that describes the variations in word pronunciation, the task syntax (grammar), and the task semantics. By way of example, for a simple speech recognition system capable of recognizing a spoken credit card number using isolated digits (i.e., single digits spoken one at a time), the sounds to be recognized are either whole words or the set of subword units that appear in the digits /zero/ to /nine/ plus the word /oh/. The word vocabulary is the set of 11 digits. The task syntax allows any single digit to be spoken, and the task semantics specify that a sequence of isolated digits must form a valid credit card code for identifying the user.
- Step 2: train the models. Create a method for building acoustic word models (or subword models) from a labeled speech training data set of multiple occurrences of each of the vocabulary words by one or more speakers. We also must use a text training data set to create a word lexicon (dictionary) describing the ways that each word can be pronounced (assuming we are using subword units to characterize individual words), a word grammar (or language model) that describes how words are concatenated to form valid sentences (i.e., credit card numbers), and finally a task grammar that describes which valid word strings are meaningful in the task application (e.g., valid credit card numbers).
- Step 3: evaluate recognizer performance. We need to determine the word error rate and the task error rate for the recognizer on the desired task. For an isolated digit recognition task, the word error rate is just the isolated digit error rate, whereas the task error rate would be the number of credit card errors that lead to misidentification of the user. Evaluation of the recognizer performance often includes an analysis of the types of recognition errors made by the system. This analysis can lead to revision of the task in a number of ways, ranging from changing the vocabulary words or the grammar (i.e., to eliminate highly confusable words) to the use of word spotting, as opposed to word transcription. As an example, in limited vocabulary applications, if the recognizer encounters frequent confusions between words like 'freight' and 'flight,' it may be advisable to change 'freight' to 'cargo' to maximize its distinction from 'flight.' Revision of the task grammar often becomes necessary if the recognizer experiences substantial amounts of what are called 'out of grammar' (OOG) utterances, namely the use of words and phrases that are not directly included in the task vocabulary (ISCA, 2001).

The Speech Recognition Process

In this section, we provide some technical aspects of a typical speech recognition system. Figure 4 shows a block diagram of a speech recognizer that follows the Bayesian framework discussed above.

Figure 4 Framework of ASR system.

The recognizer consists of three processing steps, namely feature analysis, pattern matching, and confidence scoring, along with three trained databases: the set of acoustic models, the word lexicon, and the language model. In this section, we briefly describe each of the processing steps and each of the trained model databases.

Feature Analysis

The goal of feature analysis is to extract a set of salient features that characterize the spectral properties of the various speech sounds (the subword units) and that can be efficiently measured. The 'standard' feature set for speech recognition is a set of mel-frequency cepstral coefficients (MFCCs) (which perceptually match some of the characteristics of the spectral analysis done in the human auditory system) (Davis and Mermelstein, 1980), along with the first- and second-order derivatives of these features. Typically about 13 MFCCs and their first and second derivatives (Furui, 1981) are calculated every 10 ms, leading to a spectral vector with 39 coefficients every 10 ms.
A block diagram of a typical feature analysis process is shown in Figure 5.

Figure 5 Block diagram of feature analysis computation.

The speech signal is sampled and quantized, pre-emphasized by a first-order (highpass) digital filter with pre-emphasis factor a (to reduce the influence of glottal coupling and lip radiation on the estimated vocal tract characteristics), segmented into frames, windowed, and then a spectral analysis is performed using a fast Fourier transform (FFT) (Rabiner and Gold, 1975) or linear predictive coding (LPC) method (Atal and Hanauer, 1971; Markel and Gray, 1976). The frequency conversion from a linear frequency scale to a mel frequency scale is performed in the filtering block, followed by cepstral analysis yielding the MFCCs (Davis and Mermelstein, 1980), equalization to remove any bias and to normalize the cepstral coefficients (Rahim and Juang, 1996), and finally the computation of first- and second-order (via temporal derivative) MFCCs, completing the feature extraction process.
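This pipeline can be sketched with an off-the-shelf front end; the version below uses the open-source librosa toolkit, where the toolkit choice, file name, and parameter values are illustrative assumptions rather than part of any particular system:

import numpy as np
import librosa

# a 16-kHz recording; the file name is illustrative
signal, sr = librosa.load("utterance.wav", sr=16000)

# pre-emphasis: first-order highpass with a typical factor a = 0.97
emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

# 13 MFCCs every 10 ms (hop of 160 samples at 16 kHz), 25-ms analysis window
mfcc = librosa.feature.mfcc(y=emphasized, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)

# first- and second-order temporal derivatives ('delta' features)
d1 = librosa.feature.delta(mfcc)
d2 = librosa.feature.delta(mfcc, order=2)

# stack into 39 coefficients per frame; subtracting the per-utterance mean
# is a simple form of the equalization (bias removal) step described above
features = np.vstack([mfcc, d1, d2])
features -= features.mean(axis=1, keepdims=True)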

Acoustic Models

The goal of acoustic modeling is to characterize the statistical variability of the feature set determined above for each of the basic sounds (or words) of the language. Acoustic modeling uses probability measures to characterize sound realization using statistical models. A statistical method, known as the hidden Markov model (HMM) (Levinson et al., 1983; Ferguson, 1980; Rabiner, 1989; Rabiner and Juang, 1985), is used to model the spectral variability of each of the basic sounds of the language using a mixture density Gaussian distribution (Juang et al., 1986; Juang, 1985), which is optimally aligned with a speech training set and iteratively updated and improved (the means, variances, and mixture gains are iteratively updated) until an optimal alignment and match is achieved.

Figure 6 shows a simple three-state HMM for modeling the subword unit /s/ as spoken at the beginning of the word /six/. Each HMM state is characterized by a probability density function (usually a mixture Gaussian density) that characterizes the statistical behavior of the feature vectors at the beginning (state s1), middle (state s2), and end (state s3) of the sound /s/. In order to train the HMM for each subword unit, we use a labeled training set of words and sentences and utilize an efficient training procedure known as the Baum-Welch algorithm (Rabiner, 1989; Baum, 1972; Baum et al., 1970) to align each of the various subword units with the spoken inputs, and then estimate the appropriate means, covariances, and mixture gains for the distributions in each subword unit state. The algorithm is a hill-climbing algorithm and is iterated until a stable alignment of subword unit models and speech is obtained, enabling the creation of stable models for each subword unit.

Figure 6 Three-state HMM for the sound /s/.

Figure 7 shows how a simple two-sound word, 'is,' which consists of the sounds /ih/ and /z/, is created by concatenating the models (Lee, 1989) for the /ih/ sound with the model for the /z/ sound, thereby creating a six-state model for the word 'is.'

Figure 7 Concatenated model for the word 'is.'
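The concatenation itself is a simple operation on the transition structure, as in the sketch below, which assumes three-state left-to-right topologies with illustrative transition values (trained emission densities are omitted):

import numpy as np

def left_to_right_hmm(n_states=3):
    # A left-to-right subword HMM skeleton: each state has a self-loop
    # and a forward transition; the last state keeps 0.5 exit mass for
    # whatever follows (the next unit, or the end of the word).
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        A[i, i] = 0.5
        if i + 1 < n_states:
            A[i, i + 1] = 0.5
    return A

def concatenate(models):
    # Chain subword HMMs (e.g., /ih/ then /z/) into one word model.
    n = sum(m.shape[0] for m in models)
    A = np.zeros((n, n))
    off = 0
    for m in models:
        k = m.shape[0]
        A[off:off + k, off:off + k] = m
        if off + k < n:
            A[off + k - 1, off + k] = 0.5  # exit of one unit feeds the next
        off += k
    return A

# six-state transition matrix for the word 'is' = /ih/ + /z/
word_is = concatenate([left_to_right_hmm(), left_to_right_hmm()])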

Figure 8 shows how an HMM can be used to characterize a whole-word model (Lee et al., 1989). In this case, the word is modeled as a sequence of M = 5 HMM states, where each state is characterized by a mixture density, denoted as b_j(x_t), where the model state is the index j, the feature vector at time t is denoted as x_t, and the mixture density is of the form:

b_j(x_t) = \sum_{k=1}^{K} c_{jk} N[x_t; \mu_{jk}, U_{jk}]

where:

x_t = (x_{t1}, x_{t2}, \ldots, x_{tD}), with D = 39
K = number of mixture components in the density function
c_{jk} = weight of the kth mixture component in state j, with c_{jk} \ge 0
N = Gaussian density function
\mu_{jk} = mean vector for mixture k, state j
U_{jk} = covariance matrix for mixture k, state j

\sum_{k=1}^{K} c_{jk} = 1, \quad 1 \le j \le M

\int_{-\infty}^{\infty} b_j(x_t) \, dx_t = 1, \quad 1 \le j \le M

Figure 8 HMM for whole word model with five states.

Included in Figure 8 is an explicit set of state transitions, a_{ij}, which specify the probability of making a transition from state i to state j at each frame, thereby defining the time sequence of the feature vectors over the duration of the word. Usually the self-transitions, a_{ii}, are large (close to 1.0), and the skip-state transitions, a_{13}, a_{24}, a_{35}, are small (close to 0).

Once the set of state transitions and state probability densities are specified, we say that a model λ (which is also used to denote the set of parameters that define the probability measure) has been created for the word or subword unit. (The model λ is often written as λ(A, B, π) to explicitly denote the model parameters, namely A = {a_{ij}, 1 \le i, j \le M}, which is the state transition matrix, B = {b_j(x_t), 1 \le j \le M}, which is the state observation probability density, and π = {π_i, 1 \le i \le M}, which is the initial state distribution.) In order to optimally train the various models (for each word unit [Lee et al., 1989] or subword unit [Lee, 1989]), we need algorithms that perform the following three steps or tasks (Rabiner and Juang, 1985) using the acoustic observation sequence, X, and the model λ:

a. likelihood evaluation: compute P(X|λ)
b. decoding: choose the optimal state sequence for a given speech utterance
c. re-estimation: adjust the parameters of λ to maximize P(X|λ).

Each of these three steps is essential to defining the optimal HMM models for speech recognition based on the available training data, and each task, if approached in a brute-force manner, would be computationally costly. Fortunately, efficient algorithms have been developed to enable efficient and accurate solutions to each of the three steps that must be performed to train and utilize HMM models in a speech recognition system. These are generally referred to as the forward-backward algorithm or the Baum-Welch re-estimation method (Levinson et al., 1983). Details of the Baum-Welch procedure are beyond the scope of this article. The heart of the training procedure for re-estimating model parameters using the Baum-Welch procedure is shown in Figure 9.

Figure 9 The Baum-Welch training procedure.
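As a minimal illustration of the mixture density and of task (a), the sketch below evaluates b_j(x_t) for one state and computes P(X|λ) with the forward recursion; it works in the probability domain for clarity, whereas practical systems use scaling or log arithmetic to avoid underflow on long utterances:

import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, c, mu, U):
    # b_j(x_t) = sum over k of c_jk * N[x_t; mu_jk, U_jk] for one state j
    return sum(c_k * multivariate_normal.pdf(x, mean=mu_k, cov=U_k)
               for c_k, mu_k, U_k in zip(c, mu, U))

def forward_likelihood(A, B, pi):
    # Task (a): P(X | lambda) via the forward recursion.
    #   A  : (M, M) transition matrix a_ij
    #   B  : (N, M) emission likelihoods, B[t, j] = b_j(x_t)
    #   pi : (M,)  initial state distribution
    alpha = pi * B[0]
    for t in range(1, B.shape[0]):
        alpha = (alpha @ A) * B[t]
    return alpha.sum()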
Recently, the fundamental statistical method, while successful for a range of conditions, has been augmented with a number of techniques that attempt to further enhance the recognition accuracy and make the recognizer more robust to different talkers, background noise conditions, and channel effects. One family of such techniques focuses on transformation of the observed or measured features. The transformation is motivated by the need for vocal tract length normalization (e.g., reducing the impact of differences in vocal tract length of various speakers). Another such transformation (called the maximum likelihood linear regression method) can be embedded in the statistical model to account for a potential mismatch between the statistical characteristics of the training data and the actual unknown utterances to be recognized. Yet another family of techniques (e.g., the discriminative training method based on minimum classification error [MCE] or maximum mutual information [MMI]) aims at direct minimization of the recognition error during the parameter optimization stage.

Word Lexicon

The purpose of the word lexicon, or dictionary, is to define the range of pronunciation of words in the task vocabulary (Jurafsky and Martin, 2000; Riley et al., 1999). Such a word lexicon is necessary because the same orthography can be pronounced differently by people with different accents, or because a word has multiple meanings that change its pronunciation according to the context of its use. For example, the word 'data' can be pronounced as /d/ /ae/ /t/ /ax/ or as /d/ /ey/ /t/ /ax/, and we would need both pronunciations in the dictionary to properly train the recognizer models and to properly recognize the word when spoken by different individuals. Another example of variability in pronunciation from orthography is the word 'record,' which can be either a disk that goes on a player, or the process of capturing and storing a signal (e.g., audio or video). The different meanings have significantly different pronunciations. As in the statistical language model, the word lexicon (consisting of sequences of symbols) can be associated with probability assignments, resulting in a probabilistic word lexicon.
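In its simplest form, such a lexicon is just a mapping from orthography to admissible phone sequences, as in the sketch below; the entries are illustrative, and a probabilistic lexicon would attach a weight to each variant:

# a lexicon fragment; phone symbols follow the ARPAbet-style notation
# used above, and the 'record' transcriptions are illustrative
LEXICON = {
    "data":   [("d", "ae", "t", "ax"),
               ("d", "ey", "t", "ax")],
    "record": [("r", "eh", "k", "er", "d"),        # the noun (a disk)
               ("r", "ix", "k", "ao", "r", "d")],  # the verb (to capture)
}

def pronunciations(word):
    # all admissible phone sequences for a word (empty if out of vocabulary)
    return LEXICON.get(word.lower(), [])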

Language Model

The purpose of the language model (Rosenfeld, 2000; Jelinek et al., 1991), or grammar, is to provide a task syntax that defines acceptable spoken input sentences and enables the computation of the probability of the word string, W, given the language model, i.e., P_L(W). There are several methods of creating word grammars, including the use of rule-based systems (i.e., deterministic grammars that are knowledge driven), and statistical methods that compute an estimate of word probabilities from large training sets of textual material. We describe the way in which a statistical N-gram word grammar is constructed from a large training set of text.

Assume we have a large text training set of labeled words. Thus for every sentence in the training set, we have a text file that identifies the words in that sentence. If we consider the class of N-gram word grammars, then we can estimate the word probabilities from the labeled text training set using counting methods. Thus to estimate word trigram probabilities (that is, the probability that a word w_i was preceded by the pair of words (w_{i-1}, w_{i-2})), we compute this quantity as:

P(w_i | w_{i-1}, w_{i-2}) = \frac{C(w_{i-2}, w_{i-1}, w_i)}{C(w_{i-2}, w_{i-1})}

where C(w_{i-2}, w_{i-1}, w_i) is the frequency count of the word triplet (i.e., trigram) (w_{i-2}, w_{i-1}, w_i) that occurred in the training set, and C(w_{i-2}, w_{i-1}) is the frequency count of the word duplet (i.e., bigram) (w_{i-2}, w_{i-1}) that occurred in the training set.

Although the method of training N-gram word grammars, as described above, generally works quite well, it suffers from the problem that the counts of N-grams are often highly in error due to problems of data sparseness in the training set. Hence for a text training set of millions of words, and a word vocabulary of several thousand words, more than 50% of word trigrams are likely to occur either once or not at all in the training set. This leads to gross distortions in the computation of the probability of a word string, as required by the basic Bayesian recognition algorithm. In the cases when a word trigram does not occur at all in the training set, it is unacceptable to define the trigram probability as 0 (as would be required by the direct definition above), since this leads to effectively invalidating all strings with that particular trigram from occurring in recognition.
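The raw counting procedure is easy to sketch; the padding symbols and toy corpus below are illustrative, and the zero returned for unseen trigrams is precisely the sparseness problem just described:

from collections import Counter

def train_trigram(sentences):
    # Estimate P(w_i | w_{i-2}, w_{i-1}) = C(w_{i-2}, w_{i-1}, w_i) / C(w_{i-2}, w_{i-1})
    # by relative-frequency counting over a corpus of word-labeled sentences.
    tri, bi = Counter(), Counter()
    for words in sentences:
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            bi[tuple(padded[i - 2:i])] += 1
    def prob(w2, w1, w):
        # raw maximum-likelihood estimate; unseen trigrams get probability 0
        return tri[(w2, w1, w)] / bi[(w2, w1)] if bi[(w2, w1)] else 0.0
    return prob

# toy usage on a two-sentence corpus
p = train_trigram([["call", "home"], ["call", "the", "bank"]])
print(p("<s>", "<s>", "call"))  # 1.0: both training sentences start with 'call'

In practice, smoothing techniques redistribute some probability mass to unseen N-grams so that no valid word string is assigned zero probability.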
