Research And Simulation On Speech Recognition By Matlab - DiVA Portal


FACULTY OF ENGINEERING AND SUSTAINABLE DEVELOPMENT

Research and simulation on speech recognition by Matlab

Linlin Pan
Dec 2013

Bachelor's Thesis in Electronics
Bachelor's Program in Electronics/Telecommunications
Examiner: Niklas Rothpferffer
Supervisor: Lei Wang

Linlin Pan    Research and simulation on speech recognition by Matlab

Acknowledgements

I would like to express my gratitude to all those who helped me during the thesis work.

First, I would like to thank my examiner, Niklas Rothpferffer, who gave me suggestions for new topics and outlines.

Then, I gratefully acknowledge the help of Doctor Wang, who offered me really valuable advice and guidance with the literature screening and tutoring on the experimental Matlab simulation during the thesis work.

Last, my thanks go to my fellows who helped me record the simulation samples.

Abstract

With the development of multimedia technology, speech recognition has increasingly become a research hotspot in recent years. It has a wide range of applications, including recognizing the identity of speakers, which can be classified into speech identification and speech verification according to the decision mode.

The main work of this thesis is to study the techniques and algorithms of speech recognition, and thereby to create a feasible system to simulate speech recognition. The research work and achievements are as follows.

First, the author investigated the field of speech recognition through adequate research and study. There are many speech recognition algorithms; to sum up, they can be divided into two categories: direct speech recognition, in which the method recognizes the words directly, and recognition based on a training model.

Second, a usable and reasonable algorithm was found and studied. In addition, the author studied the algorithms used to extract a word's characteristic parameters based on MFCC (Mel Frequency Cepstrum Coefficients) and to train the characteristic parameters based on the GMM (Gaussian mixture model).

Third, the author used MATLAB to write a program implementing the speech recognition algorithm, also using the speech processing toolbox. Generally speaking, the whole system includes the modules of signal processing, MFCC characteristic parameter extraction and GMM training.

Fourth, the results were simulated and analyzed. The MATLAB system reads the wav file, plays it first, and then calculates the characteristic parameters automatically. The content of the speech signal is distinguished in the last step.

In this thesis, the author recorded speech from different people to test the system. The simulation results show that when the testing environment is quiet enough and the same speaker records 20 times, the performance of the algorithm approaches 100% for pairs of words with different syllables and with the same syllables. However, the result is affected when the testing signal is surrounded by a certain level of noise. The simulation system does not produce good output when the reference and testing signals are recorded by different speakers.

Table of contents

Acknowledgements
Abstract
Table of contents
1 Introduction
    1.1 Background of Speech Recognition
    1.2 The history and status quo of Speech Recognition
    1.3 Thesis Outline
    1.4 Limitation in experiment
2 Theory
    2.1 Signal sampling
    2.2 Signal Pre-processing
        2.2.1 Endpoint Detection
        2.2.2 Pre emphasis
        2.2.3 Frame Blocking
        2.2.4 Adding Windows
    2.3 The characteristic parameters of speech signal
        2.3.1 MFCC
    2.4 Recognition
        2.4.1 GMM (Gaussian Mixture Model)
    2.5 Tools in the experiment
3 Process and results
    3.1 Process
        3.1.1 Flow chart of the experiment
        3.1.2 Speech recognition system evaluation criterion
    3.2 Result
        3.2.1 Pre Process
        3.2.2 MFCC
        3.2.3 GMM
    3.3 Simulation and Analysis
        3.3.1 Algorithm flow chart
        3.3.2 Simulation result Analysis
4 Discussion
5 Conclusions
Bibliography
Appendix A
    A1. Signal Training
    A2. Signal Testing
    A3. fun GMM EM.m
    A4. func multi gauss.m
    A5. lsum.m
    A6. plotspec.m
Appendix B
    B1. MFCC result
    B2. GMM result

1 Introduction

1.1 Background of Speech Recognition

Language is an important means of human communication. The voice characteristic parameters of different people, such as loudness and voice amplitude, almost always differ. Speech recognition, the focus of this report, is a popular topic nowadays; its applications can be found everywhere and make our lives more efficient. It is therefore meaningful to carry out academic research that adequately interprets and explains the algorithms used to recognize speech characteristics.

Speech recognition technology is the process of extracting characteristic speech information from people's voices, which is then processed by a computer to recognize the content of the speech. It is interdisciplinary: modern speech recognition technology draws on many domains, such as signal processing, information theory, phonetics, linguistics and artificial intelligence. Over the past few decades, scholars have done much research on speech recognition technology. With the development of computers, microelectronics and digital signal processing technology, speech recognition now plays an important role. Using a speech recognition system not only improves the efficiency of daily life, but also makes people's lives more diversified.

1.2 The history and status quo of Speech Recognition

Research on speech recognition technology started in the 1950s. H. Dudley, who had successfully developed the first speech coder, established the basic theory of speech recognition. Following this, J. Rorgie began to research computer voice recognition using English vowels and isolated words in 1959. Meanwhile, Bell Labs invented the language spectrum instrument.

In the 1960s, many methods were proposed for speech recognition research, and they had a significant impact on its development. One of the key research achievements is the time normalization method put forward by Doctor Martin, which can solve the problem of detecting the endpoints of a speech signal [1].

In 1965, Doctor Tukey invented a famous algorithm, the FFT (Fast Fourier Transform), which allows a signal to be studied in the frequency domain. Then, in 1968, two of the most important speech recognition technologies, dynamic programming and linear prediction analysis, were invented. [2]

Many models that are not adopted in this thesis are also significant for speech recognition. They include the Hidden Markov Model (HMM), published by Doctor Baum in the 1970s, in which the speech sequence is constructed based on a Markov chain. The HMM method describes well both the time-varying nature and the stationarity of speech signals, achieves a higher modeling precision, and became the starting point of continuous speech recognition research.

In the meantime, vector quantization (VQ) theory was invented, and linear prediction technology was further refined. In the 1980s, artificial neural network (ANN) technology was applied successfully in the field of speech recognition. The application of artificial neural networks became a new way of researching voice recognition, with the advantages of non-linearity, robustness, fault tolerance and learning characteristics. At the same time, connected speech recognition algorithms were proposed, which moved speech recognition research from the micro to the macro level. [3]

In this period, the most famous research achievement was the continuous speech recognition system SPHINX, proposed by the scholar Lee from Carnegie Mellon University in the United States in 1988. In the first decade of the 21st century, experts researched many new methods of speech recognition in order to use it in embedded devices. Although many problems remain in real applications, speech recognition technology is developing faster and faster.
Recently, research in the field of speech recognition has focused on spoken dialogue systems and embedded speech recognition systems. Meanwhile, there are many speech recognition projects, such as voice recognition, robust speech recognition, speaker adaptation technology, large vocabulary word recognition, speech recognition reliability evaluation algorithms and so on. Speaker adaptation technology has improved greatly in the areas of voice channel normalization, the maximum likelihood linear regression algorithm, the Bayesian adaptive value algorithm, etc.

Speech recognition technology based on HMM is now mature, and more and more people have proposed their own methods based on it to get better performance from various speech recognition algorithms. In this field, Doctor Wang from Tsinghua University has put forward an inhomogeneous improved hidden Markov model for speech recognition. [4] In Doctor Wang's theory, the traditional HMM has some problems in speech recognition applications, and a duration distribution based inhomogeneous hidden Markov model (DDBHMM) is given instead. Professor Zhao has put forward a hidden Markov model using even frames, which can improve robustness in a noisy environment. [4] After decades of optimization and evolution, speech recognition has become quite mature and has spread widely to various applications.

1.3 Thesis Outline

The main goal of this thesis is to use the chosen models for training and processing the signal, and to select appropriate algorithms to analyze the computed data, so as to simulate the speech recognition procedure with different variables based on the researched theory. The report consists of four sections.

The introduction describes the general background, history and status quo of speech recognition technology.

The theory section covers the models of speech recognition technology. It contains signal pre-processing, which describes a procedure to process the signal with endpoint detection, pre-emphasis, framing and windowing, followed by characteristic parameter extraction. The author mainly used Mel Frequency Cepstral Coefficient extraction and a related speech recognition algorithm in the experiment. To analyze the extracted parameters, the Gaussian Mixture Model was used.

The next section describes in detail the process of the experiment based on Matlab. The testing samples consist of 3 pairs of words and numbers, with different variables to represent differences in environment, quantity of samples and syllables of the words. These speech samples were then fed into the MATLAB program with MFCC characteristic parameter extraction and the GMM training model.

Finally, the simulation results are discussed and final conclusions about the algorithm are drawn. The experiment shows that the method in this thesis has relatively good performance. The author also discusses the disadvantages of the algorithm and proposes some recommendations aimed at the deficiencies in the experiment.

1.4 Limitation in experiment

Although the GMM has many advantages, several issues still exist in practical application.

1) The problem of selecting the GMM order

The recognition rate of the system will be low if the GMM order is too small, while an order that is too large causes other problems, such as increased computational complexity and longer recognition time. When the order exceeds a certain value, its further contribution to the performance of the system is basically negligible. It is therefore very hard to select a suitable order. An appropriate GMM order should be selected to balance performance against order, but the choice may still introduce experimental error in the accuracy.

2) The length of training data

Most of the time, it is very difficult to obtain enough training data. When the training data is insufficient, the components of the covariance matrix will be small, and those small values may have a great influence on the performance of the system.

3) The question of orthogonalization of GMM

The covariance matrix of a Gaussian mixture model is usually a full (non-diagonal) matrix, which makes the calculation complicated. In practical application, the author uses a diagonal matrix instead of the original covariance matrix to reduce the computational complexity. In fact, however, the dimensions of the covariance matrix are correlated and mutually conditioned. One solution to this problem is to apply a linear transformation to the vector with respect to the covariance matrix, which both simplifies the calculation and does not ignore the characteristics of each dimension.
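The diagonal-covariance simplification described in point 3 can be illustrated with a minimal Python sketch. It is not the thesis's MATLAB code; the function names `diag_gauss_logpdf` and `gmm_loglik` are hypothetical, and the sketch assumes each mixture component stores only a vector of per-dimension variances. With a diagonal covariance, the log-determinant and the quadratic form both reduce to simple sums over dimensions, so no matrix inversion is needed.

```python
import math

def diag_gauss_logpdf(x, mean, var):
    """Log density of a Gaussian whose covariance is diagonal.

    `var` holds the per-dimension variances (the diagonal entries).
    """
    d = len(x)
    log_det = sum(math.log(v) for v in var)  # log|Sigma| for a diagonal Sigma
    maha = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (d * math.log(2.0 * math.pi) + log_det + maha)

def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of one feature vector under a diagonal-covariance GMM."""
    logs = [math.log(w) + diag_gauss_logpdf(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(l - top) for l in logs))
```

The cost per component grows linearly with the feature dimension instead of quadratically, which is the saving the text refers to; the price is that correlations between dimensions are ignored.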

2 Theory

The processing of a speech signal can be divided into the following stages. First, the language information is produced in the human brain. Second, the brain converts it into language coding, which is then expressed with different volume, pitch, timbre and cycle. Once this coding is completed, other people hear the sound generated by the speaker. Listeners receive the speaker's speech information, extract the parameters of the speech and analyze the spectrum. The spectrum signal is converted into an excitation signal of the auditory nerve by neural sensors, and the auditory nerve then transmits the signal to the brain center, where it is finally converted back into language coding. This is the main process of speech generation and speech recognition in the physical phase. This section covers the theory of how a signal can be simulated and recognized scientifically: the characteristics of speech signals, the pre-processing steps involved in feature extraction, how the characteristics of speech are extracted, and how to understand those characteristics once they are transformed into mathematical coefficients.

2.1 Signal sampling

A speech signal mainly has two characteristics. First, the signal changes with time but shows short-time characteristics, which means that the signal is stable over a very short period of time. Second, the spectral energy of human speech is normally concentrated in the frequency range 0-4000 Hz. [5]

Speech is an analog signal when spoken by a human and is converted to a digital signal when input into a computer. This conversion introduces the most basic theory of signal processing: signal sampling. It provides the principles by which the time-domain analog speech signal X(t) is converted into the discrete-time signal X(n) while keeping the characteristics of the original signal. [5]

To carry out the discretization, the Nyquist theorem is adopted. It requires that the sampling frequency Fs be equal to or larger than twice the highest frequency in the signal in order to sample and rebuild it, which can be written as

$$F_s \ge 2 F_{max}$$

This offers a way to extract enough of the characteristics of the original analog signal at the lowest possible sampling frequency. An inappropriately high sampling frequency yields too many samples ($N = T/\Delta t$) for a signal of a given length T, which increases the computer's workload unnecessarily and takes up too much storage. Conversely, the discrete-time signal will not represent the characteristics of the original signal if the sampling frequency is too low and the sampling points are insufficient. [5]

Since the speech energy lies mostly below 4000 Hz, a sampling frequency of about 8000 Hz is always used, according to the Nyquist theorem $F_s \ge 2 F_{max}$.

2.2 Signal Pre-processing

Voice signal samples could be fed into the recognizer directly, but because the speech signal is non-stationary and the samples are highly redundant, it is very important to pre-process the speech signal to eliminate redundant information and extract useful information. The pre-processing step can improve the performance of speech recognition and enhance its robustness.

[Figure 1: Pre-processing structure: endpoint check, pre-emphasis, framing, windowing] [6] [7]

Figure 1 shows that the pre-processing module includes endpoint checking, pre-emphasis, framing and adding a window. Endpoint checking finds the head and tail of the useful signal, pre-emphasis reduces the dynamic range of the signal, framing divides the speech data into small chunks, and adding a window improves the frequency spectrum of the signal. [5]

2.2.1 Endpoint Detection

Endpoint detection is one of the most significant techniques in speech recognition, and the author used it for speech signal pre-treatment in the experiment. It can be defined as a technique to detect the start and end points of the target signal, so that the testing signal is used more efficiently for training and analysis and gives a more precise recognition result. [8] An ideal endpoint detector is reliable, accurate, adaptive and simple, works in real time and needs no prior noise testing. [6] Generally, there are two methods of endpoint detection: one is based on spectral entropy properties and the other is the double threshold method.
In the method based on spectral entropy, each signal frame is divided into 16 sub-bands, and the sub-bands selected are those that lie between 250 and 4000 Hz and whose energy does not exceed 90% of the total in the frequency spectrum. The energy after speech enhancement and the signal-to-noise ratio of each sub-band are then calculated. Endpoint detection is based on a weighted calculation of the whole spectral entropy, adjusted for the different SNRs. This method is effective for improving the detection rate in noisy environments with low SNR.

The second method, also called the double threshold comparison method, is normally used for single-word detection. It compares the short-time average magnitude of the signal with the short-time average threshold rate. The decision is made from the shape of the average magnitude curve: the short-time average magnitude is judged against a higher threshold T1 and a lower threshold T2, while a further low threshold T3 is set for the short-time average threshold rate. [6]

In the practical experiment, endpoint detection is a compiled program with which the system accurately finds the start and end points, so as to collect the valid data and reduce the processing time and the amount of data for later use. After endpoint detection, the speech signal still contains a large amount of redundant information, so the useful characteristic parameters need to be extracted and the useless information removed. The model parameters, noise model parameters and adaptive filter parameters are calculated from the corresponding signal segments. [8] Generally speaking, the author checks the endpoints of the speech by the average energy, or by the product of the average amplitude value and the zero crossing rate, using the following equations. [6]

Average energy can be defined as:

$$E_n = \sum_{m=0}^{N-1} \left[ w(m)\, x(n+m) \right]^2, \quad 0 \le m \le N-1 \qquad (1)\ [6]$$

where x(n) is the speech signal, N is the length of the frame, m is the sample shift within the frame, and w(m) is the window function, expressed as

$$w(m) = \begin{cases} 1, & 0 \le m \le N-1 \\ 0, & \text{otherwise} \end{cases}$$

A window is added to the signal to avoid the truncation effect when framing, so windowing is necessary when extracting each frame of the signal. It is described in more detail in the next section. [6]

Zero crossing rate is another quantity used during detection. It indicates the number of times that the waveform of a frame of the speech signal crosses the horizontal axis. Zero crossing analysis is one of the simplest methods in time-domain speech analysis. [9] It can be defined as:

$$Z_n = \frac{1}{2} \sum_{m=0}^{N-1} \left| \operatorname{sgn}[x(m)] - \operatorname{sgn}[x(m-1)] \right| w(n-m) \qquad (2)\ [8]$$
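Equations (1) and (2) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the thesis's MATLAB implementation: the function names are hypothetical, a rectangular window is assumed (so w simply selects the N samples of the frame), the frame is taken to start at index `start`, and sgn is taken as +1 for non-negative samples and -1 otherwise. Sign changes are counted only between samples inside the frame.

```python
def sgn(v):
    """Sign function: +1 for v >= 0, -1 for v < 0."""
    return 1 if v >= 0 else -1

def short_time_energy(x, start, n):
    """Equation (1) with a rectangular window: sum of squared samples in the frame."""
    return sum(s * s for s in x[start:start + n])

def zero_crossing_count(x, start, n):
    """Equation (2) with a rectangular window: number of sign changes in the frame.

    Each sign change contributes |sgn - sgn| = 2, hence the division by 2.
    """
    frame = x[start:start + n]
    total = sum(abs(sgn(frame[m]) - sgn(frame[m - 1])) for m in range(1, len(frame)))
    return total // 2
```

For the frame `[1.0, -1.0, 1.0, -1.0]`, `short_time_energy` returns 4.0 and `zero_crossing_count` returns 3, since the sign flips between every pair of neighbouring samples.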

Equation (2) counts the number of times the sign of the signal x changes within the range 0 to N-1. Here sgn[·] is the sign function, defined as

$$\operatorname{sgn}[x] = \begin{cases} 1, & x \ge 0 \\ -1, & x < 0 \end{cases}$$

Because the energy of unvoiced sound is more concentrated in the high-frequency range, its zero crossing rate is higher than that of voiced sound, so the zero crossing rate can be used to distinguish voiced from unvoiced sound. [6]

The author made an example of double threshold detection, shown in Figure 2. It shows the double threshold method detecting the endpoints of a speech signal: the first plot is the original signal, the second plot is the zero crossing rate, and the third plot is the energy.

[Figure 2: Double threshold detecting endpoints of speech signal. Panels: original signal, zero crossing rate, energy; horizontal axis: time] [11]

The following rule detects whether a segment is speech or not. If Zn > ratio (where ratio is a preset zero crossing rate), the segment is a speech signal, that is, the speech head has been found. Conversely, if Zn < ratio, the speech signal is over, which means the speech tail has been found. The signal between head and tail is the useful signal, and the threshold can be adjusted for a very noisy environment. [8] [9]

2.2.2 Pre emphasis

Speech generated from the mouth loses information at high frequencies, so a pre-emphasis process is needed to compensate for the high-frequency loss. Each frame needs to be emphasized by a high-pass filter. In the speech signal spectrum, the higher the frequency, the more serious the loss, which requires some operation on the high-frequency information, namely pre-emphasis. In the speech signal model, pre-emphasis is a first-order high-pass filter. After it, only the vocal tract part of the speech remains, which makes it very simple to analyze the speech parameters. [10]

The transfer function of pre-emphasis can be defined as:

$$H(z) = 1 - \alpha z^{-1} \qquad (3)\ [10]$$

According to the pre-emphasis function $H(z) = 1 - \alpha z^{-1}$ obtained from the literature, the speech signal S(n) can be input into the pre-emphasis module, giving the transformed signal:

$$\tilde{S}(z) = S(z)\, H(z)$$

$$\tilde{S}(z) = S(z)\left(1 - \alpha z^{-1}\right) = S(z) - \alpha z^{-1} S(z)$$

The parameter α is usually between 0.94 and 0.97. [10] Therefore, the signal in the time domain after pre-emphasis can be defined as:

$$\tilde{S}(n) = S(n) - \alpha S(n-1) \qquad (4)$$

Based on this theory, the author can make the spectrum of the speech signal flatter and reduce the dynamic range of the signal. Figure 3 shows the simulation of pre-emphasis.

[Figure 3: The pre-emphasis of the original signal in the time domain. Recovered panel title: pre-emphasis; horizontal axis: time] [11]

Applying the FFT to the pre-emphasized speech signal, Figure 4 shows that after pre-emphasis the high-frequency part of the speech signal is clearly enhanced. This demonstrates the purpose of the pre-emphasis process: to enhance the high-frequency part of the speech signal, compensating for the high-frequency loss caused by lip radiation and the inherent decline of the speech spectrum, and eliminating the impact of lip radiation. [12]

[Figure 4: The pre-emphasis of the original signal in the frequency domain. Panels: original and pre-emphasis; horizontal axis: frequency] [11]

2.2.3 Frame Blocking

Speech is a time-varying signal, which means that the speech signal is a non-linear signal that changes with time. So the linear time-invariant analysis method cannot be used to observe the speech signal directly. In this case, the author cuts the original signal into several small pieces of continuous signal, because the speech signal has the characteristic param
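The time-domain pre-emphasis filter of Equation (4) in Section 2.2.2 can be sketched as follows. This is a minimal Python illustration, not the thesis's MATLAB code: the function name `pre_emphasis` is hypothetical, the default α = 0.95 is simply one value inside the quoted 0.94 to 0.97 range, and the sample before the start of the signal is assumed to be zero, so the first sample passes through unchanged.

```python
def pre_emphasis(s, alpha=0.95):
    """Apply y[n] = s[n] - alpha * s[n-1] to a list of samples.

    The sample before the first one is assumed to be zero,
    so y[0] equals s[0].
    """
    if not s:
        return []
    return [s[0]] + [s[n] - alpha * s[n - 1] for n in range(1, len(s))]
```

A constant (zero-frequency) input such as `[1.0, 1.0, 1.0]` is reduced to about 0.05 after the first sample, while an alternating (highest-frequency) input such as `[1.0, -1.0, 1.0]` grows to about ±1.95, which is the high-pass behaviour the section describes.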
