Introduction To Digital Speech Processing

Upload by : Helen France

Digital Speech Processing, Lecture 1: Introduction to Digital Speech Processing

Speech Processing
Speech is the most natural form of human-human communications.
Speech is related to language; linguistics is a branch of social science.
Speech is related to human physiological capability; physiology is a branch of medical science.
Speech is also related to sound and acoustics, a branch of physical science.
Therefore, speech is one of the most intriguing signals that humans work with every day.
Purpose of speech processing:
– To understand speech as a means of communication;
– To represent speech for transmission and reproduction;
– To analyze speech for automatic recognition and extraction of information;
– To discover some physiological characteristics of the talker.

Why Digital Processing of Speech?
– digital processing of speech signals (DPSS) enjoys an extensive theoretical and experimental base developed over the past 75 years
– much research has been done since 1965 on the use of digital signal processing in speech communication problems
– highly advanced implementation technology (VLSI) exists that is well matched to the computational demands of DPSS
– there are abundant applications that are in widespread use commercially

The Speech Stack
– Speech Applications: coding, synthesis, recognition, understanding, verification, language translation, speed-up/slow-down
– Speech Algorithms: speech-silence (background), voiced-unvoiced decision, pitch detection, formant estimation
– Speech Representations: temporal, spectral, homomorphic, LPC
– Fundamentals: acoustics, linguistics, pragmatics, speech perception

Speech Applications
We look first at the top of the speech processing stack, namely applications:
– speech coding
– speech synthesis
– speech recognition and understanding
– other speech applications

Speech Coding: Encoding
[Block diagram: speech x_c(t) (continuous-time signal) → A-to-D Converter → x[n] (sampled signal) → Analysis/Coding → y[n] (transformed representation) → Compression → ŷ[n] (data bits) → Channel or Medium]

Speech Coding
Speech Coding is the process of transforming a speech signal into a representation for efficient transmission and storage of speech
– narrowband and broadband wired telephony
– cellular communications
– Voice over IP (VoIP) to utilize the Internet as a real-time communications medium
– secure voice for privacy and encryption for national security applications
– extremely narrowband communications channels, e.g., battlefield applications using HF radio
– storage of speech for telephone answering machines, IVR systems, prerecorded messages

Demo of Speech Coding
Narrowband Speech Coding:
– 64 kbps PCM
– 32 kbps ADPCM
– 16 kbps LD-CELP
– 8 kbps CELP
– 4.8 kbps FS1016
– 2.4 kbps LPC10E
Wideband Speech Coding (male talker / female talker):
– 3.2 kHz, uncoded
– 7 kHz, uncoded
– 7 kHz, 64 kbps
– 7 kHz, 32 kbps
– 7 kHz, 16 kbps
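To put the narrowband demo rates in perspective, the sketch below expresses each one relative to 64 kbps PCM and as an average number of bits per sample. The 8 kHz sampling rate is an assumption (the telephone-quality figure quoted elsewhere in this lecture), and the names `CODER_RATES_BPS` and `summarize` are illustrative only.

```python
# Narrowband coder bit rates listed on the demo slide (bits per second).
CODER_RATES_BPS = {
    "PCM": 64_000,
    "ADPCM": 32_000,
    "LD-CELP": 16_000,
    "CELP": 8_000,
    "FS1016": 4_800,
    "LPC10E": 2_400,
}

# Assumption: telephone-band speech sampled at 8 kHz (not stated on this slide).
SAMPLE_RATE_HZ = 8_000


def summarize(rates_bps, sample_rate_hz=SAMPLE_RATE_HZ):
    """Return (name, reduction factor vs. PCM, average bits per sample) tuples."""
    reference = rates_bps["PCM"]
    return [
        (name, reference / rate, rate / sample_rate_hz)
        for name, rate in rates_bps.items()
    ]


summary = summarize(CODER_RATES_BPS)
for name, factor, bits in summary:
    print(f"{name:8s} {factor:5.1f}x vs. PCM, {bits:.2f} bits/sample")
```

Under this assumption the lowest-rate coder in the list spends well under one bit per sample on average, which is only possible because such coders transmit model parameters rather than individual waveform samples.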

Demo of Audio Coding
CD Original (1.4 Mbps) versus MP3-coded at 128 kbps
– female vocal
– trumpet selection
– orchestra
– baroque
– guitar
Can you determine which is the uncoded and which is the coded audio for each selection?

Audio Coding
– Female vocal: MP3-128 kbps coded, CD original
– Trumpet selection: CD original, MP3-128 kbps coded
– Orchestral selection: MP3-128 kbps coded
– Baroque: CD original, MP3-128 kbps coded
– Guitar: MP3-128 kbps coded, CD original


Speech Synthesis
Synthesis of Speech is the process of generating a speech signal using computational means for effective human-machine interactions
– machine reading of text or email messages
– telematics feedback in automobiles
– talking agents for automatic transactions
– automatic agent in customer care call center
– handheld devices such as foreign language phrasebooks, dictionaries, crossword puzzle helpers
– announcement machines that provide information such as stock quotes, airline schedules, weather reports, etc.

Speech Synthesis Examples
– Soliloquy from Hamlet
– Gettysburg Address
– Third Grade Story
Audio clips: 1964-lrr, 2002-tts

Pattern Matching Problems
[Block diagram: speech → A-to-D Converter → ... ; Reference Patterns]
– speech recognition
– speaker recognition
– speaker verification
– word spotting
– automatic indexing of speech recordings

Speech Recognition and Understanding
Recognition and Understanding of Speech is the process of extracting usable linguistic information from a speech signal in support of human-machine communication by voice
– command and control (C&C) applications, e.g., simple commands for spreadsheets, presentation graphics, appliances
– voice dictation to create letters, memos, and other documents
– natural language voice dialogues with machines to enable Help desks, Call Centers
– voice dialing for cellphones and from PDAs and other small devices
– agent services such as calendar entry and update, address list modification and entry, etc.

Speech Recognition Demos

Speech Recognition Demos

Dictation Demo

Other Speech Applications
– Speaker Verification for secure access to premises, information, virtual spaces
– Speaker Recognition for legal and forensic purposes (national security); also for personalized services
– Speech Enhancement for use in noisy environments, to eliminate echo, to align voices with video segments, to change voice qualities, to speed up or slow down prerecorded speech (e.g., talking books, rapid review of material, careful scrutinizing of spoken material), and potentially to improve intelligibility and naturalness of speech
– Language Translation to convert spoken words in one language to another to facilitate natural language dialogues between people speaking different languages, e.g., tourists, business people

DSP/Speech Enabled Devices
– Internet Audio
– Digital Cameras
– PDAs & Streaming Audio/Video
– Hearing Aids
– Cell Phones

Apple iPod
– stores music in MP3, AAC, MP4, WMA, WAV audio formats
– compression of 11-to-1 for 128 kbps MP3
– can store order of 20,000 songs with 30 GB disk
– can use flash memory to eliminate all moving memory access
– can load songs from iTunes store (more than 1.5 billion downloads)
– tens of millions sold
[Block diagram: Memory → x[n] → Computer → y[n] → D-to-A → y_c(t)]
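The 11-to-1 compression figure can be checked with a few lines of arithmetic. The sketch below assumes standard CD audio parameters (44.1 kHz sampling, 16 bits per sample, two channels); these values are not given on the slide and are labeled as assumptions in the code.

```python
# Assumptions: standard CD audio parameters, not stated on the slide.
CD_SAMPLE_RATE_HZ = 44_100
CD_BITS_PER_SAMPLE = 16
CD_CHANNELS = 2

# MP3 bit rate quoted on the slide.
MP3_RATE_BPS = 128_000

# Uncompressed CD bit rate: samples/s x bits/sample x channels.
cd_rate_bps = CD_SAMPLE_RATE_HZ * CD_BITS_PER_SAMPLE * CD_CHANNELS
compression_ratio = cd_rate_bps / MP3_RATE_BPS

print(f"CD rate: {cd_rate_bps / 1e6:.2f} Mbps")
print(f"Compression: {compression_ratio:.1f}-to-1")
```

With these assumed parameters the uncompressed rate comes out at about 1.41 Mbps, consistent with the "1.4 Mbps" CD figure in the audio coding demo, and the ratio is about 11-to-1.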

One of the Top DSP Applications: Cellular Phone

Digital Speech Processing
Need to understand the nature of the speech signal, and how DSP techniques, communication technologies, and information theory methods can be applied to help solve the various application scenarios described above
– most of the course will concern itself with speech signal processing, i.e., converting one type of speech signal representation to another so as to uncover various mathematical or practical properties of the speech signal and do appropriate processing to aid in solving both fundamental and deep problems of interest

Speech Signal Production
[Block diagram: Message Source (M) → W → S → Acoustic Propagation (A) → Electronic Transduction → Speech Waveform (X)]
– M: idea encapsulated in a message, M
– W: message, M, realized as a word sequence, W
– S: words realized as a sequence of (phonemic) sounds, S
– A: sounds received at the transducer through acoustic ambient, A
– X: signals converted from acoustic to electric, transmitted, distorted and received as X
Conventional studies of speech science use speech signals recorded in a sound booth with little interference or distortion.
Practical applications require use of realistic or "real world" speech with noise and distortions.

Speech Production/Generation Model
– Message Formulation → desire to communicate an idea, a wish, a request; express the message as a sequence of words
  Desire to Communicate → Message Formulation → Text String (Discrete Symbols)
  Example text strings: "I need some string", "Please get me some string", "Where can I buy some string"
– Language Code → need to convert chosen text string to a sequence of sounds in the language that can be understood by others; need to give some form of emphasis, prosody (tune, melody) to the spoken sounds so as to impart non-speech information such as sense of urgency, importance, psychological state of talker, environmental factors (noise, echo)
  Text String → Language Code Generator, using a Pronunciation Vocabulary (In The Brain) → Phoneme string with prosody (Discrete Symbols)

Speech Production/Generation Model
– Neuro-Muscular Controls → need to direct the neuro-muscular system to move the articulators (tongue, lips, teeth, jaws, velum) so as to produce the desired spoken message in the desired manner
  Phoneme String with Prosody → ...
– Vocal Tract System → need to shape the human vocal tract system and provide the appropriate sound sources to create an acoustic waveform (speech) that is understandable in the environment in which it is spoken
  Articulatory Motions → Vocal Tract System, with source control (lungs, diaphragm, chest muscles) → Acoustic Waveform (Speech) (Continuous control)

The Speech Signal
[Waveform figure; labels: Background Signal, Pitch Period, Unvoiced Signal (noise-like sound)]

Speech Perception Model
– The acoustic waveform impinges on the ear (the basilar membrane) and is spectrally analyzed by an equivalent filter bank of the ear
  Acoustic Waveform → Spectral Representation (Continuous Control)
– The signal from the basilar membrane is neurally transduced and coded into features that can be decoded by the brain
  Spectral Features → Neural Transduction → Sound Features
– The brain decodes the feature stream into sounds, words and sentences
  Sound Features → ... → Phonemes, Words, and Sentences (Discrete Message)
– The brain determines the meaning of the words via a message understanding mechanism
  Phonemes, Words and Sentences → Message Understanding → Basic Message (Discrete Message)

The Speech Chain
[Block diagram: Message Formulation → (Text) → Language Code → (Phonemes, Prosody) → Neuro-Muscular Controls → (Articulatory Motions) → Vocal Tract System → Transmission Channel]
Information rates listed along the chain, from discrete input to continuous output: 50 bps, 200 bps, 2000 bps, 30-50 kbps

The Speech Chain

Speech Sciences
– Linguistics: science of language, including phonetics, phonology, morphology, and syntax
– Phonemes: smallest set of units considered to be the basic set of distinctive sounds of a language (20-60 units for most languages)
– Phonemics: study of phonemes and phonemic systems
– Phonetics: study of speech sounds and their production, transmission, and reception, and their analysis, classification, and transcription
– Phonology: phonetics and phonemics together
– Syntax: grammatical structure of an utterance (the meaning of an utterance is the domain of semantics)

The Speech Circle
– Customer voice request → ASR (Automatic Speech Recognition) → words spoken: "I dialed a wrong number"
– SLU (Spoken Language Understanding) → meaning: "Billing credit"
– DM & SLG (Dialog Management (Actions) and Spoken Language Generation (Words)) → what's next? "Determine correct number"
– TTS (Text-to-Speech Synthesis) → voice reply to customer: "What number did you want to call?"
[The four stages are arranged in a circle around a central "Data" block]

Information Rate of Speech
From a Shannon view of information:
– message content/information: 2^6 symbols (phonemes) in the language; 10 symbols/sec for normal speaking rate → 60 bps is the equivalent information rate for speech (issues of phoneme probabilities, phoneme correlations)
From a communications point of view:
– speech bandwidth is between 4 kHz (telephone quality) and 8 kHz (wideband hi-fi speech); need to sample speech at between 8 and 16 kHz, and need about 8 (log encoded) bits per sample for high quality encoding → 8000 x 8 = 64,000 bps (telephone) to 16000 x 8 = 128,000 bps (wideband)
1000-2000 times change in information rate from discrete message symbols to waveform encoding → can we achieve this three orders of magnitude reduction in information rate on real speech waveforms?
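The arithmetic on this slide can be reproduced directly. The sketch below is a minimal check, assuming equiprobable and independent phonemes (so that each symbol carries log2(64) = 6 bits) and sampling at exactly twice the signal bandwidth; the function name `waveform_rate_bps` is illustrative, not from the lecture.

```python
import math

PHONEME_SYMBOLS = 2 ** 6   # symbols (phonemes) in the language, from the slide
SYMBOLS_PER_SECOND = 10    # normal speaking rate, from the slide

# Assumption: equiprobable, independent symbols, so each carries log2(N) bits.
# Real phoneme probabilities and correlations would lower this figure.
bits_per_symbol = math.log2(PHONEME_SYMBOLS)
message_rate_bps = bits_per_symbol * SYMBOLS_PER_SECOND


def waveform_rate_bps(bandwidth_hz, bits_per_sample=8):
    """Bit rate when sampling at twice the bandwidth with fixed bits per sample."""
    sample_rate_hz = 2 * bandwidth_hz
    return sample_rate_hz * bits_per_sample


telephone_bps = waveform_rate_bps(4_000)   # 4 kHz bandwidth
wideband_bps = waveform_rate_bps(8_000)    # 8 kHz bandwidth

print(f"message rate:   {message_rate_bps:.0f} bps")
print(f"telephone rate: {telephone_bps} bps ({telephone_bps / message_rate_bps:.0f}x)")
print(f"wideband rate:  {wideband_bps} bps ({wideband_bps / message_rate_bps:.0f}x)")
```

The two ratios come out near 1,070 and 2,130, which is where the "1000-2000 times" gap between the message rate and the waveform rate comes from.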

[Block diagram: Information Source (human speaker, lots of variability) → Measurement or Observation (acoustic waveform / articulatory positions / neural control signals) → Signal Representation → Signal Transformation → Extraction and Utilization of Information (human listeners, machines). Signal Representation and Signal Transformation together form Signal Processing, the purpose of the course.]

Digital Speech Processing
DSP:
– obtaining discrete representations of speech signal
– theory, design and implementation of numerical procedures (algorithms) for processing the discrete representation in order to achieve a goal (recognizing the signal, modifying the time scale of the signal, removing background noise from the signal, etc.)
Why:
– ...
– real-time implementations on inexpensive DSP chips
– ability to integrate with multimedia and data
– encryptability/security of the data and the data representations via suitable techniques

Hierarchy of Digital Speech Processing
Representation of Speech Signals
– Waveform Representations: preserve wave shape through sampling ...
– ... : represent signal as output of a speech production model
  – ... Parameters: pitch, voiced/unvoiced, noise, transients
  – Vocal Tract Parameters: spectral, articulatory

Information Rate of Speech
[Figure: data rate (bits per second) axis with ticks at 200,000, 60,000 and 20,000; labels include "LDM, PCM, DPCM, ...", "Printed Text", "(No Source Coding)" and "(Source ..."]

Speech Processing ...
[Figure; surviving application labels, grouped as extracted:]
– ..., encryption, secrecy, seamless voice and data
– messages, IVR, ...
– ..., command and control, agents, NL voice dialogues, call centers, help desks
– readings for the blind, speed-up and slow-down of speech rates
– noise and echo removal, alignment of speech and text

The Speech Stack

Intelligent Robot?
http://www.youtube.com/watch?v=uvcQCJpZJH8

Speak 4 It (AT&T Labs)
Courtesy: Mazin Rahim

What We Will Be Learning
– review some basic DSP concepts
– speech production model: acoustics, articulatory concepts, speech production models
– speech perception model: ear models, auditory signal processing, equivalent acoustic processing models
– time domain processing concepts: speech properties, pitch, voiced-unvoiced, energy, autocorrelation, zero-crossing rates
– short time Fourier analysis methods: digital filter banks, spectrograms, analysis-synthesis systems, vocoders
– homomorphic speech processing: cepstrum, pitch detection, formant estimation, homomorphic vocoder
– linear predictive coding methods: autocorrelation method, covariance method, lattice methods, relation to vocal tract models
– speech waveform coding and source models: delta modulation, PCM, mu-law, ADPCM, vector quantization, multipulse coding, CELP coding
– methods for speech synthesis and text-to-speech systems: physical models, formant models, articulatory models, concatenative models
– methods for speech recognition: the Hidden Markov Model (HMM)
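As a small preview of the time domain processing topic, the sketch below computes two of the measurements named in the list, short-time energy and zero-crossing rate, on one frame of signal. It is only an illustration under stated assumptions: the 8 kHz sampling rate, the 30 ms frame length, and the synthetic test signals (a 100 Hz sine standing in for voiced speech, low-level random noise standing in for unvoiced speech) are not from the lecture, and the function names are hypothetical.

```python
import math
import random

SAMPLE_RATE_HZ = 8_000   # assumed telephone-band sampling rate
FRAME_LEN = 240          # assumed 30 ms analysis frame at 8 kHz


def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


# Synthetic stand-ins (assumptions, not real speech):
# a periodic, high-amplitude "voiced-like" frame ...
voiced_like = [
    math.sin(2 * math.pi * 100 * n / SAMPLE_RATE_HZ) for n in range(FRAME_LEN)
]
# ... and a noise-like, low-amplitude "unvoiced-like" frame.
rng = random.Random(0)
unvoiced_like = [rng.uniform(-0.1, 0.1) for _ in range(FRAME_LEN)]

for label, frame in (("voiced-like", voiced_like), ("unvoiced-like", unvoiced_like)):
    energy = short_time_energy(frame)
    zcr = zero_crossing_rate(frame)
    print(f"{label:14s} energy={energy:.4f} zcr={zcr:.3f}")
```

On these synthetic frames the periodic signal shows high energy and a low zero-crossing rate, while the noise-like signal shows the opposite pattern; contrasts of this kind are what simple voiced-unvoiced decisions exploit, although real speech requires more careful thresholds.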
