Time Domain Methods In Speech Processing

Upload by : Javier Atchley
Transcription

Digital Speech Processing: Lectures 7-8
Time Domain Methods in Speech Processing

General Synthesis Model
[Figure: block diagram of the general speech synthesis model. Recoverable labels: voiced sound amplitude; unvoiced sound amplitude; T1, T2; vocal tract parameters (log areas, reflection coefficients, formants, vocal tract polynomial, articulatory parameters); radiation model $R(z) = 1 - \alpha z^{-1}$]
- pitch detection, voiced/unvoiced/silence detection, gain estimation, vocal tract parameter estimation, glottal pulse shape, radiation model

General Analysis Model
[Figure: s[n] → Speech Analysis Model → pitch period, T[n]; glottal pulse shape, g[n]; voiced amplitude, A_V[n]; V/U/S[n] switch; unvoiced amplitude, A_U[n]; vocal tract IR, v[n]; radiation characteristic, r[n]]
- all analysis parameters are time-varying at rates commensurate with information in the parameters
- we need algorithms for estimating the analysis parameters and their variations over time

Overview
[Figure: speech, x[n] → signal processing → representation of speech, A(x,t) (speech or music)]
- representations: formants, reflection coefficients, voiced-unvoiced-silence, pitch, sounds of language, speaker identification, emotions
- time domain processing: direct operations on the speech waveform
- frequency domain processing: direct operations on a spectral representation of the signal
[Figure: x[n] → system → zero crossing rate, level crossing rate, energy, autocorrelation]
- simple processing enables various types of feature estimation

Basics
[Figure: sampled speech waveform. Recoverable labels: V, P1, P2, U/S]
- 8 kHz sampled speech (bandwidth 4 kHz)
- properties of speech change with time
- excitation goes from voiced to unvoiced
- peak amplitude varies with the sound being produced
- pitch varies within and across voiced sounds
- periods of silence where background signals are seen
- the key issue is whether we can create simple time-domain processing methods that enable us to measure/estimate speech representations reliably and accurately

Fundamental Assumptions
- properties of the speech signal change relatively slowly with time (5-10 sounds per second)
  – over very short (5-20 msec) intervals: uncertainty due to small amount of data, varying pitch, varying amplitude
  – over medium length (20-100 msec) intervals: uncertainty due to changes in sound quality, transitions between sounds, rapid transients in speech
  – over long (100-500 msec) intervals: uncertainty due to large amount of sound changes
- there is always uncertainty in short-time measurements and estimates from speech signals

Compromise Solution
- “short-time” processing methods: short segments of the speech signal are “isolated” and “processed” as if they were short segments from a “sustained” sound with fixed (non-time-varying) properties
  – this short-time processing is periodically repeated for the duration of the waveform
  – these short analysis segments, or “analysis frames”, almost always overlap one another
  – the results of short-time processing can be a single number (e.g., an estimate of the pitch period within the frame), or a set of numbers (an estimate of the formant frequencies for the analysis frame)
  – the end result of the processing is a new, time-varying sequence that serves as a new representation of the speech signal

Frame-by-Frame Processing in Successive Windows
[Figure: Frames 1 to 5 over the waveform, 75% frame overlap]
- frame length L, frame shift R = L/4
- Frame 1: samples 0, 1, ..., L-1, i.e. {x[0], x[1], ..., x[L-1]}
- Frame 2: samples R, R+1, ..., R+L-1, i.e. {x[R], x[R+1], ..., x[R+L-1]}
- Frame 3: samples 2R, 2R+1, ..., 2R+L-1, i.e. {x[2R], x[2R+1], ..., x[2R+L-1]}
- Frame 4: samples 3R, 3R+1, ..., 3R+L-1
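The frame indexing above can be sketched in a few lines of Python. This is only an illustration: the function names `frame_starts` and `split_frames` are hypothetical, and the sketch assumes that only complete frames of length L are kept (the slides do not say how a final partial frame is handled).

```python
def frame_starts(num_samples, L, R):
    """Start indices 0, R, 2R, ... of every complete frame of length L."""
    if L <= 0 or R <= 0:
        raise ValueError("L and R must be positive")
    return list(range(0, num_samples - L + 1, R))


def split_frames(x, L, R):
    """Frame k holds samples x[kR], x[kR+1], ..., x[kR+L-1]."""
    return [x[s:s + L] for s in frame_starts(len(x), L, R)]


if __name__ == "__main__":
    x = list(range(20))              # stand-in for speech samples
    frames = split_frames(x, L=8, R=2)   # R = L/4, i.e. 75% overlap
    for k, frame in enumerate(frames, start=1):
        print("Frame", k, frame)
```

With R = L/4 each sample away from the edges is covered by four frames, which is what the 75% overlap figure shows.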

Frame-by-Frame Processing in Successive Windows
[Figure: Frames 1 to 4 over the waveform, 50% frame overlap]
- speech is processed frame-by-frame in overlapping intervals until the entire region of speech is covered by at least one such frame
- results of analysis of individual frames are used to derive model parameters in some manner
- representation goes from time sample x[n], n = ..., 0, 1, 2, ... to parameter vector f[m], m = 0, 1, 2, ..., where n is the time index and m is the frame index

Frames and Windows
- F_S = 16,000 samples/second
- L = 641 samples (equivalent to 40 msec frame (window) length)
- R = 240 samples (equivalent to 15 msec frame (window) shift)
- frame rate of 66.7 frames/second
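The numbers on this slide follow directly from L, R and the sampling rate. A minimal sketch of the arithmetic (the function name `frame_parameters` and the dictionary keys are made up for this example):

```python
def frame_parameters(fs_hz, L, R):
    """Convert frame length and shift in samples to durations and a frame rate."""
    return {
        "frame_ms": 1000.0 * L / fs_hz,    # window duration in msec
        "shift_ms": 1000.0 * R / fs_hz,    # frame shift in msec
        "frames_per_sec": fs_hz / R,       # one frame every R samples
    }


if __name__ == "__main__":
    print(frame_parameters(16000.0, L=641, R=240))
```

For the slide's values the shift is exactly 15 msec and the frame rate is 16000/240, about 66.7 frames/second; the 641-sample window is 40.06 msec, which the slide rounds to 40 msec.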

Short-Time Processing
[Figure: speech waveform, x[n] → short-time processing → speech representation, f[m]]
- x[n]: samples at 8000/sec rate (e.g., 2 seconds of 4 kHz bandlimited speech, x[n], 0 ≤ n ≤ 16000)
- f[m] = {f_1[m], f_2[m], ..., f_L[m]}: vectors at 100/sec rate, 1 ≤ m ≤ 200
- L is the size of the analysis vector (e.g., 1 for pitch period estimate, 12 for autocorrelation estimates, etc.)

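Generic Short-Time Processing
$$Q_{\hat n} = \left. \sum_{m=-\infty}^{\infty} T(x[m])\, \tilde w[n-m] \right|_{n=\hat n}$$
[Figure: x[n] → T( ) (linear or non-linear transformation) → T(x[n]) → $\tilde w[n]$ (window sequence, usually finite length) → $Q_{\hat n}$]
- $Q_{\hat n}$ is a sequence of local weighted average values of the sequence T(x[n]) at time $n = \hat n$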

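Short-Time Energy
$$E = \sum_{m=-\infty}^{\infty} x^2[m]$$
-- this is the long term definition of signal energy
-- there is little or no utility of this definition for time-varying signals
$$E_{\hat n} = \sum_{m=\hat n-L+1}^{\hat n} x^2[m] = x^2[\hat n-L+1] + \dots + x^2[\hat n]$$
-- short-time energy in vicinity of time $\hat n$
$$T(x) = x^2, \qquad \tilde w[n] = \begin{cases} 1, & 0 \le n \le L-1 \\ 0, & \text{otherwise} \end{cases}$$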
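As an illustration of the short-time energy with a rectangular window, here is a minimal Python sketch. The function name `short_time_energy` and the choice to start at the first index where a full window fits (n̂ = L-1) are assumptions of this example, not something the slides specify.

```python
def short_time_energy(x, L, R=1):
    """E[n_hat] = x[n_hat-L+1]^2 + ... + x[n_hat]^2 for n_hat = L-1, L-1+R, ..."""
    energies = []
    for n_hat in range(L - 1, len(x), R):
        segment = x[n_hat - L + 1:n_hat + 1]      # the L most recent samples
        energies.append(sum(v * v for v in segment))
    return energies


if __name__ == "__main__":
    print(short_time_energy([1, 2, 3, 4], L=2))   # windows (1,2), (2,3), (3,4)
```

Setting R greater than 1 evaluates the same sum only every R samples, which is the decimated sequence used in frame-by-frame processing.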

Computation of Short-Time Energy
[Figure: window $\tilde w[n-m]$ covering samples n-L+1 through n]
- window jumps/slides across the sequence of squared values, selecting an interval for processing
- what happens to $E_{\hat n}$ as the window jumps by 2, 4, 8, ..., L samples? ($E_{\hat n}$ is a lowpass function, so it can be decimated without loss of information; why is $E_{\hat n}$ lowpass?)
- effects of decimation depend on L; if L is small, then $E_{\hat n}$ is a lot more variable than if L is large (window bandwidth changes with L!)

Effects of Window
$$Q_{\hat n} = \left. T(x[n]) * \tilde w[n] \right|_{n=\hat n} = \left. x'[n] * \tilde w[n] \right|_{n=\hat n}$$
- $\tilde w[n]$ serves as a lowpass filter on T(x[n]), which often has a lot of high frequencies (most non-linearities introduce significant high frequency energy; think of what $x[n] \cdot x[n]$ does in frequency)
- often we extend the definition of $Q_{\hat n}$ to include a pre-filtering term so that x[n] itself is filtered to a region of interest
[Figure: $\hat x[n]$ → Linear Filter → x[n] → T( ) → T(x[n]) → $\tilde w[n]$ → $Q_{\hat n} = Q_n|_{n=\hat n}$]

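Short-Time Energy
- serves to differentiate voiced and unvoiced sounds in speech from silence (background signal)
- natural definition of energy of weighted signal is:
$$E_{\hat n} = \sum_{m=-\infty}^{\infty} \left( x[m]\, \tilde w[\hat n-m] \right)^2 \quad \text{(sum of squares of portion of signal)}$$
-- concentrates measurement at sample $\hat n$, using weighting $\tilde w[\hat n-m]$
$$E_{\hat n} = \sum_{m=-\infty}^{\infty} x^2[m]\, \tilde w^2[\hat n-m] = \sum_{m=-\infty}^{\infty} x^2[m]\, h[\hat n-m], \qquad h[n] = \tilde w^2[n]$$
[Figure: short-time energy computation. x[n] (rate F_S) → ( )^2 → x^2[n] (rate F_S) → h[n] → $E_{\hat n} = E_n|_{n=\hat n}$ (rate F_S/R)]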

Short-Time Energy Properties
- depends on choice of h[n], or equivalently, window w[n]
  – if w[n] duration is very long and constant amplitude (w[n] = 1, n = 0, 1, ..., L-1), E_n would not change much over time, and would not reflect the short-time amplitudes of the sounds of the speech
  – very long duration windows correspond to narrowband lowpass filters
  – want E_n to change at a rate comparable to the changing sounds of the speech; this is the essential conflict in all speech processing, namely we need a short duration window to be responsive to rapid sound changes, but short windows will not provide sufficient averaging to give a smooth and reliable energy function

Windows
- consider two windows, w[n]
  – rectangular window: h[n] = 1, 0 ≤ n ≤ L-1 and 0 otherwise
  – Hamming window (raised cosine window): h[n] = 0.54 - 0.46 cos(2πn/(L-1)), 0 ≤ n ≤ L-1 and 0 otherwise
  – rectangular window gives equal weight to all L samples in the window (n, ..., n-L+1)
  – Hamming window gives most weight to middle samples and tapers off strongly at the beginning and the end of the window
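The two window definitions can be written out directly. The sketch below is illustrative; the function names are hypothetical, and the L = 1 special case is an added guard (the formula divides by L-1) rather than something taken from the slides.

```python
import math


def rectangular_window(L):
    """h[n] = 1 for 0 <= n <= L-1."""
    return [1.0] * L


def hamming_window(L):
    """h[n] = 0.54 - 0.46*cos(2*pi*n/(L-1)) for 0 <= n <= L-1."""
    if L == 1:
        return [1.0]                  # guard: avoids dividing by L-1 = 0
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (L - 1))
            for n in range(L)]


if __name__ == "__main__":
    w = hamming_window(21)
    print("edge:", round(w[0], 4), "middle:", round(w[10], 4))
```

For L = 21 the end samples are weighted by 0.08 and the middle sample by 1.0, which is the tapering described in the last bullet.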

Rectangular and Hamming Windows
[Figure: rectangular and Hamming windows, L = 21 samples]

Window Frequency Responses
- rectangular window
$$H(e^{j\Omega T}) = \frac{\sin(\Omega L T/2)}{\sin(\Omega T/2)}\, e^{-j\Omega T (L-1)/2}$$
- first zero occurs at f = F_s/L = 1/(LT) (or Ω = 2π/(LT)): nominal cutoff frequency of the equivalent “lowpass” filter
- Hamming window
$$\tilde w_H[n] = 0.54\, \tilde w_R[n] - 0.46 \cos(2\pi n/(L-1))\, \tilde w_R[n]$$
- can decompose the Hamming window frequency response into a combination of three terms
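The location of the first zero can be checked numerically by evaluating the window's frequency response with a direct sum. This is a sketch under stated assumptions: `dtft_magnitude` and `hamming_window` are names invented for the example, and the sampling rate of 10 kHz is only a convenient choice.

```python
import cmath
import math


def dtft_magnitude(window, f_hz, fs_hz):
    """|H(e^{j Omega T})| at frequency f_hz, by direct summation over the window."""
    omega_t = 2.0 * math.pi * f_hz / fs_hz        # Omega * T in radians/sample
    total = sum(w * cmath.exp(-1j * omega_t * n) for n, w in enumerate(window))
    return abs(total)


def hamming_window(L):
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (L - 1)) for n in range(L)]


if __name__ == "__main__":
    fs, L = 10000.0, 21
    rect = [1.0] * L
    print("rect at DC:     ", dtft_magnitude(rect, 0.0, fs))
    print("rect at Fs/L:   ", dtft_magnitude(rect, fs / L, fs))
    print("Hamming at Fs/L:", dtft_magnitude(hamming_window(L), fs / L, fs))
```

The rectangular response is numerically zero at F_s/L, while the Hamming response is still clearly nonzero there, since the first zero of the Hamming window lies at a higher frequency.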

RW and HW Frequency Responses
[Figure: log magnitude response of RW and HW]
- bandwidth of HW is approximately twice the bandwidth of RW
- attenuation of more than 40 dB for HW outside passband, versus 14 dB for RW
- stopband attenuation is essentially independent of L, the window duration; increasing L simply decreases window bandwidth
- L needs to be larger than a pitch period (or severe fluctuations will occur in E_n), but smaller than a sound duration (or E_n will not adequately reflect the changes in the speech signal)
There is no perfect value of L, since a pitch period can be as short as 20 samples (500 Hz at a 10 kHz sampling rate) for a high pitch child or female, and up to 250 samples (40 Hz pitch at a 10 kHz sampling rate) for a low pitch male; a compromise value of L on the order of 100-200 samples for a 10 kHz sampling rate is often used in practice.

Window Frequency Responses
[Figure: frequency responses of rectangular windows, L = 21, 41, 61, 81, 101, and Hamming windows, L = 21, 41, 61, 81, 101]

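Short-Time Energy
Short-time energy computation:
$$E_{\hat n} = \sum_{m=-\infty}^{\infty} \left( x[m]\, w[\hat n-m] \right)^2 = \sum_{m=-\infty}^{\infty} (x[m])^2\, \tilde w[\hat n-m]$$
(in this form $\tilde w[n] = w^2[n]$ is the squared window)
For L-point rectangular window, $\tilde w[m] = 1$, m = 0, 1, ..., L-1, giving
$$E_{\hat n} = \sum_{m=\hat n-L+1}^{\hat n} (x[m])^2$$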

Short-Time Energy using RW/HW
[Figure: $E_{\hat n}$ computed with rectangular and Hamming windows, L = 51, 101, 201, 401]
- as L increases, the plots tend to converge (however you are smoothing sound energies)
- short-time energy provides the basis for distinguishing voiced from unvoiced speech regions, and for medium-to-high SNR recordings, can even be used to find regions of silence/background signal

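Short-Time Energy for AGC
Can use an IIR filter to define short-time energy, e.g.,
- time-dependent energy definition
$$\sigma^2[n] = \sum_{m=-\infty}^{\infty} x^2[m]\, h[n-m] \Big/ \sum_{m=0}^{\infty} h[m]$$
- consider impulse response of filter of form
$$h[n] = \alpha^{n-1} u[n-1] = \begin{cases} \alpha^{n-1}, & n \ge 1 \\ 0, & n < 1 \end{cases}$$
$$\sigma^2[n] = \sum_{m=-\infty}^{\infty} (1-\alpha)\, x^2[m]\, \alpha^{n-m-1}\, u[n-m-1]$$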

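Recursive Short-Time Energy
u[n-m-1] implies the condition n-m-1 ≥ 0 or m ≤ n-1, giving
$$\sigma^2[n] = \sum_{m=-\infty}^{n-1} (1-\alpha)\, x^2[m]\, \alpha^{n-m-1} = (1-\alpha)\left( x^2[n-1] + \alpha x^2[n-2] + \dots \right)$$
for the index n-1 we have
$$\sigma^2[n-1] = \sum_{m=-\infty}^{n-2} (1-\alpha)\, x^2[m]\, \alpha^{n-m-2} = (1-\alpha)\left( x^2[n-2] + \alpha x^2[n-3] + \dots \right)$$
thus giving the relationship
$$\sigma^2[n] = \alpha\, \sigma^2[n-1] + x^2[n-1]\,(1-\alpha)$$
and defines an Automatic Gain Control (AGC) of the form
$$G[n] = \frac{G_0}{\sigma[n]}$$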
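The recursion and the AGC gain can be sketched as follows. The function names, the zero initial condition, and the small `floor` that keeps the gain finite during silence are all assumptions of this example; the slides give only the recursion and G[n] = G_0/σ[n].

```python
import math


def recursive_energy(x, alpha, sigma2_init=0.0):
    """sigma2[n] = alpha*sigma2[n-1] + (1-alpha)*x[n-1]^2, with x[-1] taken as 0."""
    sigma2 = []
    prev = sigma2_init
    for n in range(len(x)):
        x_prev = x[n - 1] if n >= 1 else 0.0      # note the one-sample delay
        cur = alpha * prev + (1.0 - alpha) * x_prev * x_prev
        sigma2.append(cur)
        prev = cur
    return sigma2


def agc_gain(sigma2, g0=1.0, floor=1e-8):
    """G[n] = G0 / sigma[n]; 'floor' is an added safeguard against division by zero."""
    return [g0 / math.sqrt(max(s, floor)) for s in sigma2]


if __name__ == "__main__":
    s2 = recursive_energy([1.0, 1.0, 1.0, 1.0], alpha=0.5)
    print(s2)
    print(agc_gain(s2))
```

For a constant-amplitude input the estimate climbs toward x^2 at a rate set by α; values of α closer to 1 give a smoother but slower estimate.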

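Recursive Short-Time Energy
$$\sigma^2[n] = x^2[n] * h[n], \qquad h[n] = (1-\alpha)\,\alpha^{n-1} u[n-1]$$
$$\sigma^2(z) = X^2(z)\, H(z)$$
$$H(z) = \sum_{n=-\infty}^{\infty} h[n]\, z^{-n} = \sum_{n=-\infty}^{\infty} (1-\alpha)\,\alpha^{n-1} u[n-1]\, z^{-n} = \sum_{n=1}^{\infty} (1-\alpha)\,\alpha^{n-1} z^{-n}$$
with m = n-1:
$$H(z) = \sum_{m=0}^{\infty} (1-\alpha)\,\alpha^{m} z^{-(m+1)} = (1-\alpha)\, z^{-1} \sum_{m=0}^{\infty} \alpha^{m} z^{-m} = \frac{(1-\alpha)\, z^{-1}}{1 - \alpha z^{-1}} = \sigma^2(z) / X^2(z)$$
$$\sigma^2[n] = \alpha\,\sigma^2[n-1] + (1-\alpha)\, x^2[n-1]$$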

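Recursive Short-Time Energy
[Figure: block diagram. x[n] → ( )^2 → x^2[n] → z^{-1} → gain (1-α) → adder → σ^2[n]; feedback path σ^2[n-1] through z^{-1} and gain α]
$$\sigma^2[n] = \alpha\,\sigma^2[n-1] + x^2[n-1]\,(1-\alpha)$$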

Recursive Short-Time Energy
[Figure]

Use of Short-Time Energy for AGC
[Figure]

Use of Short-Time Energy for AGC
[Figure: AGC results for α = 0.9 and α = 0.99]

Short-Time Magnitude
- short-time energy is very sensitive to large signal levels due to x^2[n] terms
  – consider a new definition of ‘pseudo-energy’ based on average signal magnitude (rather than energy)
$$M_{\hat n} = \sum_{m=-\infty}^{\infty} |x[m]|\, \tilde w[\hat n-m]$$
  – weighted sum of magnitudes, rather than weighted sum of squares
[Figure: x[n] (rate F_S) → | | → |x[n]| (rate F_S) → $\tilde w[n]$ → $M_{\hat n} = M_n|_{n=\hat n}$ (rate F_S/R)]
- computation avoids multiplications of signal with itself (the squared term)
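A minimal sketch of the short-time magnitude for an arbitrary finite window follows. The function name `short_time_magnitude`, the convention that `window[k]` weights x[n̂ - k], and starting at n̂ = L-1 are assumptions of this example.

```python
def short_time_magnitude(x, window, R=1):
    """M[n_hat] = sum over k of |x[n_hat - k]| * window[k], for n_hat = L-1, L-1+R, ..."""
    L = len(window)
    magnitudes = []
    for n_hat in range(L - 1, len(x), R):
        # window[k] plays the role of w~[n_hat - m] with m = n_hat - k
        magnitudes.append(sum(abs(x[n_hat - k]) * window[k] for k in range(L)))
    return magnitudes


if __name__ == "__main__":
    print(short_time_magnitude([1, -2, 3, -4], window=[1, 1]))
```

Because magnitudes rather than squares are summed, a single large sample affects the result linearly instead of quadratically.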

Short-Time Magnitudes
[Figure: $M_{\hat n}$ for L = 51, 101, 201, 401]
- differences between E_n and M_n noticeable in unvoiced regions
- dynamic range of M_n ≈ square root of (dynamic range of E_n), so level differences between voiced and unvoiced segments are smaller
- E_n and M_n can be sampled at a rate of 100/sec for window durations of 20 msec or so, giving an efficient representation of signal energy/magnitude

Short Time Energy and Magnitude: Rectangular Window
[Figure: $E_{\hat n}$ and $M_{\hat n}$ for L = 51, 101, 201, 401]

Short Time Energy and Magnitude: Hamming Window
[Figure: $E_{\hat n}$ and $M_{\hat n}$ for L = 51, 101, 201, 401]

Other Lowpass Windows
- can replace RW or HW with any lowpass filter
- window should be positive since this guarantees E_n and M_n will be positive
- FIR windows are efficient computationally since they can slide by R samples for efficiency with no loss of information (what should R be?)
- can even use an infinite duration window if its z-transform is a rational function, i.e.,
$$h[n] = \begin{cases} a^n, & n \ge 0,\ 0 < a < 1 \\ 0, & n < 0 \end{cases} \qquad H(z) = \frac{1}{1 - a z^{-1}}, \quad |z| > |a|$$

Other Lowpass Windows
- this simple lowpass filter can be used to implement E_n and M_n recursively as:
$$E_n = a\, E_{n-1} + (1-a)\, x^2[n] \quad \text{(short-time energy)}$$
$$M_n = a\, M_{n-1} + (1-a)\, |x[n]| \quad \text{(short-time magnitude)}$$
- need to compute E_n or M_n every sample and then down-sample to 100/sec rate
- recursive computation has a non-linear phase, so delay cannot be compensated exactly
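The two recursions, followed by down-sampling, can be sketched as below. The function name, the zero initial state, and the parameter `step` (keep every `step`-th value, e.g. F_S/100 for a 100/sec output rate) are assumptions of this example.

```python
def recursive_energy_and_magnitude(x, a, step):
    """Run E_n = a*E_{n-1} + (1-a)*x[n]^2 and M_n = a*M_{n-1} + (1-a)*|x[n]|
    on every sample, then keep every 'step'-th value (down-sampling)."""
    energy = 0.0
    magnitude = 0.0
    energy_out, magnitude_out = [], []
    for n, v in enumerate(x):
        energy = a * energy + (1.0 - a) * v * v
        magnitude = a * magnitude + (1.0 - a) * abs(v)
        if n % step == step - 1:                  # emit once per 'step' samples
            energy_out.append(energy)
            magnitude_out.append(magnitude)
    return energy_out, magnitude_out


if __name__ == "__main__":
    e, m = recursive_energy_and_magnitude([2.0, 2.0, 2.0, 2.0], a=0.5, step=2)
    print(e, m)
```

The state must be updated on every input sample even though only one value per `step` samples is kept, which is the cost the slide points out relative to an FIR window that can simply jump by R samples.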

Short-Time Average ZC Rate
- zero crossing: successive samples have different algebraic signs
[Figure: waveform with zero crossings marked]
- zero crossing rate is a simple measure of the ‘frequency content’ of a signal, especially true for narrowband signals (e.g., sinusoids)
- sinusoid at frequency F_0 with sampling rate F_S has F_S/F_0 samples per cycle with two zero crossings per cycle, giving an average zero crossing rate of
  z_1 = (2) crossings/cycle × (F_0/F_S) cycles/sample
  z_1 = 2F_0/F_S crossings/sample (i.e., z_1 proportional to F_0)
  z_M = M (2F_0/F_S) crossings/(M samples)

Sinusoid Zero Crossing Rates
Assume the sampling rate is F_S = 10,000 Hz
1. F_0 = 100 Hz sinusoid has F_S/F_0 = 10,000/100 = 100 samples/cycle; or z_1 = 2/100 crossings/sample, or z_100 = (2/100) × 100 = 2 crossings/10 msec interval
2. F_0 = 1000 Hz sinusoid has F_S/F_0 = 10,000/1000 = 10 samples/cycle; or z_1 = 2/10 crossings/sample, or z_100 = (2/10) × 100 = 20 crossings/10 msec interval
3. F_0 = 5000 Hz sinusoid has F_S/F_0 = 10,000/5000 = 2 samples/cycle; or z_1 = 2/2 crossings/sample, or z_100 = (2/2) × 100 = 100 crossings/10 msec interval
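The three cases above can be reproduced from z_M = M (2F_0/F_S), and the formula can be compared with a direct count on a synthesized sinusoid. The helper names and the small phase offset (used so that no sample lands exactly on zero) are assumptions of this sketch.

```python
import math


def expected_zero_crossings(f0_hz, fs_hz, num_samples):
    """z_M = M * 2*F0/FS crossings per interval of M samples."""
    return num_samples * 2.0 * f0_hz / fs_hz


def sinusoid(f0_hz, fs_hz, num_samples, phase=0.1):
    """Sampled sinusoid; the phase offset keeps samples away from exact zeros."""
    return [math.sin(2.0 * math.pi * f0_hz * n / fs_hz + phase)
            for n in range(num_samples)]


def count_zero_crossings(x):
    """Count adjacent sample pairs whose algebraic signs differ."""
    def sgn(v):
        return 1 if v >= 0 else -1
    return sum(1 for a, b in zip(x, x[1:]) if sgn(a) != sgn(b))


if __name__ == "__main__":
    fs = 10000.0
    for f0 in (100.0, 1000.0, 5000.0):
        print(f0, "Hz:", expected_zero_crossings(f0, fs, 100), "per 10 msec")
    one_second = sinusoid(100.0, fs, 10000)
    print("counted in 1 s:", count_zero_crossings(one_second))
```

Over short intervals the counted value can differ from the formula by one crossing, depending on where the interval starts relative to the waveform.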

Zero Crossing for Sinusoids
[Figure: two panels over about 200 samples. Top: 100 Hz sinewave, ZC = 9. Bottom: 100 Hz sinewave with dc offset of 0.75, ZC = 8]

Zero Crossings for Noise
[Figure: two panels. Top: random gaussian noise, ZC = 252. Bottom: random gaussian noise with dc offset of 0.75, ZC = 122]

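ZC Rate Definitions
$$Z_{\hat n} = \frac{1}{2 L_{\text{eff}}} \sum_{m=\hat n-L+1}^{\hat n} \left| \operatorname{sgn}(x[m]) - \operatorname{sgn}(x[m-1]) \right| \tilde w[\hat n-m]$$
$$\operatorname{sgn}(x[n]) = \begin{cases} 1, & x[n] \ge 0 \\ -1, & x[n] < 0 \end{cases}$$
simple rectangular window:
$$\tilde w[n] = \begin{cases} 1, & 0 \le n \le L-1 \\ 0, & \text{otherwise} \end{cases} \qquad L_{\text{eff}} = L$$
Same form for $Z_{\hat n}$ as for $E_{\hat n}$ or $M_{\hat n}$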

ZC Normalization
The formal definition of $Z_{\hat n}$ is:
$$Z_{\hat n} = z_1 = \frac{1}{2L} \sum_{m=\hat n-L+1}^{\hat n} \left| \operatorname{sgn}(x[m]) - \operatorname{sgn}(x[m-1]) \right|$$
which is interpreted as the number of zero crossings per sample.
For most practical applications, we need the rate of zero crossings per fixed interval of M samples, which is
z_M = z_1 · M = rate of zero crossings per M sample interval
Thus, for an interval of τ sec., corresponding to M samples, we get
z_M = z_1 · M; M = τ F_S = τ/T
- F_S = 10,000 Hz; T = 100 μsec; τ = 10 msec; M = 100 samples
- F_S = 8,000 Hz; T = 125 μsec; τ = 10 msec; M = 80 samples
- F_S = 16,000 Hz; T = 62.5 μsec; τ = 10 msec; M = 160 samples
Zero crossings/10 msec interval as a function of sampling rate

ZC Normalization
For a 1000 Hz sinewave as input, using a 40 msec window length (L), with various values of sampling rate (F_S), we get the following:

F_S (Hz)   L (samples)   z_1    M (samples)   z_M
8000       320           1/4    80            20
10000      400           1/5    100           20
16000      640           1/8    160           20

Thus we see that the normalized (per interval) zero crossing rate, z_M, is independent of the sampling rate and can be used as a measure of the dominant energy in a band.

ZC and Energy Computation
[Figure: two panels. Hamming window with duration L = 201 samples (12.5 msec at F_s = 16 kHz); Hamming window with duration L = 401 samples (25 msec at F_s = 16 kHz)]

ZC Rate Distributions
[Figure: distributions of zero crossing rate for unvoiced and voiced speech, with frequency marks at 1 kHz, 2 kHz, 3 kHz, 4 kHz]
- Unvoiced speech: the dominant energy component is at about 2.5 kHz
- Voiced speech: the dominant energy component is at about 700 Hz
- for voiced speech, energy is mainly below 1.5 kHz
- for unvoiced speech, energy is mainly above 1.5 kHz
- mean ZC rate for unvoiced speech is 49 per 10 msec interval
- mean ZC rate for voiced speech is 14 per 10 msec interval

ZC Rates for Speech
[Figure]
- 15 msec windows
- 100/sec sampling rate on ZC computation

Short-Time Energy, Magnitude, ZC
[Figure]

Issues in ZC Rate Computation
- for zero crossing rate to be accurate, need zero DC in signal, so need to remove offsets, hum, noise; use bandpass filter to eliminate DC and hum
- can quantize the signal to 1-bit for computation of ZC rate
- can apply the concept of ZC rate to bandpass filtered speech to give a ‘crude’ spectral estimate in narrow bands of speech (kind of gives an estimate of the strongest frequency in each narrow band of speech)

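Summary of Simple Time Domain Measures
[Figure: s(n) → Linear Filter → x(n) → T[ ] → T(x[n]) → $\tilde w[n]$ → $Q_{\hat n}$]
$$Q_{\hat n} = \sum_{m=-\infty}^{\infty} T(x[m])\, \tilde w[\hat n-m]$$
1. Energy:
$$E_{\hat n} = \sum_{m=\hat n-L+1}^{\hat n} x^2[m]\, \tilde w[\hat n-m]$$
can downsample $E_{\hat n}$ at a rate commensurate with window bandwidth
2. Magnitude:
$$M_{\hat n} = \sum_{m=\hat n-L+1}^{\hat n} |x[m]|\, \tilde w[\hat n-m]$$
3. Zero Crossing Rate:
$$Z_{\hat n} = z_1 = \frac{1}{2L} \sum_{m=\hat n-L+1}^{\hat n} \left| \operatorname{sgn}(x[m]) - \operatorname{sgn}(x[m-1]) \right| \tilde w[\hat n-m]$$
where sgn(x[m]) = 1 for x[m] ≥ 0 and -1 for x[m] < 0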

Short-Time Autocorrelation
- for a deterministic signal, the autocorrelation function is defined as:
$$\Phi[k] = \sum_{m=-\infty}^{\infty} x[m]\, x[m+k]$$
- for a random or periodic signal, the autocorrelation function is:
$$\Phi[k] = \lim_{L \to \infty} \frac{1}{2L+1} \sum_{m=-L}^{L} x[m]\, x[m+k]$$
- if x[n] = x[n+P], then Φ[k] = Φ[k+P]: the autocorrelation function preserves periodicity
- properties of Φ[k]:
1. Φ[k] is even, Φ[k] = Φ[-k]
2. Φ[k] is maximum at k = 0, |Φ[k]| ≤ Φ[0], for all k
3. Φ[0] is the signal energy or power (for random signals)

Periodic Signals
- for a periodic signal we have (at least in theory) Φ[P] = Φ[0], so the period of a periodic signal can be estimated as the first non-zero maximum of Φ[k]
  – this means that the autocorrelation function is a good candidate for speech pitch detection algorithms
  – it also means that we need a good way of measuring the short-time autocorrelation function for speech signals

Short-Time Autocorrelation
- a reasonable definition for the short-time autocorrelation is:
$$R_{\hat n}[k] = \sum_{m=-\infty}^{\infty} x[m]\, \tilde w[\hat n-m]\; x[m+k]\, \tilde w[\hat n-k-m]$$
1. select a segment of speech by windowing
2. compute deterministic autocorrelation of the windowed speech
- symmetry:
$$R_{\hat n}[k] = R_{\hat n}[-k] = \sum_{m=-\infty}^{\infty} x[m]\, x[m-k] \left[ \tilde w[\hat n-m]\, \tilde w[\hat n+k-m] \right]$$
- define filter of the form
$$\tilde w_k[\hat n] = \tilde w[\hat n]\, \tilde w[\hat n+k]$$
- this enables us to write the short-time autocorrelation in the form:
$$R_{\hat n}[k] = \sum_{m=-\infty}^{\infty} x[m]\, x[m-k]\, \tilde w_k[\hat n-m]$$
- the value of $R_{\hat n}[k]$ at time $\hat n$ for the k-th lag is obtained by filtering the sequence $x[\hat n]\, x[\hat n-k]$ with a filter with impulse response $\tilde w_k[\hat n]$

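Short-Time Autocorrelation
$$R_{\hat n}[k] = \sum_{m=-\infty}^{\infty} x[m]\, \tilde w[\hat n-m]\; x[m+k]\, \tilde w[\hat n-k-m]$$
[Figure: overlapping windows for lag k, with edges at n-L+1 and n+k-L+1]
- L points used to compute $R_{\hat n}[0]$
- L - |k| points used to compute $R_{\hat n}[k]$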
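The two-step recipe (window a segment, then compute its deterministic autocorrelation) can be sketched as follows. The function names are hypothetical, the window defaults to rectangular, and `estimate_period` uses the largest value over the searched lags as a simplification of "first non-zero maximum"; it is not a complete pitch detector.

```python
def short_time_autocorrelation(x, n_hat, L, max_lag, window=None):
    """R[k], k = 0..max_lag, for the L samples ending at n_hat (needs n_hat >= L-1)."""
    if window is None:
        window = [1.0] * L                        # rectangular window by default
    start = n_hat - L + 1
    # Step 1: windowed segment; sample m = start + i is weighted by w[n_hat - m]
    seg = [x[start + i] * window[L - 1 - i] for i in range(L)]
    # Step 2: deterministic autocorrelation; lag k uses only L - k products
    return [sum(seg[i] * seg[i + k] for i in range(L - k))
            for k in range(max_lag + 1)]


def estimate_period(r, min_lag=1):
    """Lag of the largest autocorrelation value at or beyond min_lag (a simplification)."""
    return max(range(min_lag, len(r)), key=lambda k: r[k])


if __name__ == "__main__":
    x = [1.0, 0.0, -1.0, 0.0] * 8                 # periodic with P = 4 samples
    r = short_time_autocorrelation(x, n_hat=31, L=32, max_lag=10)
    print(r[:9], "estimated period:", estimate_period(r))
```

In this example R[4]/R[0] = 14/16 = 1 - 4/32, so even a perfectly periodic segment gives Φ(P) < Φ(0) once it is windowed, because fewer products enter the sum at larger lags.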

Short-Time Autocorrelation
[Figure]

Examples of Autocorrelations
[Figure: autocorrelation examples for voiced and unvoiced speech]
- autocorrelation peaks occur at k = 72, 144, ..., corresponding to 140 Hz pitch
- Φ(P) < Φ(0) since windowed speech is not perfectly periodic
- over a 401 sample window (40 msec of signal), pitch period changes occur, so P is not perfectly defined
- much less clear estimates of periodicity since HW tapers signal so strongly, making it look like a non-periodic signal
- no strong peak for unvoiced speech

Voiced (female), L = 401 (magnitude)
[Figure. Recoverable labels: $T_0 = N_0 T$, $t = nT$, $F_0 = 1/T_0$, $F/F_s$]

Voiced (female), L = 401 (log mag)
[Figure. Recoverable labels: $T_0$, $t = nT$, $F_0 = 1/T_0$, $F/F_s$]

Voiced (male), L = 401
[Figure. Recoverable labels: $T_0$, $3F_0 = 3/T_0$]

Unvoiced, L = 401
[Figure]

Unvoiced, L = 401
[Figure]

Effects of Window Size
[Figure: autocorrelations for L = 401, L = 251, L = 125]
- choice of L, window duration
  – small L so pitch period almost constant in window
  – large L so clear periodicity seen in window
- as k increases, the number of window points decreases, reducing the accuracy and size of R_n(k) for large k; this gives a taper of the type R(k) = 1 - k/L, |k| < L, shaping the autocorrelation (this is the autocorrelation of a size L rectangular window)
- allow L to vary with detected pitch periods (so that at least 2 full periods are included)
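The taper R(k) = 1 - k/L can be confirmed by autocorrelating the rectangular window itself and normalizing by the value at lag 0. The function name `window_autocorrelation` is an assumption of this sketch.

```python
def window_autocorrelation(window):
    """Autocorrelation of the window at lags 0..L-1, normalized so lag 0 equals 1."""
    L = len(window)
    r = [sum(window[i] * window[i + k] for i in range(L - k)) for k in range(L)]
    return [v / r[0] for v in r]


if __name__ == "__main__":
    L = 8
    taper = window_autocorrelation([1.0] * L)
    for k, value in enumerate(taper):
        print(k, value, 1.0 - k / L)              # the two columns agree
```

The short-time autocorrelation of a perfectly periodic signal is shaped by this same triangular envelope, which is one reason its peaks shrink as the lag grows.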

Modified Autocorrelation want to solve problem of differing n

