
A spectral/temporal method for robust fundamental frequency tracking

Stephen A. Zahorian a) and Hongbing Hu
Department of Electrical and Computer Engineering, State University of New York at Binghamton, Binghamton, New York 13902, USA

(Received 14 December 2006; revised 2 April 2008; accepted 7 April 2008)

In this paper, a fundamental frequency (F0) tracking algorithm is presented that is extremely robust for both high quality and telephone speech, at signal to noise ratios ranging from clean speech to very noisy speech. The algorithm is named "YAAPT," for "yet another algorithm for pitch tracking." The algorithm is based on a combination of time domain processing, using the normalized cross correlation, and frequency domain processing. Major steps include processing of the original acoustic signal and a nonlinearly processed version of the signal, the use of a new method for computing a modified autocorrelation function that incorporates information from multiple spectral harmonic peaks, peak picking to select multiple F0 candidates and associated figures of merit, and extensive use of dynamic programming to find the "best" track among the multiple F0 candidates. The algorithm was evaluated by using three databases and compared to three other published F0 tracking algorithms by using both high quality and telephone speech for various noise conditions. For clean speech, the error rates obtained are comparable to those obtained with the best results reported for any other algorithm; for noisy telephone speech, the error rates obtained are lower than those obtained with other methods. © 2008 Acoustical Society of America. [DOI: 10.1121/1.2916590]

PACS number(s): 43.72.Ar, 43.72.Dv [DOS]

J. Acoust. Soc. Am. 123 (6), June 2008. Pages: 4559–4571
0001-4966/2008/123(6)/4559/13/$23.00

a) Author to whom correspondence should be addressed. Tel.: (607) 777-4846. FAX: (607) 777-4464. Electronic mail: zahorian@binghamton.edu.

I. INTRODUCTION

Numerous studies show the importance of prosody for human speech recognition, but only a few automatic systems actually combine and use fundamental frequency (F0)¹ with other acoustic features in the recognition process to significantly increase the performance of automatic speech recognition (ASR) systems (Ostendorf and Ross, 1997; Shriberg et al., 1997; Ramana and Srichland, 1996; Wang and Seneff, 2000; Bagshaw et al., 1993). F0 tracking is especially important for ASR in tonal languages, such as Mandarin speech, for which pitch patterns are phonemically important (Wang and Seneff, 1998; Chang et al., 2000). Other applications for accurate F0 tracking include devices for speech analysis, transmission, synthesis, speaker recognition, speech articulation training aids for the deaf (Zahorian et al., 1998), and foreign language training. Despite decades of research, automatic F0 tracking is still not adequate for routine applications in ASR or for scientific speech measurements.

An important consideration for any speech processing algorithm is performance using telephone speech, due to the many applications of ASR in this domain. However, since the fundamental frequency is often weak or missing for telephone speech and the signal is distorted, noisy, and degraded in quality overall, pitch detection for telephone speech is especially difficult (Wang and Seneff, 2000).

A number of pitch detection algorithms based on time domain and frequency domain methods have been reported, with varying degrees of accuracy (Talkin, 1995; Liu and Lin, 2001; Boersma and Weenink, 2005; de Cheveigne and Kawahara, 2002; Nakatani and Irino, 2004). Many studies have compared the robustness of pitch tracking for a variety of speech conditions (Rabiner et al., 1976; Mousset et al., 1996; Parsa and Jamieson, 1999). However, robust pitch tracking methods, which can easily be integrated with other speech processing steps in ASR, are not widely available. To make available a public domain algorithm for accurate and robust pitch tracking, the methods presented in this paper were developed.

A key component in "yet another algorithm for pitch tracking" (YAAPT) is the normalized cross correlation function (NCCF) as used in the "robust algorithm for pitch tracking" (RAPT) (Talkin, 1995). However, in early pilot testing, the NCCF alone did not reliably give good F0 tracks, especially for noisy and/or telephone speech. Frequently, the NCCF method alone resulted in gross F0 errors (especially F0 doubling for telephone speech) that could easily be spotted by overlaying obtained F0 tracks with the low frequency part of a spectrogram. YAAPT is the result of efforts to incorporate this observation in a formal algorithm.

In this paper, we describe methods for enhancing and extracting spectrographic information and combining it with F0 estimates from correlation methods to create a more robust overall F0 track. Another innovation is to separately compute F0 candidates from both the original speech signal and a nonlinearly processed version of the signal and then to find the "lowest cost" track among the candidates by using dynamic programming. The basic elements of YAAPT were first given in the work of Kasi and Zahorian (2002) and modifications were described in the work of Zahorian et al. (2006). In this paper, we give a comprehensive description of the complete algorithm and extensive formal evaluation results.

II. THE ALGORITHM

A. Algorithm overview

The F0 tracking algorithm presented in this paper performs F0 tracking in both the time domain and frequency domain. As summarized in the flow chart in Fig. 1, the algorithm can be loosely divided into four main steps:

(1) Preprocessing: Multiple versions of the signal are created via nonlinear processing (Sec. II B).
(2) F0 track calculation from the spectrogram of the nonlinearly processed signal: An approximate F0 track is estimated by using a spectral harmonics correlation (SHC) technique and dynamic programming. The normalized low frequency energy ratio (NLFER) is also computed from the spectrogram as an aid for F0 tracking (Sec. II C).
(3) F0 candidate estimation based on the NCCF: Candidates are extracted from both the original and nonlinearly processed signals with further candidate refinement based on the spectral F0 track estimated in step 2 (Sec. II D).
(4) Final F0 determination: Dynamic programming is applied to the information from steps 2 and 3 to arrive at a final F0 track, including voiced/unvoiced decisions (Sec. II E).

FIG. 1. (Color online) Flow chart of YAAPT. Numbers in parentheses correspond to the steps listed in Sec. II A.

The algorithm incorporates several experimentally determined parameters, such as F0 search ranges, thresholds for peak picking, filter bandwidths, and dynamic programming weights. These parameters are listed in Table I along with values used for experimental results reported in this paper. Similarly, to aid in the explanation of the algorithm and the error measures used for evaluation, primary variables used in this paper are given in Table II. The algorithm is frame based, using overlapping frames with frame lengths and frame spacings as given in Table I.

B. Preprocessing

Preprocessing consists of creating multiple versions of the signal, as shown in the block diagram of Fig. 1. The key idea is to create two versions of the signal: bandpass filtered versions of both the original and nonlinearly processed signals. The bandwidths (50–1500 Hz) and orders (150 points) of the bandpass finite impulse response (FIR) filters were empirically determined by inspection of many signals in time and frequency and also by overall F0 tracking accuracy. These two signals are then independently processed to obtain F0 candidates by using the time domain NCCF algorithm, as discussed in Sec. II D.

1. Nonlinear processing

Nonlinear processing of a signal creates sum and difference frequencies, which can be used to partially restore a missing fundamental. Two types of nonlinear processing, the absolute value of the signal and the squared value of the signal, were considered. Since experimental evaluations indicated slightly better F0 tracking accuracy by using the squared value, the squared value was used for the primary experimental results reported in this paper. The general idea of using nonlinearities such as center clipping to emphasize F0 has long been known (see the work of Hess, 1983 for an extensive discussion) but appears not to be used in most of the pitch detectors developed since about 1990. For example, the pitch detectors YIN (de Cheveigne and Kawahara, 2002) and DASH (Nakatani and Irino, 2004) do not make use of nonlinearities. Of the seven pitch detectors evaluated by Parsa and Jamieson (1999), only one used a nonlinearity (center clipping). Most previous use of nonlinearities in F0 detection algorithms was aimed at spectral flattening or reducing formant strength, rather than restoring a missing fundamental (for example, the work of Rabiner and Schafer, 1978).

As shown in the work of Zahorian et al. (2006), the fundamental frequency (F0) reappears when a signal in which the fundamental is either very weak or absent, such as telephone speech, is squared. The restoration of the fundamental by the squaring operation is also illustrated by the spectrograms in Fig. 2. The top panel depicts the spectrogram of a studio quality version of a speech signal, for which the fundamental frequency is clearly apparent. The middle panel shows the spectrogram of the telephone version of the same speech sample, for which the fundamental frequency below 200 Hz is largely missing. In contrast, the fundamental frequency is more clearly apparent in the spectrogram of the nonlinearly processed telephone signal shown in the bottom panel. A bandpass filter (50–1500 Hz) was used after the nonlinearity to reduce the magnitude of the dc component. This same effect was observed for many other examples.
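The restoration of a missing fundamental by squaring can be checked on a synthetic signal. The sketch below is illustrative only and is not the authors' implementation: the helper names `bandpass_fir` and `magnitude_at`, the windowed-sinc design method, the 8 kHz sampling rate, and the 150 Hz test fundamental are all assumptions; only the squaring operation, the 50–1500 Hz passband, and the filter order of 150 come from the text.

```python
import numpy as np


def bandpass_fir(low_hz, high_hz, fs, order=150):
    """Windowed-sinc bandpass FIR with order + 1 taps (Hamming window).

    Built as the difference of two ideal lowpass responses. The design
    method is an assumption; the text gives only the band and the order.
    """
    n = np.arange(order + 1) - order / 2.0

    def lowpass(cutoff_hz):
        fc = cutoff_hz / fs  # cutoff in cycles per sample
        return 2.0 * fc * np.sinc(2.0 * fc * n)

    return (lowpass(high_hz) - lowpass(low_hz)) * np.hamming(order + 1)


def magnitude_at(signal, fs, freq_hz):
    """Magnitude of the DFT bin nearest to freq_hz."""
    spectrum = np.abs(np.fft.rfft(signal))
    return spectrum[int(round(freq_hz * len(signal) / fs))]


fs = 8000.0                  # assumed sampling rate (Hz)
f0 = 150.0                   # assumed fundamental, absent from the input
t = np.arange(int(fs)) / fs  # one second of samples

# Harmonics 2..5 only: a crude stand-in for speech with a missing fundamental.
no_fundamental = sum(np.cos(2.0 * np.pi * k * f0 * t) for k in range(2, 6))

# Squaring creates difference frequencies between adjacent harmonics, i.e. F0.
squared = no_fundamental ** 2

# Bandpass 50-1500 Hz, order 150, as stated in the text.
taps = bandpass_fir(50.0, 1500.0, fs, order=150)
restored = np.convolve(squared, taps, mode="same")

f0_before = magnitude_at(no_fundamental, fs, f0)
f0_after = magnitude_at(restored, fs, f0)
dc_before = magnitude_at(squared, fs, 0.0)
dc_after = magnitude_at(restored, fs, 0.0)

print(f"|X(F0)| before squaring: {f0_before:.3g}, after: {f0_after:.3g}")
print(f"dc magnitude before filtering: {dc_before:.3g}, after: {dc_after:.3g}")
```

In this toy case the component at F0 comes from the products of adjacent harmonics, and the bandpass filter attenuates, without fully removing, the dc term that squaring introduces.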

TABLE I. Primary parameters used to configure YAAPT. Value 1 numbers are used to minimize gross errors; value 2 numbers are used to minimize big errors. (Several parameter names and the Value 1 and Value 2 columns did not survive extraction; "(lost)" marks an unrecoverable name, and the values shown are those stated in the text.)

Parameter | Meaning | Value stated in text
F0_min | Minimum F0 searched (Hz) |
F0_max | Maximum F0 searched (Hz) |
(lost) | Length of each analysis frame (ms) | 25 and 35
(lost) | Spacing between analysis frames (ms) |
(lost) | FFT length | 8192
(lost) | Low frequency of bandpass filter passband (Hz) | 50
(lost) | High frequency of bandpass filter passband (Hz) | 1500
(lost) | Order of bandpass filter | 150
(lost) | Maximum number of F0 candidates per frame |
NLFER_Thresh1 | NLFER boundary for voiced/unvoiced decisions, used in spectral F0 tracking |
NLFER_Thresh2 | Threshold for definitely unvoiced using NLFER |
NH | Number of harmonics in SHC calculation | 3
WL | SHC window length (Hz) | 40 (approximately)
SHC_thresh | Threshold for SHC peak picking |
F0_mid | F0 doubling/halving decision threshold (Hz) |
(lost) | Threshold for considering a peak in NCCF |
(lost) | Threshold for terminating search in NCCF |
(lost) | Merit assigned to extra candidates in reducing F0 doubling and halving logic |
(lost; ends in "pivot") | Merit assigned to unvoiced candidates in definitely unvoiced frames |
W1 | DP weight factor for V-V transitions |
W2 | DP weight factor for V-UV or UV-V transitions |
W3 | DP weight factor for UV-UV transitions |
W4 | Overall weight factor for local costs relative to transition costs |

TABLE II. Variables used in YAAPT and for evaluation of F0 tracking.

Variable | Meaning
s | Speech signal in a frame
S | Magnitude spectrum of speech signal
n | Time sample index within a frame
t | Time in terms of frame index
f | Frequency in Hz
k | Lag index used in NCCF calculations
i, j | Indices used for F0 candidates within a frame
T | Number of signal frames
SHC | Spectral harmonics correlation
F0_spec | Spectral F0 track, all voiced
F0_avg | Average of spectral F0 track
F0_std | Standard deviation of F0 computed from spectral F0 track
NLFER | Normalized low frequency energy ratio
merit | Figure of merit for an F0 candidate, on a scale of 0 to 1
NCCF | Normalized cross correlation function
K_min | Longest lag evaluated for each frame
K_max | Shortest lag evaluated for each frame
F0_mean | Arithmetic average over all frames of the highest merit nonzero F0 candidates for each frame
BP | Back pointer array used in dynamic programming
G_err | Error rate based on large errors in all frames where the reference indicates voiced speech
B_err | All large errors, including those in G_err and errors from UV to V

FIG. 2. (Color online) Illustration of the effects of nonlinear processing of the speech signal. The spectrogram of a studio quality speech signal is shown in the top panel, the spectrogram of the telephone version of the signal is shown in the middle panel, and the spectrogram of the squared telephone signal is shown in the bottom panel. (Axes: Time (Seconds) versus Frequency (Hz).)

C. Spectrally based F0 track

One of the key features of YAAPT is the use of spectral information to guide F0 tracking. Spectral F0 tracks can be derived by using the spectral peaks which occur at the fundamental frequency and its harmonics. In this paper, it is experimentally shown that the F0 track obtained from the spectrogram is useful for refining the F0 candidates estimated from the acoustic waveform, especially in the case of noisy telephone speech. The spectral F0 track is computed by using the nonlinearly processed speech only.

The initial motivation for exploring the use of spectral F0 tracks was that the examination of the low frequency parts of spectrograms revealed clear but smoothed F0 tracks, even for noisy speech. The resolution of the spectral F0 track depends on the frequency resolution of the spectral analysis, which, in turn, depends on both the frame length and fast Fourier transform (FFT) length used for spectral analysis. For the work reported in this paper, the values of these parameters are listed in Table I. Note that the frame lengths used (25 and 35 ms) are typical of those used in many speech processing applications. The FFT length of 8192 was chosen so that the spectrum was sampled at 2.44 Hz for a sampling rate of 20 kHz, the highest rate used for speech data evaluated in experiments reported in this paper. We hypothesized that this smoothed track could be used to guide the NCCF processing but that the NCCF processing, with a high inherent time resolution of one sampling interval, would give more accurate F0 estimates. Ultimately, experimental evaluation is needed to check the accuracy of spectral F0 tracking, versus NCCF-based tracking, versus a combined approach.

1. Spectral harmonics correlation

One way of determining the F0 from the spectrum is to first locate the spectral peak at the fundamental frequency. This requires that the peak at the fundamental frequency be present and identifiable, which is often not the case, especially for noisy telephone speech. Although the nonlinear processing described in the previous section partially restores the fundamental, additional techniques are needed to obtain an even more noise robust F0 track. Therefore, a frequency domain autocorrelation type of function, which we call SHC, is used. This method is conceptually similar to the subharmonic summation method (Hermes, 1988) and the discrete logarithmic Fourier transform (Wang and Seneff, 2000), but the details are quite different.

The spectral harmonics correlation is defined to use multiple harmonics as follows:

$$\mathrm{SHC}(t, f) = \sum_{f' = -WL/2}^{WL/2} \; \prod_{r = 1}^{NH + 1} S(t, rf + f'),$$

where S(t, f) is the magnitude spectrum for frame t at frequency f, WL is the spectral window length in frequency, and NH is the number of harmonics. SHC(t, f) is then amplitude normalized so that the maximum value is 1.0 for each frame. f is a discrete variable with a spacing dependent on FFT length and sampling rate, as mentioned previously.

For each frequency f, SHC(t, f), thus, represents the extent to which the spectrum has high amplitude at integer multiples of that f. The use of a window in frequency, empirically determined to be approximately 40 Hz, makes the calculation less sensitive to noise, while still resulting in prominent peaks for SHC(t, f) at the fundamental frequency. The calculation is performed only for a limited search range (F0_min ≤ f ≤ F0_max, with F0_min and F0_max values as given in Table I). Experiments were conducted to determine the best value for the number of harmonics. Empirically, it appeared that NH = 3 resulted in the most prominent peaks in SHC(t, f) for voiced speech and, thus, was used for the results given in this paper.

Figure 3 shows the spectrum (top panel) and the spectral harmonics correlation function (bottom panel). Compared to the small peak at the fundamental frequency of around 220 Hz in the spectrum, a very prominent peak is observed in the spectral harmonics correlation function.

FIG. 3. (Color online) The peaks in the spectral harmonics correlation function. Compared to the small peak at the fundamental frequency of around 220 Hz in the spectrum (top), a very prominent peak is observed in the spectral harmonics correlation function (bottom).

2. Normalized low frequency energy ratio

Another primary use of spectral information in YAAPT is as an aid for making voicing decisions. The parameter used is referred to as the NLFER. The sum of spectral samples (the average energy per frame) over the low frequency regions is computed and then divided by the average low frequency energy per frame over the entire signal. In equation form, NLFER is given by

$$\mathrm{NLFER}(t) = \frac{\displaystyle\sum_{f = 2 \times F0\_min}^{F0\_max} S(t, f)}{\displaystyle\frac{1}{T} \sum_{t = 1}^{T} \; \sum_{f = 2 \times F0\_min}^{F0\_max} S(t, f)},$$

where T is the total number of frames, and the frequency range, based on F0_min and F0_max, was empirically chosen to correspond to the expected range of F0. S(t, f) is the spectrum of the signal for frame t and frequency f. Note that, with this definition, the average NLFER over all frames of an utterance is 1.0. In general, NLFER is high for voiced frames and low for unvoiced frames and, thus, NLFER is used as information for voiced/unvoiced decision making. In addition, NLFER is used to guide NCCF candidate selection (Sec. II D).

3. Selection of F0 spectral candidates and spectral F0 tracking

Beginning with the SHC as described above, F0 candidates were selected, concatenated, and smoothed by using the following empirically determined method and parameters. Values of the parameters used in experiments throughout this paper are listed in Table I.

(1) The frequency and amplitude of each SHC peak in each frame above threshold SHC_Thresh were selected as spectral F0 candidates and merits, respectively. For the example shown in Fig. 3, two F0 candidates were selected. If the merit of the highest merit F0 candidate is less than SHC_Thresh or if the NLFER is less than NLFER_Thresh1, the frame is considered unvoiced and not considered in the following steps.
(2) To reduce F0 doubling or halving for voiced frames (a persistent problem with pitch trackers, e.g., the work of Nakatani and Irino, 2004), an additional candidate is inserted at half the frequency of the highest merit candidate if all the candidates are above the F0 doubling/halving decision threshold F0_mid. Similarly, if all candidates are below F0_mid, an additional F0 candidate is inserted at twice the frequency of the highest ranking candidate. The merit of these
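The SHC and NLFER definitions of Sec. II C can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the published YAAPT code: the function names `shc_frame` and `nlfer`, the synthetic test frames, and the 60–400 Hz search range are hypothetical choices made here (the Table I values for F0_min and F0_max are not available in this transcription). NH = 3, WL = 40 Hz, the FFT length of 8192, and the 20 kHz sampling rate are taken from the text.

```python
import numpy as np


def shc_frame(mag, df, f0_min, f0_max, nh=3, wl_hz=40.0):
    """SHC for one frame, normalized to a maximum of 1.0.

    Computes the sum over f' in [-WL/2, WL/2] of the product over
    r = 1..NH+1 of S(r*f + f'). `mag` is a magnitude spectrum sampled
    every `df` Hz starting at 0 Hz; it must extend to at least
    (NH + 1) * f0_max + WL/2, and f0_min must exceed WL/2.
    """
    half = int(round(wl_hz / (2.0 * df)))
    offsets = np.arange(-half, half + 1)
    bins = np.arange(int(np.ceil(f0_min / df)), int(np.floor(f0_max / df)) + 1)
    values = np.empty(len(bins))
    for i, b in enumerate(bins):
        product = np.ones(len(offsets))
        for r in range(1, nh + 2):  # r = 1 .. NH + 1
            product *= mag[r * b + offsets]
        values[i] = product.sum()
    peak = values.max()
    if peak > 0.0:
        values = values / peak
    return bins * df, values


def nlfer(spectrogram, df, f0_min, f0_max):
    """NLFER per frame; `spectrogram` has shape (frames, frequency bins)."""
    freqs = np.arange(spectrogram.shape[1]) * df
    band = (freqs >= 2.0 * f0_min) & (freqs <= f0_max)
    low_energy = spectrogram[:, band].sum(axis=1)
    return low_energy / low_energy.mean()


fs, n_fft = 20000.0, 8192          # sampling rate and FFT length from the text
df = fs / n_fft                    # about 2.44 Hz per bin
frame_len = int(round(0.035 * fs)) # 35 ms frame
t = np.arange(frame_len) / fs
window = np.hamming(frame_len)

# Hypothetical voiced frame: 220 Hz harmonics with a weak fundamental.
amps = [0.1, 1.0, 1.0, 1.0]
voiced = sum(a * np.cos(2.0 * np.pi * 220.0 * (k + 1) * t)
             for k, a in enumerate(amps))
# Hypothetical unvoiced frame: low-level noise.
rng = np.random.default_rng(0)
unvoiced = 0.01 * rng.standard_normal(frame_len)

frames = np.stack([voiced, voiced, unvoiced, unvoiced])
spectrogram = np.abs(np.fft.rfft(frames * window, n=n_fft, axis=1))

f0_min, f0_max = 60.0, 400.0       # illustrative search range (assumption)
cand_freqs, shc = shc_frame(spectrogram[0], df, f0_min, f0_max)
f_peak = cand_freqs[np.argmax(shc)]
ratio = nlfer(spectrogram, df, f0_min, f0_max)

print(f"SHC peak at {f_peak:.1f} Hz")
print("NLFER per frame:", np.round(ratio, 2))
```

Because the product runs over several harmonics, a frame with a weak fundamental but strong higher harmonics still yields an SHC maximum near F0, while dividing by the utterance-wide mean makes the average NLFER equal 1.0 by construction.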
