Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Modulation Magnitude Estimator


Available online at www.sciencedirect.com
Speech Communication 54 (2012) 282–305
www.elsevier.com/locate/specom

Speech enhancement using a minimum mean-square error short-time spectral modulation magnitude estimator

Kuldip Paliwal, Belinda Schwerin, Kamil Wójcicki
Signal Processing Laboratory, Griffith School of Engineering, Griffith University, Nathan, QLD 4111, Australia

Received 15 December 2010; received in revised form 7 September 2011; accepted 14 September 2011; available online 24 September 2011

Abstract

In this paper we investigate the enhancement of speech by applying MMSE short-time spectral magnitude estimation in the modulation domain. For this purpose, the traditional analysis-modification-synthesis framework is extended to include modulation domain processing. We compensate the noisy modulation spectrum for additive noise distortion by applying the MMSE short-time spectral magnitude estimation algorithm in the modulation domain. A number of subjective experiments were conducted. Initially, we determine the parameter values that maximise the subjective quality of stimuli enhanced using the MMSE modulation magnitude estimator. Next, we compare the quality of stimuli processed by the MMSE modulation magnitude estimator to those processed using the MMSE acoustic magnitude estimator and the modulation spectral subtraction method, and show that a good improvement in speech quality is achieved through use of the proposed approach. Then we evaluate the effect of including speech presence uncertainty and log-domain processing on the quality of enhanced speech, and find that the method works better with speech presence uncertainty. Finally, we compare the quality of speech enhanced using the MMSE modulation magnitude estimator (when used with speech presence uncertainty) with that enhanced using different acoustic domain MMSE magnitude estimator formulations, and with that enhanced using different modulation domain based enhancement algorithms. Results of these tests show that the MMSE modulation magnitude estimator improves the quality of processed stimuli without introducing musical noise or spectral smearing distortion. The proposed method is shown to have better noise suppression than MMSE acoustic magnitude estimation, and improved speech quality compared to the other modulation domain based enhancement methods considered.

© 2011 Elsevier B.V. All rights reserved.

Keywords: Modulation domain; Analysis-modification-synthesis (AMS); Speech enhancement; MMSE short-time spectral magnitude estimator (AME); Modulation spectrum; Modulation magnitude spectrum; MMSE short-time modulation magnitude estimator (MME)

1. Introduction

Speech enhancement methods aim to improve the quality of noisy speech by reducing noise, while at the same time minimising any speech distortion introduced by the enhancement process. Many enhancement methods are based on the short-time Fourier analysis-modification-synthesis framework. Some examples of these are the spectral subtraction method (Boll, 1979), the Wiener filter method (Wiener, 1949), and the MMSE short-time spectral amplitude estimation method (Ephraim and Malah, 1984). (Corresponding author: B. Schwerin; tel. +61 7 3735 3754; fax +61 7 3735 5198; e-mail belinda.schwerin@griffithuni.edu.au.)

0167-6393 © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.specom.2011.09.003
Spectral subtraction is perhaps one of the earliest and most extensively studied methods for speech enhancement. This simple method enhances speech by subtracting a spectral estimate of the noise from the noisy speech spectrum, in either the magnitude or the energy domain. Though this method is effective at reducing noise, it suffers from the problem of musical noise distortion, which is very annoying to listeners.
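As an illustration of this baseline technique (not the authors' implementation), a minimal magnitude-domain spectral subtraction for a single analysis frame might look as follows; the function name and the spectral floor value are assumptions made for this sketch.

```python
import numpy as np

def spectral_subtraction_frame(noisy_frame, noise_mag, floor=0.002):
    """Magnitude-domain spectral subtraction for one windowed frame,
    keeping the noisy phase unchanged (illustrative sketch only)."""
    spectrum = np.fft.rfft(noisy_frame)
    mag, phase = np.abs(spectrum), np.angle(spectrum)
    # Flooring the rectified difference is what produces the isolated
    # spectral peaks heard as "musical noise".
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=len(noisy_frame))
```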

To overcome this problem, Ephraim and Malah (1984) proposed the MMSE short-time spectral amplitude estimator, referred to throughout this work as the acoustic magnitude estimator (AME). In the literature (e.g., Cappe, 1994; Scalart and Filho, 1996), it has been suggested that the good performance of the AME can be largely attributed to its use of the decision-directed approach for estimation of the a priori signal-to-noise ratio (a priori SNR). The AME method, even today, remains one of the most effective and popular methods for speech enhancement.

Recently, the modulation domain has become popular for speech processing. This has been due in part to the strong psychoacoustic and physiological evidence supporting the significance of the modulation domain for the analysis of speech signals. (A review of the significance of the modulation domain for human speech perception can be found in Atlas and Shamma (2003).) Zadeh (1950) was perhaps the first to propose a two-dimensional bi-frequency system, where the second dimension for frequency analysis was the transform of the time variation of the magnitudes at each standard (acoustic) frequency. More recently, Atlas et al. (2004) defined the acoustic frequency as the axis of the first short-time Fourier transform (STFT) of the input signal, and the modulation frequency as the independent variable of a second STFT.

Early efforts to utilise the modulation domain for speech enhancement assumed speech and noise to be stationary, and applied fixed filtering to the trajectories of the acoustic magnitude spectrum. For example, Hermansky et al. (1995) proposed band-pass filtering the time trajectories of the cubic-root compressed short-time power spectrum to enhance speech. Falk et al. (2007) and Lyons and Paliwal (2008) applied similar band-pass filtering to the time trajectories of the short-time magnitude (or power) spectrum for speech enhancement.

However, speech, and possibly noise, are known to be nonstationary. To capture this nonstationarity, one option is to assume speech to be quasi-stationary and to process the trajectories of the acoustic magnitude spectrum on a short-time basis. At this point it is useful to differentiate the acoustic spectrum from the modulation spectrum as follows. The acoustic spectrum is the STFT of the speech signal, while the modulation spectrum at a given acoustic frequency is the STFT of the time series of the acoustic spectral magnitudes at that frequency. The short-time modulation spectrum is thus a function of time, acoustic frequency and modulation frequency.

This type of short-time processing in the modulation domain has been used in the past for automatic speech recognition (ASR). Kingsbury et al. (1998), for example, applied a modulation spectrogram representation that emphasised low-frequency amplitude modulations to ASR for improved robustness in noisy and reverberant conditions. Tyagi et al. (2003) applied mel-cepstrum modulation features to ASR to give improved performance in the presence of non-stationary noise. Short-time modulation domain processing has also been applied to objective quality assessment. For example, Kim (2004, 2005) as well as Falk and Chan (2008) used the short-time modulation magnitude spectrum to derive objective measures that characterise the quality of processed speech.

For speech enhancement, short-time modulation domain processing was recently applied in the modulation spectral subtraction (ModSSub) method of Paliwal et al. (2010). Here, the spectral subtraction method was extended to the modulation domain, enhancing speech by subtracting the noise modulation energy spectrum from the noisy modulation energy spectrum within an analysis-modification-synthesis (AMS) framework.
In the ModSSub method, the frame duration used for computing the short-time modulation spectrum was found to be an important parameter, providing a trade-off between quality and the level of musical noise. Increasing the frame duration reduced musical noise, but introduced a slurring distortion. A somewhat long frame duration of 256 ms was recommended as a good compromise. The disadvantages of using a longer modulation domain analysis window are as follows. Firstly, we are assuming stationarity, which we know is not the case. Secondly, quite a long portion of signal is needed for the initial estimation of noise. Thirdly, as shown by Paliwal et al. (2011), speech quality and intelligibility are higher when the modulation magnitude spectrum is processed using short frame durations, and lower when processed using longer frame durations. For these reasons, we aim to find a method better suited to the use of shorter modulation analysis window durations.

Since the AME method has been found to be more effective than spectral subtraction in the acoustic domain, in this paper we explore the effectiveness of this method in the short-time modulation domain. For this purpose, the traditional analysis-modification-synthesis framework is extended to include modulation domain processing, and the noisy modulation spectrum is compensated for additive noise distortion by applying the MMSE short-time spectral magnitude estimation algorithm. The advantage of applying an MMSE-based method is that it does not introduce musical noise and hence can be used with shorter frame durations in the modulation domain. The proposed approach, referred to as the modulation magnitude estimator (MME), is demonstrated to give better noise removal than the AME approach, without the musical noise of spectral subtraction type approaches, or the spectral smearing of the ModSSub method. In the body of this paper, we provide enhancement results for the case of speech corrupted by additive white Gaussian noise (AWGN). We have also investigated enhancement performance for various coloured noises; the results, included in the Appendices, are qualitatively similar.

The rest of the paper is organised as follows. Section 2 details an AMS-based framework for enhancement in the short-time modulation domain. In Section 3 we describe the proposed MME approach, then in Section 4 we give details of the experiments used to tune the parameters of the MME method. In Section 5, the performance of the MME method is evaluated by comparison to a number of different speech enhancement approaches. In Section 6, we consider the effect of speech presence uncertainty and log-domain processing on the performance of the MME method. In Sections 7 and 8, we compare the quality of the proposed MME method to a wider range of enhancement methods, including different acoustic domain MMSE formulations and a number of modulation domain based speech enhancement methods. Final conclusions are drawn in Section 9.

2. AMS-based framework for speech enhancement in the short-time spectral modulation domain

As mentioned previously, many frequency domain speech enhancement methods are based on the (acoustic) short-time Fourier AMS framework (e.g., Lim and Oppenheim, 1979; Berouti et al., 1979; Ephraim and Malah, 1984; Ephraim and Malah, 1985; Martin, 1994; Sim et al., 1998; Virag, 1999; Cohen, 2005; Loizou, 2005). A traditional acoustic AMS procedure for speech enhancement consists of three stages: (1) the analysis stage, where the noisy speech is processed using STFT analysis; (2) the modification stage, where the noisy spectrum is compensated for noise distortion to produce the modified spectrum; and (3) the synthesis stage, where an inverse STFT operation is followed by overlap-add synthesis to reconstruct the enhanced signal. This framework has recently been extended to facilitate enhancement in the short-time spectral modulation domain (Paliwal et al., 2010). For this purpose, a secondary AMS procedure is utilised for framewise processing of the time series of each frequency component of the acoustic magnitude spectra. In this section, the details of the AMS-based framework for speech enhancement in the short-time spectral modulation domain are briefly reviewed.

Let us assume an additive noise model in which clean speech is corrupted by uncorrelated additive noise to produce noisy speech, as given by

$x(n) = s(n) + d(n)$,   (1)

where x(n), s(n), and d(n) are the noisy speech, clean speech, and noise signals, respectively, and n denotes a discrete-time index. The noisy speech signal is then processed using the running STFT analysis (Vary and Martin, 2006) given by

$X_l(k) = \sum_{n=0}^{N-1} x(n + lZ)\, v(n)\, e^{-j2\pi nk/N}$,   (2)

where l refers to the acoustic frame index, k refers to the index of the acoustic frequency, N is the acoustic frame duration (AFD) in samples, Z is the acoustic frame shift (AFS) in samples, and v(n) is the acoustic analysis window function. (Note that frame duration and window duration mean the same thing, and we use these two terms interchangeably in this paper.) In speech processing, an AFD of 20–40 ms along with an AFS of 10–20 ms and the Hamming analysis window are typically employed (e.g., Picone, 1993; Huang et al., 2001; Loizou, 2007; Paliwal and Wójcicki, 2008; Rabiner and Schafer, 2010).

In polar form, the STFT of the speech signal can be expressed as

$X_l(k) = |X_l(k)|\, e^{j\angle X_l(k)}$,   (3)

where $|X_l(k)|$ denotes the acoustic magnitude spectrum and $\angle X_l(k)$ denotes the acoustic phase spectrum. The time trajectories of each frequency component of the acoustic magnitude spectra are then processed framewise using a second AMS procedure, as outlined below. The running STFT is used to compute the modulation spectrum from the acoustic magnitude spectrum as follows:

$\mathcal{X}_\ell(k, m) = \sum_{l=0}^{N-1} |X_{\ell Z + l}(k)|\, u(l)\, e^{-j2\pi lm/N}$,   (4)

where $\ell$ is the modulation frame index, k is the index of the acoustic frequency, m refers to the index of the modulation frequency, N is the modulation frame duration (MFD) in terms of acoustic frames, Z is the modulation frame shift (MFS) in terms of acoustic frames, and u(l) is the modulation analysis window function. The modulation spectrum can be written in polar form as

$\mathcal{X}_\ell(k, m) = |\mathcal{X}_\ell(k, m)|\, e^{j\angle \mathcal{X}_\ell(k, m)}$,   (5)

where $|\mathcal{X}_\ell(k, m)|$ is the modulation magnitude spectrum, and $\angle \mathcal{X}_\ell(k, m)$ is the modulation phase spectrum.
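To make the dual-transform analysis of Eqs. (2) and (4) concrete, the following sketch computes the modulation spectrum as a second STFT taken over the time trajectory of each acoustic frequency bin. It is a minimal illustration assuming scipy's STFT routine; the function name, default parameter values, and return convention are ours, not the paper's. Here mfd and mfs are expressed in acoustic frames, so with a 1 ms AFS, 32 frames correspond to a 32 ms MFD.

```python
import numpy as np
from scipy.signal import stft

def modulation_spectrum(x, fs, afd_ms=32, afs_ms=1, mfd=32, mfs=2):
    """Compute the short-time modulation spectrum of Eq. (4) as an STFT
    over time of each acoustic magnitude trajectory (illustrative sketch).
    afd_ms/afs_ms: acoustic frame duration/shift in ms;
    mfd/mfs: modulation frame duration/shift in acoustic frames."""
    n_fft = int(fs * afd_ms / 1000)
    hop = int(fs * afs_ms / 1000)
    # First (acoustic) STFT, Eqs. (2)-(3): magnitude and phase spectra.
    _, _, X = stft(x, fs=fs, window='hamming', nperseg=n_fft,
                   noverlap=n_fft - hop)
    acoustic_mag, acoustic_phase = np.abs(X), np.angle(X)
    # Second STFT along the frame (time) axis of each acoustic bin,
    # giving the modulation spectrum of Eq. (4) as a function of
    # acoustic frequency k, modulation frequency m and frame index.
    _, _, M = stft(acoustic_mag, window='hamming', nperseg=mfd,
                   noverlap=mfd - mfs, axis=-1)
    return M, acoustic_phase
```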
In the present work, the modulation magnitude spectrum of clean speech is estimated from the noisy modulation magnitude spectrum, while the noisy modulation phase spectrum is left unchanged. (The relative importance of the modulation phase spectrum with respect to the modulation magnitude spectrum depends on the MFD; for example, the results of a recent study by Paliwal et al. (2011) suggest that for short MFDs ($\leq$ 64 ms) the modulation phase spectrum does not contribute significantly to speech intelligibility or quality.) The modified modulation spectrum is then given by

$\mathcal{Y}_\ell(k, m) = \widehat{S}_\ell(k, m)\, e^{j\angle \mathcal{X}_\ell(k, m)}$,   (6)

where $\widehat{S}_\ell(k, m)$ is an estimate of the clean modulation magnitude spectrum. Eq. (6) can also be written in terms of a spectral gain function, $G_\ell(k, m)$, applied to the modulation spectrum of noisy speech, as follows:

$\mathcal{Y}_\ell(k, m) = G_\ell(k, m)\, \mathcal{X}_\ell(k, m)$,   (7)

where

$G_\ell(k, m) = \widehat{S}_\ell(k, m) \,/\, |\mathcal{X}_\ell(k, m)|$.   (8)

The inverse STFT operation, followed by least-squares overlap-add synthesis (Quatieri, 2002), is then used to compute the modified acoustic magnitude spectrum, as given by

$|Y_l(k)| = \sum_{\ell} w(l - \ell Z) \sum_{m=0}^{N-1} \mathcal{Y}_\ell(k, m)\, e^{j2\pi (l - \ell Z) m / N}$,   (9)

where w(l) is a synthesis window function. The modified acoustic magnitude spectrum is combined with the noisy acoustic phase spectrum to produce the modified acoustic spectrum as follows:

$Y_l(k) = |Y_l(k)|\, e^{j\angle X_l(k)}$.   (10)

(Typically, AMS-based speech enhancement methods modify only the acoustic magnitude spectrum while keeping the acoustic phase spectrum unchanged. One reason for this is that for Hamming-windowed frames of 20–40 ms duration, the phase spectrum is considered unimportant for speech enhancement; e.g., Wang and Lim, 1982; Shannon and Paliwal, 2006.)

The enhanced speech signal is constructed by applying the inverse STFT operation, followed by least-squares overlap-add synthesis, to the modified acoustic spectrum, as given by

$y(n) = \sum_{l} w(n - lZ) \sum_{k=0}^{N-1} Y_l(k)\, e^{j2\pi (n - lZ) k / N}$.   (11)

A block diagram of the AMS-based framework for speech enhancement in the short-time spectral modulation domain is shown in Fig. 1.

[Fig. 1. Block diagram of the AMS-based framework for speech enhancement in the short-time spectral modulation domain.]
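Continuing the analysis sketch above, the synthesis side of Eqs. (9)-(11) can be illustrated as follows, again assuming scipy's inverse STFT for the two overlap-add steps; the clipping of small negative values and the trimming of frame-count mismatches are implementation details of this sketch, not of the paper.

```python
import numpy as np
from scipy.signal import istft

def synthesise(mod_spec, acoustic_phase, fs, afd_ms=32, afs_ms=1,
               mfd=32, mfs=2):
    """Invert the dual transform of Eqs. (9)-(11): an inverse modulation
    STFT with overlap-add recovers the modified acoustic magnitudes,
    which are recombined with the noisy acoustic phase (Eq. (10)) and
    inverted once more to give the enhanced signal (illustrative sketch)."""
    # Eq. (9): inverse STFT + overlap-add along the modulation-frame axis.
    _, mag = istft(mod_spec, window='hamming', nperseg=mfd,
                   noverlap=mfd - mfs, freq_axis=1, time_axis=-1)
    # Eq. (10): clip residual negative values from the inverse transform
    # and attach the unmodified noisy acoustic phase.
    L = min(mag.shape[-1], acoustic_phase.shape[-1])
    Y = np.maximum(mag[:, :L], 0.0) * np.exp(1j * acoustic_phase[:, :L])
    # Eq. (11): inverse acoustic STFT + overlap-add.
    n_fft = int(fs * afd_ms / 1000)
    hop = int(fs * afs_ms / 1000)
    _, y = istft(Y, fs=fs, window='hamming', nperseg=n_fft,
                 noverlap=n_fft - hop)
    return y
```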

3. Minimum mean-square error short-time spectral modulation magnitude estimator

The minimum mean-square error short-time spectral amplitude estimator of Ephraim and Malah (1984) has been employed in the past for speech enhancement in the acoustic frequency domain with much success. In the present work we investigate its use in the short-time spectral modulation domain. For this purpose, the AMS-based framework detailed in Section 2 is used. In the following discussions we refer to the original method of Ephraim and Malah (1984) as the MMSE acoustic magnitude estimator (AME), while the proposed modulation domain approach is referred to as the MMSE modulation magnitude estimator (MME). The details of the MME are presented in the remainder of this section.

In the MME method, the modulation magnitude spectrum of clean speech is estimated from noisy observations. The proposed estimator minimises the mean-square error between the modulation magnitude spectra of the clean and estimated speech:

$\epsilon = E\Big[ \big( |\mathcal{S}_\ell(k, m)| - \widehat{S}_\ell(k, m) \big)^2 \Big]$,   (12)

where E[·] denotes the expectation operator. A closed-form solution to this problem in the acoustic spectral domain was reported by Ephraim and Malah (1984) under the assumptions that speech and noise are additive in the time domain, and that their individual short-time spectral components are statistically independent, identically distributed, zero-mean Gaussian random variables. In the present work we make similar assumptions, namely that (1) speech and noise are additive in the short-time acoustic spectral magnitude domain, i.e.,

$|X_l(k)| = |S_l(k)| + |D_l(k)|$,   (13)

and (2) the individual short-time modulation spectral components of $\mathcal{S}_\ell(k, m)$ and $\mathcal{D}_\ell(k, m)$ are independent, identically distributed Gaussian random variables.

The reasoning for the first assumption is that at high SNRs the phase spectrum remains largely unchanged by additive noise distortion (Loizou, 2007). For the second assumption, we can apply an argument similar to that of Ephraim and Malah (1984), where the central limit theorem is used to justify the statistical independence of the spectral components of the Fourier transform. For the STFT, this assumption is valid only in the asymptotic sense, that is, when the frame duration is large. However, Ephraim and Malah used an acoustic frame duration of 32 ms in their formulation and obtained good results. In our use of the MMSE approach in the modulation domain, we should likewise make the modulation frame duration as large as possible; however, it must not be so large that it is adversely affected by the nonstationarity of the magnitude spectral sequence, as mentioned in the introduction. Keeping Ephraim and Malah's 32 ms acoustic frame duration in mind, we want to find a compromise between these two competing requirements. For this reason, we investigate in this paper the performance of our method as a function of the modulation frame duration.

With the above assumptions in mind, the modulation magnitude spectrum of clean speech can be estimated from the noisy modulation spectrum under the MMSE criterion (following Ephraim and Malah, 1984) as

$\widehat{S}_\ell(k, m) = E\big[\, |\mathcal{S}_\ell(k, m)| \,\big|\, \mathcal{X}_\ell(k, m) \,\big]$   (14)
$\qquad\quad\;\; = G_\ell(k, m)\, |\mathcal{X}_\ell(k, m)|$,   (15)

where $G_\ell(k, m)$ is the MMSE-MME spectral gain function given by

$G_\ell(k, m) = \dfrac{\sqrt{\pi}}{2} \dfrac{\sqrt{\nu_\ell(k, m)}}{\gamma_\ell(k, m)}\, K[\nu_\ell(k, m)]$,   (16)

in which $\nu_\ell(k, m)$ is defined as

$\nu_\ell(k, m) \triangleq \dfrac{\xi_\ell(k, m)}{1 + \xi_\ell(k, m)}\, \gamma_\ell(k, m)$,   (17)

and K[·] is the following function:

$K[\theta] = \exp\!\big(-\tfrac{\theta}{2}\big) \Big[ (1 + \theta)\, I_0\big(\tfrac{\theta}{2}\big) + \theta\, I_1\big(\tfrac{\theta}{2}\big) \Big]$,   (18)

where $I_0(\cdot)$ and $I_1(\cdot)$ denote the modified Bessel functions of zero and first order, respectively. In the above equations, $\xi_\ell(k, m)$ and $\gamma_\ell(k, m)$ are interpreted (after McAulay and Malpass, 1980) as the a priori SNR and the a posteriori SNR, respectively. These quantities are defined as

$\xi_\ell(k, m) \triangleq \dfrac{E\big[|\mathcal{S}_\ell(k, m)|^2\big]}{E\big[|\mathcal{D}_\ell(k, m)|^2\big]}$   (19)

and

$\gamma_\ell(k, m) \triangleq \dfrac{|\mathcal{X}_\ell(k, m)|^2}{E\big[|\mathcal{D}_\ell(k, m)|^2\big]}$.   (20)

Since in practice only noisy speech is observable, the parameters $\xi_\ell(k, m)$ and $\gamma_\ell(k, m)$ have to be estimated. For this task we apply the decision-directed approach (Ephraim and Malah, 1984) in the short-time spectral modulation domain. In the decision-directed method, the a priori SNR is estimated by recursive averaging as follows:

$\hat{\xi}_\ell(k, m) = \alpha\, \dfrac{\widehat{S}_{\ell-1}^{\,2}(k, m)}{\hat{\lambda}_{\ell-1}(k, m)} + (1 - \alpha)\, \max\big[\, \hat{\gamma}_\ell(k, m) - 1,\ 0 \,\big]$,   (21)

where $\alpha$ controls the trade-off between noise reduction and transient distortion (Cappe, 1994; Ephraim and Malah, 1984), $\hat{\lambda}_\ell(k, m)$ is an estimate of $\lambda_\ell(k, m) \triangleq E\big[|\mathcal{D}_\ell(k, m)|^2\big]$, and the a posteriori SNR estimate is obtained as

$\hat{\gamma}_\ell(k, m) = \dfrac{|\mathcal{X}_\ell(k, m)|^2}{\hat{\lambda}_\ell(k, m)}$.   (22)

Note that limiting the minimum value of the a priori SNR has a considerable effect on the nature of the residual noise (Ephraim and Malah, 1984; Cappe, 1994). For this reason, a lower bound $\xi_{\min}$ is typically used to prevent a priori SNR estimates from falling below a prescribed value, i.e.,

$\hat{\xi}_\ell(k, m) = \max\big[\, \hat{\xi}_\ell(k, m),\ \xi_{\min} \,\big]$.   (23)
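Collecting Eqs. (16)-(23), the gain computation with decision-directed a priori SNR estimation might be sketched as follows. The exponentially scaled Bessel functions i0e and i1e from scipy.special are used to evaluate Eq. (18) stably, since i0e(x) = exp(-x) I_0(x); prev_clean_mag2 stands for the previous frame's enhanced magnitude squared in Eq. (21), so the function is meant to be called frame by frame in order. Variable names and defaults are illustrative, not the authors' code.

```python
import numpy as np
from scipy.special import i0e, i1e  # exponentially scaled Bessel functions

def mmse_gain(noisy_mag2, noise_psd, prev_clean_mag2, alpha=0.998,
              xi_min_db=-25.0):
    """MMSE magnitude-estimator gain, Eqs. (16)-(23), evaluated over
    the (k, m) arrays of one modulation frame (illustrative sketch)."""
    gamma = noisy_mag2 / noise_psd                       # Eq. (22)
    # Decision-directed a priori SNR, Eq. (21), floored per Eq. (23).
    xi = (alpha * prev_clean_mag2 / noise_psd
          + (1.0 - alpha) * np.maximum(gamma - 1.0, 0.0))
    xi = np.maximum(xi, 10.0 ** (xi_min_db / 10.0))
    nu = xi / (1.0 + xi) * gamma                         # Eq. (17)
    # Eq. (18): K[nu] = exp(-nu/2) [(1+nu) I0(nu/2) + nu I1(nu/2)],
    # computed with scaled Bessels to avoid overflow for large nu.
    K = (1.0 + nu) * i0e(nu / 2.0) + nu * i1e(nu / 2.0)
    return (np.sqrt(np.pi) / 2.0) * (np.sqrt(nu) / gamma) * K  # Eq. (16)
```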
Many approaches have been employed in the literature for noise power spectrum estimation in the acoustic spectral domain (e.g., Scalart and Filho, 1996; Martin, 2001; Cohen and Berdugo, 2002; Loizou, 2007). In the present work, spectral modulation domain estimates are needed. For this task a simple procedure is employed, in which an initial estimate of the modulation power spectrum of the noise is computed from six leading silence frames. (With six non-overlapped frames in the modulation domain used for the initial noise estimation, around 220 ms of leading silence is required.) This estimate is then updated during speech absence using a recursive averaging rule (e.g., Scalart and Filho, 1996; Virag, 1999), applied in the modulation spectral domain as follows:

$\hat{\lambda}_\ell(k, m) = u\, \hat{\lambda}_{\ell-1}(k, m) + (1 - u)\, |\mathcal{X}_\ell(k, m)|^2$,   (24)

where u is a forgetting factor chosen depending on the stationarity of the noise. Speech presence or absence is determined using the statistical model-based voice activity detection (VAD) algorithm of Sohn et al. (1999), applied in the modulation spectral domain (more specifically, the decision-directed decision rule without hangover is used). A minimal sketch of this gated update is given below.
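In the sketch, the is_speech flag stands in for the Sohn et al. (1999) detector, which is not reproduced here, and the default forgetting factor follows the value quoted later in Section 4.2.

```python
def update_noise_psd(noise_psd, noisy_mag2, is_speech, u=0.98):
    """Recursive noise estimate of Eq. (24): update the modulation-domain
    noise power spectrum only during speech absence (illustrative sketch)."""
    if is_speech:
        return noise_psd  # freeze the estimate while speech is present
    return u * noise_psd + (1.0 - u) * noisy_mag2
```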

4. Subjective tuning of MME parameters

One of the reasons for the good performance of the AME method of Ephraim and Malah (1984) is that its parameters have been well tuned. In the current work, this MMSE estimator is applied in the spectral modulation domain. Consequently, the parameters of the proposed MME method need to be retuned.

The adjustable parameters of the MME approach include the acoustic frame duration (AFD), acoustic frame shift (AFS), modulation frame duration (MFD), modulation frame shift (MFS), as well as the smoothing parameter $\alpha$ and the lower bound $\xi_{\min}$ used in a priori SNR estimation. Tuning of some of these parameters can be done qualitatively from our knowledge of speech processing, and these can be fixed without further investigation. For example, speech can be assumed to be approximately stationary over short durations, and therefore acoustic frameworks typically use a short AFD of around 20–40 ms (e.g., Picone, 1993; Huang et al., 2001; Loizou, 2007; Paliwal and Wójcicki, 2008), which at the same time is long enough to provide reliable spectral estimates. Based on these qualitative reasons, an AFD of 32 ms was selected in this work. We have also chosen to use a 1 ms AFS to facilitate experimentation with a wide range of frame sizes and shifts in the modulation domain, and to increase the adaptability of the proposed method to changes in signal characteristics. For the other parameters, subjective listening tests were conducted to determine the values that maximise the subjective quality of stimuli enhanced using the MME method.

In the remainder of this section, we first describe details common to the subsequent experiments. These include the speech corpus, the settings used for stimuli generation, and the listening test procedure. We then present the experiments, results, and discussions. The section concludes with a summary of the tuned parameters.

4.1. Speech corpus

The Noizeus speech corpus (Loizou, 2007; Hu and Loizou, 2007), publicly available at http://www.utdallas.edu/~loizou/speech/noizeus, was used for the experiments presented in this section. The corpus contains 30 phonetically-balanced sentences belonging to six speakers (three males and three females), each having an average length of around 2.6 s. The recorded speech was originally sampled at 25 kHz. The recordings were then downsampled to 8 kHz and filtered to simulate the receiving frequency characteristics of telephone handsets. The corpus includes stimuli with non-stationary noises at different SNRs; for our experiments, only the clean stimuli were used. Corresponding noisy stimuli were generated by degrading the clean stimuli with additive white Gaussian noise (AWGN) at 5 dB SNR. Since use of the entire corpus was not feasible for human listening tests, four sentences were employed in our experiments. Of these, two (sp20 and sp22, belonging to a male and a female speaker) were used for parameter tuning, while the other two (sp10 and sp26, also belonging to a male and a female speaker) were used in subjective testing.

4.2. Stimuli

The settings used for the construction of MME stimuli are as follows. The Hamming window was used as both the acoustic and modulation analysis window function. The FFT analysis length was set to 2N for both acoustic and modulation domain processing. Least-squares overlap-add synthesis (Quatieri, 2002) was used for both acoustic and modulation syntheses. The threshold for the statistical voice activity detector (Sohn et al., 1999) was set to 0.15, and the forgetting factor u for noise estimate updates was set to 0.98. The AFD was set to 32 ms and the AFS was set to 1 ms. The other parameters used in the construction of MME stimuli for the experiments presented in this section are as defined in the description of each experiment. A consolidated sketch of these fixed settings, and of the noisy stimulus generation of Section 4.1, is given below.
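The dictionary keys and function below are invented for this sketch; only the values are taken from the text.

```python
import numpy as np

# Fixed settings from Section 4.2 (durations in ms; illustrative names).
MME_SETTINGS = {
    "window": "hamming",          # acoustic and modulation analysis windows
    "afd_ms": 32, "afs_ms": 1,    # acoustic frame duration / shift
    "vad_threshold": 0.15,        # Sohn et al. (1999) VAD threshold
    "noise_forgetting_u": 0.98,   # forgetting factor u in Eq. (24)
}

def add_awgn(clean, snr_db=5.0, seed=0):
    """Degrade a clean utterance with AWGN at the given global SNR in dB,
    as used to create the noisy stimuli (illustrative sketch)."""
    noise = np.random.default_rng(seed).standard_normal(len(clean))
    # Scale the noise so that 10*log10(P_speech / P_noise) equals snr_db.
    scale = np.sqrt(np.mean(clean ** 2)
                    / (np.mean(noise ** 2) * 10 ** (snr_db / 10.0)))
    return clean + scale * noise
```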
4.3. Listening test procedure

Subjective testing was done in the form of AB listening tests that determined parameter preference. For each subjective experiment, listening tests were conducted in a quiet room. Participants were familiarised with the task during a short practice session. The actual test consisted of stimuli pairs played back in randomised order over closed circumaural headphones at a comfortable listening level. For each stimuli pair, the listeners were presented with three labelled options on a computer and asked to make a subjective preference. The first and second options were used to indicate a preference for the corresponding stimuli, while the third option was used to indicate a similar preference for both stimuli. The listeners were instructed to use the third option only when they did not prefer one stimulus over the other. Pair-wise scoring was used, with a score of 1 awarded to the preferred treatment and 0 to the other. For the similar preference response, each treatment was awarded a score of 0.5. Participants could re-listen to stimuli if required.

4.4. Parameter tuning: modulation frame duration

Typical modulation domain methods use modulation frame durations (MFDs) of around 250 ms (Greenberg and Kingsbury, 1997; Thompson and Atlas, 2003; Kim, 2005; Falk and Chan, 2008; Wu et al., 2009; Falk and Chan, 2010; Falk et al., 2010; Paliwal et al., 2010). However, recent experiments (Paliwal et al., 2011) suggest that shorter MFDs may be better suited (in the context of intelligibility and quality) to processing of the modulation magnitude spectrum. Paliwal et al. (2011) also showed that objective quality decreased with increasing MFD. In this experiment we evaluate the effect of the MFD on the quality of stimuli enhanced using the MME method.

Enhanced stimuli were created by applying the MME method (see Section 3) to noisy speech (see Section 4.1). Using an MFS of 2 ms, $\alpha$ = 0.998, and $\xi_{\min}$ = -25 dB, MFD values of 32, 48, 64, 128 and 256 ms were investigated. The quality of the resulting stimuli was assessed through subjective listening tests using the procedure given in Section 4.3. Five subjects participated in this experiment. Each was presented with 40 comparisons. The session lasted approximately 10 min.

Mean subjective preference scores as a function of MFD are given in Fig. 2.

[Fig. 2. Mean subjective preference scores (%) for stimuli generated using MME with 2 ms MFS, $\alpha$ = 0.998, $\xi_{\min}$ = -25 dB, and MFD values of 32, 48, 64, 128, and 256 ms.]

The results show that use of long MFDs (such as 256 ms) reduces the quality of enhanced stimuli. The reason for this is that long frame durations cause spectral smearing, which can be heard as a reverberant type of distortion. On the other hand, use of short MFDs (such as 32–64 ms) produces stimuli of higher quality. Use of a 32 ms modulation frame duration is acceptable in the modulation domain for reasons similar to those used to justify the 32 ms acoustic frame duration chosen by Ephraim and Malah in their MMSE formulation, as discussed in Section 3. It is also noted that the results of this experiment are consistent with those reported by Paliwal et al. (2011), where shorter frame durations were found to work better for processing of the modulation magnitude spectrum. Based on the results of this experiment, an MFD of 32 ms was selected for use in the experiments presented in later sections.

4.5. Parameter tuning: modulation frame shift

The modulation frame shift (MFS) affects the ability of the MME method to adapt to changes in the properties of the signal, with shorter shifts offering some reduction in the introduced distortion during more transient parts. However, smaller shifts also add to the computational cost of the method.

In this experiment, we evaluate the effect of the MFS on the subjective quality of speech corrupted with 5 dB AWGN and enhanced with the MME method. For this experiment, the MFD is set to 32 ms.

