Single-Channel Speech Enhancement Using Spectral Subtraction in the Short-Time Modulation Domain


Single-channel speech enhancement using spectral subtraction in the short-time modulation domain

Kuldip Paliwal, Kamil Wójcicki and Belinda Schwerin
Signal Processing Laboratory, Griffith School of Engineering, Griffith University, Nathan QLD 4111, Australia

Abstract

In this paper we investigate the modulation domain as an alternative to the acoustic domain for speech enhancement. More specifically, we wish to determine how competitive the modulation domain is for spectral subtraction as compared to the acoustic domain. For this purpose, we extend the traditional analysis-modification-synthesis framework to include modulation domain processing. We then compensate the noisy modulation spectrum for additive noise distortion by applying the spectral subtraction algorithm in the modulation domain. Using an objective speech quality measure as well as formal subjective listening tests, we show that the proposed method results in improved speech quality. Furthermore, the proposed method achieves better noise suppression than the MMSE method. In this study, the effect of modulation frame duration on speech quality of the proposed enhancement method is also investigated. The results indicate that modulation frame durations of 180–280 ms provide a good compromise between different types of spectral distortions, namely musical noise and temporal slurring. Thus, given a proper selection of modulation frame duration, the proposed modulation spectral subtraction does not suffer from the musical noise artifacts typically associated with acoustic spectral subtraction. In order to achieve further improvements in speech quality, we also propose and investigate fusion of modulation spectral subtraction with the MMSE method. The fusion is performed in the short-time spectral domain by combining the magnitude spectra of the above speech enhancement algorithms. Subjective and objective evaluation of the speech enhancement fusion shows consistent speech quality improvements across input SNRs.

Key words: Speech enhancement, modulation spectral subtraction, speech enhancement fusion, analysis-modification-synthesis (AMS), musical noise

1. Introduction

Speech enhancement aims at improving the quality of noisy speech. This is normally accomplished by reducing the noise (in such a way that the residual noise is not annoying to the listener), while minimising the speech distortion introduced during the enhancement process. In this paper we concentrate on the single-channel speech enhancement problem, where the signal is derived from a single microphone. This is especially useful in mobile communication applications, where only a single microphone is available due to cost and size considerations.

Many popular single-channel speech enhancement methods employ the analysis-modification-synthesis (AMS) framework (Allen, 1977; Allen and Rabiner, 1977; Crochiere, 1980; Portnoff, 1981; Griffin and Lim, 1984; Quatieri, 2002) to perform enhancement in the acoustic spectral domain (Loizou, 2007). The AMS framework consists of three stages: 1) the analysis stage, where the input speech is processed using short-time Fourier transform (STFT) analysis; 2) the modification stage, where the noisy spectrum undergoes some kind of modification; and 3) the synthesis stage, where the inverse STFT is followed by overlap-add synthesis to reconstruct the output signal.
In this paper, we investigate speech enhancement in the modulation spectral domain by extending the acoustic AMS framework to include modulation domain processing.

Zadeh (1950) was perhaps the first to propose a two-dimensional bi-frequency system, where the second dimension for frequency analysis was the transform of the time variation of the standard (acoustic) frequency. More recently, Atlas et al. (2004) defined acoustic frequency as the axis of the first STFT of the input signal and modulation frequency as the independent variable of the second STFT transform. We therefore differentiate the acoustic spectrum from the modulation spectrum as follows. The acoustic spectrum is the STFT of the speech signal, while the modulation spectrum at a given acoustic frequency

is the STFT of the time series of the acoustic spectral magnitudes at that frequency. The short-time modulation spectrum is thus a function of time, acoustic frequency and modulation frequency.
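To make this definition concrete, the following short Python sketch (our own illustration, not code from the paper) computes an acoustic spectrogram with a first STFT, and then takes a second STFT along the time trajectory of the spectral magnitudes at one acoustic frequency bin. All parameter values here are placeholders chosen only for the demonstration.

```python
import numpy as np
from scipy.signal import stft

fs = 8000
x = np.random.randn(fs)  # stand-in for one second of speech

# First STFT: the acoustic spectrum X(n, k).
# Frame shift = nperseg - noverlap = 64 samples (8 ms at 8 kHz).
f_ac, t_ac, X = stft(x, fs=fs, nperseg=256, noverlap=192)
mag = np.abs(X)  # acoustic magnitude spectrogram |X(n, k)|

# Second STFT: the modulation spectrum at one acoustic bin k.
# The magnitude trajectory is sampled at the acoustic frame rate,
# here fs / 64 = 125 Hz; 32 trajectory samples span about 256 ms.
k = 10
f_mod, t_mod, X_mod = stft(mag[k], fs=fs / 64, nperseg=32)
```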
There is growing psychoacoustic and physiological evidence to support the significance of the modulation domain in the analysis of speech signals. Experiments of Bacon and Grantham (1989), for example, showed that there are channels in the auditory system which are tuned for the detection of modulation frequencies. Sheft and Yost (1990) showed that our perception of temporal dynamics corresponds to our perceptual filtering into modulation frequency channels and that faithful representation of these modulations is critical to our perception of speech. Experiments of Schreiner and Urbas (1986) showed that a neural representation of amplitude modulation is preserved through all levels of the mammalian auditory system, including the highest level of audition, the auditory cortex. Neurons in the auditory cortex are thought to decompose the acoustic spectrum into spectro-temporal modulation content (Mesgarani and Shamma, 2005), and are best driven by sounds that combine both spectral and temporal modulations (Kowalski et al., 1996; Shamma, 1996; Depireux et al., 2001).

Low frequency modulations of sound have been shown to be the fundamental carriers of information in speech (Atlas and Shamma, 2003). Drullman et al. (1994b,a), for example, investigated the importance of modulation frequencies for intelligibility by applying low-pass and high-pass filters to the temporal envelopes of acoustic frequency subbands. They showed frequencies between 4 and 16 Hz to be important for intelligibility, with the region around 4–5 Hz being the most significant. In a similar study, Arai et al. (1996) showed that applying band-pass filters between 1 and 16 Hz does not impair speech intelligibility.

While the envelope of the acoustic magnitude spectrum represents the shape of the vocal tract, the modulation spectrum represents how the vocal tract changes as a function of time. It is these temporal changes that convey most of the linguistic information (or intelligibility) of speech. In the above intelligibility studies, the lower limit of 1 Hz stems from the fact that slow vocal tract changes do not convey much linguistic information. In addition, the lower limit helps to make speech communication more robust, since the majority of noises occurring in nature vary slowly as a function of time and hence their modulation spectrum is dominated by modulation frequencies below 1 Hz. The upper limit of 16 Hz is due to the physiological limitation on how fast the vocal tract is able to change with time.

Modulation domain processing has grown in popularity, finding applications in areas such as speech coding (Atlas and Vinton, 2001; Thompson and Atlas, 2003; Atlas, 2003), speech recognition (Hermansky and Morgan, 1994; Nadeu et al., 1997; Kingsbury et al., 1998; Kanedera et al., 1999; Tyagi et al., 2003; Xiao et al., 2007; Lu et al., 2010), speaker recognition (Vuuren and Hermansky, 1998; Malayath et al., 2000; Kinnunen, 2006; Kinnunen et al., 2008), and objective speech intelligibility evaluation (Steeneken and Houtgast, 1980; Payton and Braida, 1999; Greenberg and Arai, 2001; Goldsworthy and Greenberg, 2004; Kim, 2004), as well as speech enhancement. In the latter category, a number of modulation filtering methods have emerged. For example, Hermansky et al. (1995) proposed band-pass filtering of the time trajectories of the cubic-root compressed short-time power spectrum for enhancement of speech corrupted by additive noise. More recently in (Falk et al., 2007; Lyons and Paliwal, 2008), similar band-pass filtering was applied to the time trajectories of the short-time power spectrum for speech enhancement.

There are two main limitations associated with typical modulation filtering methods. First, they use a filter design based on the long-term properties of the speech modulation spectrum, while ignoring the properties of noise. As a consequence, they fail to eliminate noise components present within the speech modulation regions. Second, the modulation filter is fixed and applied to the entire signal, even though the properties of speech and noise change over time. In the proposed method, we attempt to address these limitations by processing the modulation spectrum on a frame-by-frame basis. In our approach, we assume the noise to be additive in nature and enhance noisy speech by applying a spectral subtraction algorithm, similar to the one proposed by Berouti et al. (1979), in the modulation domain.

In this paper, we evaluate how competitive the modulation domain is for speech enhancement as compared to the acoustic domain. For this purpose, objective and subjective speech enhancement experiments were carried out. The results of these experiments demonstrate that the modulation domain is a useful alternative to the acoustic domain. We also investigate fusion of the proposed technique with the MMSE method for further speech quality improvements.

In the main body of this paper, we provide the enhancement results for the case of speech corrupted by additive white Gaussian noise (AWGN). We have also investigated enhancement performance for various coloured noises and the results were found to be qualitatively similar. In order not to clutter the main body of this paper, we include the results for the coloured noises in Appendix C.

The rest of this paper is organised as follows. Section 2 details the traditional AMS-based speech processing. Section 3 presents details of the proposed modulation domain speech enhancement method along with the discussion of objective and subjective enhancement experiments and their results. Section 4 gives the details of the proposed speech enhancement fusion algorithm, along with experimental evaluation and results. Final conclusions are drawn in Section 5.

2. Acoustic analysis-modification-synthesis

Let us consider an additive noise model

    x(n) = s(n) + d(n),    (1)

where n is the discrete-time index, while x(n), s(n) and d(n) denote discrete-time signals of noisy speech, clean speech and noise, respectively. Since speech can be assumed to be quasi-stationary, it is analysed frame-wise using short-time Fourier analysis. The STFT of the corrupted speech signal x(n) is given by

    X(n, k) = \sum_{l} x(l)\, w(n - l)\, e^{-j 2\pi k l / N},    (2)

where k refers to the index of the discrete acoustic frequency, N is the acoustic frame duration (in samples) and w(n) is an acoustic analysis window function.¹ In speech processing, the Hamming window with 20–40 ms duration is typically employed (Paliwal and Wójcicki, 2008). Using STFT analysis we can represent Eq. (1) as

    X(n, k) = S(n, k) + D(n, k),    (3)

where X(n, k), S(n, k), and D(n, k) are the STFTs of noisy speech, clean speech, and noise, respectively. Each of these can be expressed in terms of an acoustic magnitude spectrum and an acoustic phase spectrum. For instance, the STFT of the noisy speech signal can be written in polar form as

    X(n, k) = |X(n, k)|\, e^{j \angle X(n, k)},    (4)

where |X(n, k)| denotes the acoustic magnitude spectrum and ∠X(n, k) denotes the acoustic phase spectrum.²

Traditional AMS-based speech enhancement methods modify, or enhance, only the noisy acoustic magnitude spectrum while keeping the noisy acoustic phase spectrum unchanged. The reason for this is that for Hamming windowed frames (of 20–40 ms duration) the phase spectrum is considered unimportant for speech enhancement (Wang and Lim, 1982; Shannon and Paliwal, 2006). Such algorithms attempt to estimate the magnitude spectrum of clean speech. Let us denote the enhanced magnitude spectrum as |Ŝ(n, k)|; then the modified spectrum is constructed by combining |Ŝ(n, k)| with the noisy phase spectrum, as follows:

    Y(n, k) = |\hat{S}(n, k)|\, e^{j \angle X(n, k)}.    (5)

The enhanced speech signal, y(n), is constructed by taking the inverse STFT of the modified acoustic spectrum followed by least-squares overlap-add synthesis (Griffin and Lim, 1984; Quatieri, 2002):

    y(n) = \frac{1}{W_0(n)} \sum_{l} \left[ \frac{1}{N} \sum_{k=0}^{N-1} Y(l, k)\, e^{j 2\pi n k / N} \right] w_s(l - n),    (6)

where w_s(n) is the synthesis window function, and W_0(n) is given by

    W_0(n) = \sum_{l} w_s^2(l - n).    (7)

In the present study, as the synthesis window we employ the modified Hanning window (Griffin and Lim, 1984), given by

    w_s(n) = \begin{cases} 0.5 - 0.5 \cos\big( 2\pi (n + 0.5)/N \big), & 0 \le n < N \\ 0, & \text{otherwise.} \end{cases}    (8)

Note that the use of the modified Hanning window means that W_0(n) in Eq. (7) is constant (i.e., independent of n).

A block diagram of a traditional AMS-based speech enhancement framework is shown in Fig. 1.

Fig. 1: Block diagram of a traditional AMS-based acoustic domain speech enhancement procedure.

¹ Note that in principle, Eq. (2) could be computed for every acoustic sample; however, in practice it is typically computed for each acoustic frame (and acoustic frames are progressed by some frame shift). We do not show this decimation explicitly in order to keep the mathematical notation concise.
² In our discussions, when referring to the magnitude, phase or (complex) spectra, the STFT modifier is implied unless otherwise stated. Also, wherever appropriate, we employ the acoustic and modulation modifiers to disambiguate between acoustic and modulation domains.
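To make the AMS pipeline of Eqs. (2)–(8) concrete, here is a minimal Python/NumPy sketch of the acoustic-domain loop. It is our own illustration under assumed parameter values, not the authors' implementation; `modify` stands in for whichever magnitude-domain enhancement rule is used, and Eq. (5) is realised by keeping the noisy phase.

```python
import numpy as np

def modified_hanning(N):
    # Synthesis window of Eq. (8): 0.5 - 0.5*cos(2*pi*(n + 0.5)/N).
    n = np.arange(N)
    return 0.5 - 0.5 * np.cos(2 * np.pi * (n + 0.5) / N)

def ams(x, frame_len, frame_shift, modify):
    """Analysis-modification-synthesis, Eqs. (2)-(7).

    `modify` maps a magnitude spectrum to an enhanced magnitude
    spectrum; the noisy phase is kept unchanged (Eq. (5)).
    """
    w = np.hamming(frame_len)          # acoustic analysis window
    ws = modified_hanning(frame_len)   # synthesis window, Eq. (8)
    n_frames = 1 + (len(x) - frame_len) // frame_shift
    y = np.zeros(len(x))
    wsum = np.zeros(len(x))            # accumulates W0(n) of Eq. (7)
    for i in range(n_frames):
        start = i * frame_shift
        frame = x[start:start + frame_len] * w
        X = np.fft.fft(frame)                    # Eq. (2)
        mag, phase = np.abs(X), np.angle(X)      # Eq. (4)
        Y = modify(mag) * np.exp(1j * phase)     # Eq. (5)
        y_frame = np.real(np.fft.ifft(Y)) * ws   # windowed synthesis
        y[start:start + frame_len] += y_frame    # overlap-add, Eq. (6)
        wsum[start:start + frame_len] += ws ** 2
    return y / np.maximum(wsum, 1e-12)

# Sanity check: the identity modification approximately reconstructs
# the input (up to a window-dependent gain).
if __name__ == "__main__":
    x = np.random.randn(8000)
    y = ams(x, frame_len=256, frame_shift=64, modify=lambda m: m)
```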

3. Modulation spectral subtraction

3.1. Introduction

Classical spectral subtraction (Boll, 1979; Berouti et al., 1979; Lim and Oppenheim, 1979) is an intuitive and effective speech enhancement method for the removal of additive noise. Spectral subtraction does, however, suffer from perceptually annoying spectral artifacts referred to as musical noise. Many approaches that attempt to address this problem have been investigated in the literature (e.g., Vaseghi and Frayling-Cork, 1992; Cappe, 1994; Virag, 1999; Hasan et al., 2004; Hu and Loizou, 2004; Lu, 2007).

In this section, we propose to apply the spectral subtraction algorithm in the short-time modulation domain. Traditionally, the modulation spectrum has been computed as the Fourier transform of the intensity envelope of a band-pass filtered signal (e.g., Houtgast and Steeneken, 1985; Drullman et al., 1994a; Goldsworthy and Greenberg, 2004). The method proposed in our study, however, uses the short-time Fourier transform (STFT) instead of band-pass filtering. In the acoustic STFT domain, the quantity closest to the intensity envelope of a band-pass filtered signal is the magnitude-squared spectrum. However, in the present paper we use the time trajectories of the short-time acoustic magnitude spectrum for the computation of the short-time modulation spectrum. This choice is motivated by more recently reported papers dealing with modulation-domain processing based speech applications (Falk et al., 2007; Kim, 2005), and is also justified empirically in Appendix B. Once the modulation spectrum is computed, spectral subtraction is done in the modulation magnitude-squared domain. Empirical justification for the use of modulation magnitude-squared spectra is also given in Appendix B.

The proposed approach is then evaluated through both objective and subjective speech enhancement experiments as well as through spectrogram analysis. We show that given a proper selection of modulation frame duration, the proposed method results in improved speech quality and does not suffer from musical noise artifacts.

3.2. Procedure

The proposed speech enhancement method extends the traditional AMS-based acoustic domain enhancement to the modulation domain. To achieve this, each frequency component of the acoustic magnitude spectra, obtained during the analysis stage of the acoustic AMS procedure outlined in Section 2, is processed frame-wise across time using a secondary (modulation) AMS framework. Thus the modulation spectrum is computed using STFT analysis as follows:

    X(\eta, k, m) = \sum_{l} |X(l, k)|\, v(\eta - l)\, e^{-j 2\pi m l / M},    (9)

where η is the acoustic frame number,³ k refers to the index of the discrete acoustic frequency, m refers to the index of the discrete modulation frequency, M is the modulation frame duration (in terms of acoustic frames) and v(η) is a modulation analysis window function. The resulting spectra can be expressed in polar form as

    X(\eta, k, m) = |X(\eta, k, m)|\, e^{j \angle X(\eta, k, m)},    (10)

where |X(η, k, m)| is the modulation magnitude spectrum and ∠X(η, k, m) is the modulation phase spectrum.

We propose to replace |X(η, k, m)| with |Ŝ(η, k, m)|, where |Ŝ(η, k, m)| is an estimate of the clean modulation magnitude spectrum obtained using a spectral subtraction rule similar to the one proposed by Berouti et al. (1979) and given by Eq. (11):

    |\hat{S}(\eta, k, m)| = \begin{cases} \left[ |X(\eta, k, m)|^\gamma - \rho\, |\hat{D}(\eta, k, m)|^\gamma \right]^{1/\gamma}, & \text{if } |X(\eta, k, m)|^\gamma - \rho\, |\hat{D}(\eta, k, m)|^\gamma > \beta\, |\hat{D}(\eta, k, m)|^\gamma \\ \left[ \beta\, |\hat{D}(\eta, k, m)|^\gamma \right]^{1/\gamma}, & \text{otherwise.} \end{cases}    (11)

In Eq. (11), ρ denotes the subtraction factor that governs the amount of over-subtraction; β is the spectral floor parameter used to set spectral magnitude values falling below the spectral floor, [β |D̂(η, k, m)|^γ]^{1/γ}, to that spectral floor; and γ determines the subtraction domain, e.g., for γ set to unity the subtraction is performed in the magnitude spectral domain, while for γ = 2 the subtraction is performed in the magnitude-squared spectral domain.

The estimate of the modulation magnitude spectrum of the noise, denoted by |D̂(η, k, m)|, is obtained based on a decision from a simple voice activity detector (VAD) (Loizou, 2007), applied in the modulation domain. The VAD classifies each modulation domain segment as either 1 (speech present) or 0 (speech absent), using the following binary rule:

    \Phi(\eta, k) = \begin{cases} 1, & \text{if } \varphi(\eta, k) \ge \theta \\ 0, & \text{otherwise,} \end{cases}    (12)

where φ(η, k) denotes a modulation segment SNR computed as follows:

    \varphi(\eta, k) = 10 \log_{10} \left( \frac{ \sum_{m} |X(\eta, k, m)|^2 }{ \sum_{m} |\hat{D}(\eta - 1, k, m)|^2 } \right),    (13)

and θ is an empirically determined speech presence threshold.

³ Note that in principle, Eq. (9) could be computed for every acoustic frame; however, in practice we compute it for every modulation frame. We do not show this decimation explicitly in order to keep the mathematical notation concise.
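The per-bin procedure can be sketched in Python as follows, covering the secondary STFT of Eq. (9), the subtraction rule of Eq. (11), the VAD of Eqs. (12)–(13), and, looking ahead, the noise update and phase recombination of Eqs. (14)–(15) given below. This is a minimal illustration with our own names and defaults, not the authors' implementation; in particular a fixed over-subtraction factor `rho` is assumed here, whereas the paper selects ρ as in Berouti et al. (1979).

```python
import numpy as np

def modulation_stft(mag_traj, v):
    """Eq. (9): STFT of one modulation frame of the magnitude
    trajectory |X(l, k)| of a single acoustic bin (len(v) == M)."""
    return np.fft.fft(mag_traj * v)

def segment_snr_db(X_mod, D_mag):
    """Eq. (13): modulation segment SNR against the previous
    noise magnitude estimate."""
    return 10.0 * np.log10(np.sum(np.abs(X_mod) ** 2)
                           / np.sum(D_mag ** 2))

def speech_present(X_mod, D_mag, theta_db=3.0):
    """Eq. (12): binary speech presence decision for this segment."""
    return segment_snr_db(X_mod, D_mag) >= theta_db

def subtract(X_mod, D_mag, rho=4.0, beta=0.002, gamma=2.0):
    """Eq. (11): Berouti-style over-subtraction with a spectral
    floor; returns the enhanced modulation magnitude |S_hat|.
    A fixed rho is an assumption of this sketch."""
    diff = np.abs(X_mod) ** gamma - rho * D_mag ** gamma
    floor = beta * D_mag ** gamma
    return np.where(diff > floor, diff, floor) ** (1.0 / gamma)

def update_noise(D_mag, X_mod, present, lam=0.98, gamma=2.0):
    """Eq. (14) below: recursive averaging during speech absence."""
    if present:
        return D_mag
    mixed = lam * D_mag ** gamma + (1.0 - lam) * np.abs(X_mod) ** gamma
    return mixed ** (1.0 / gamma)

def enhance_segment(X_mod, D_mag):
    """One modulation segment: VAD, noise update, subtraction, and
    recombination with the noisy modulation phase (Eq. (15) below)."""
    present = speech_present(X_mod, D_mag)
    D_mag = update_noise(D_mag, X_mod, present)
    S_mag = subtract(X_mod, D_mag)
    Z = S_mag * np.exp(1j * np.angle(X_mod))
    return Z, D_mag
```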

The noise estimate is updated during speech absence using the following averaging rule (Virag, 1999):

    |\hat{D}(\eta, k, m)|^\gamma = \lambda\, |\hat{D}(\eta - 1, k, m)|^\gamma + (1 - \lambda)\, |X(\eta, k, m)|^\gamma,    (14)

where λ is a forgetting factor chosen depending on the stationarity of the noise.⁴

The modified modulation spectrum is produced by combining |Ŝ(η, k, m)| with the noisy modulation phase spectrum as follows:

    Z(\eta, k, m) = |\hat{S}(\eta, k, m)|\, e^{j \angle X(\eta, k, m)}.    (15)

Note that unlike the acoustic phase spectrum, the modulation phase spectrum does contain useful information (Hermansky et al., 1995). In the present work, we keep ∠X(η, k, m) unchanged; however, future work will investigate approaches that can be used to enhance it. In the present study, we obtain the estimate of the modified acoustic magnitude spectrum |Ŝ(n, k)| by taking the inverse STFT of Z(η, k, m) followed by overlap-add with synthesis windowing. A block diagram of the proposed approach is shown in Fig. 2.

Fig. 2: Block diagram of the proposed AMS-based modulation domain speech enhancement procedure.

3.3. Experiments

In this section we detail objective and subjective speech enhancement experiments that assess the suitability of modulation spectral subtraction for speech enhancement.

3.3.1. Speech corpus

In our experiments we employ the Noizeus speech corpus (Loizou, 2007; Hu and Loizou, 2007).⁵ Noizeus is composed of 30 phonetically-balanced sentences belonging to six speakers, three males and three females. The corpus is sampled at 8 kHz and filtered to simulate the receiving frequency characteristics of telephone handsets. Noizeus comes with non-stationary noises at different SNRs. For our experiments we keep the clean part of the corpus and generate noisy stimuli by degrading the clean stimuli with additive white Gaussian noise (AWGN) at various SNRs. The noisy stimuli are constructed such that they begin with a noise-only section long enough for (initial) noise estimation in both acoustic and modulation domains (approx. 500 ms).

3.3.2. Stimuli types

Modulation spectral subtraction (ModSpecSub) stimuli were constructed using the procedure detailed in Section 3.2. The acoustic frame duration was set to 32 ms, with an 8 ms frame shift, and the modulation frame duration was set to 256 ms, with a 32 ms frame shift. Note that modulation frame durations between 180 ms and 280 ms were found to work well. However, at shorter durations musical noise was present, while at longer durations a slurring effect was observed. The duration of 256 ms was chosen as a good compromise. A more detailed look at the effect of modulation frame duration on speech quality of ModSpecSub stimuli is presented in Appendix A. The Hamming window was used for both the acoustic and modulation analysis windows. The FFT analysis length was set to 2N and 2M for the acoustic and modulation AMS frameworks, respectively. The value of the subtraction parameter ρ was selected as described in (Berouti et al., 1979). The spectral floor parameter β was set to 0.002. Magnitude-squared spectral subtraction was used in the modulation domain, i.e., γ = 2. The speech presence threshold θ was set to 3 dB. The forgetting factor λ was set to 0.98. Griffin and Lim's method for windowed overlap-add synthesis (Griffin and Lim, 1984) was used for both acoustic and modulation syntheses.

⁴ Note that due to the temporal processing over relatively long frames, the use of VAD for noise estimation will not achieve truly adaptive noise estimates. This is one of the limitations of the proposed method as discussed in Section 3.4.
⁵ The Noizeus speech corpus is publicly available on-line at the following url: http://www.utdallas.edu/~loizou/speech/noizeus.
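For reference, the ModSpecSub settings just listed can be gathered into a single configuration. The sketch below is our own arrangement of the values reported above, not code from the paper; the derived quantities assume the 8 kHz Noizeus sampling rate.

```python
# ModSpecSub settings as reported in Section 3.3.2 (our arrangement).
FS = 8000  # Hz, Noizeus sampling rate

MODSPECSUB_CONFIG = {
    "acoustic_frame_ms": 32,     # N = 256 samples at 8 kHz
    "acoustic_shift_ms": 8,      # 64 samples
    "modulation_frame_ms": 256,  # M = 32 acoustic frames
    "modulation_shift_ms": 32,   # 4 acoustic frames
    "window": "hamming",         # analysis window in both domains
    "fft_factor": 2,             # FFT length 2N (acoustic), 2M (modulation)
    "beta": 0.002,               # spectral floor
    "gamma": 2,                  # magnitude-squared subtraction
    "theta_db": 3.0,             # speech presence threshold
    "lambda_forget": 0.98,       # noise update forgetting factor
    # rho is SNR-dependent, selected as in Berouti et al. (1979)
}

N = FS * MODSPECSUB_CONFIG["acoustic_frame_ms"] // 1000            # 256
M = (MODSPECSUB_CONFIG["modulation_frame_ms"]
     // MODSPECSUB_CONFIG["acoustic_shift_ms"])                    # 32
```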

For our experiments we have also generated stimuli using two popular speech enhancement methods, namely acoustic spectral subtraction (SpecSub) (Berouti et al., 1979) and the MMSE method (Ephraim and Malah, 1984). Publicly available reference implementations of these methods (Loizou, 2007) were employed in our study. In the SpecSub method, the subtraction was performed in the magnitude-squared spectral domain, with the noise spectrum estimates obtained through recursive averaging of non-speech frames. Speech presence or absence was determined using a voice activity detection (VAD) algorithm based on a simple segmental SNR measure (Loizou, 2007). In the MMSE method, optimal estimates (in the minimum mean square error sense) of the short-time spectral amplitudes were computed. The decision-directed approach was used for the a priori SNR estimation, with the smoothing factor α set to 0.98.⁶ In the MMSE method, noise spectrum estimates were computed from non-speech frames using recursive averaging, with speech presence or absence determined using a log-likelihood ratio based VAD (Loizou, 2007). Further details on the implementation of both methods are given in (Loizou, 2007).

In addition to the ModSpecSub, SpecSub, and MMSE stimuli, clean and noisy speech stimuli were also included in our experiments. Example spectrograms for the above stimuli are shown in Fig. 3.⁷,⁸

Fig. 3: Spectrograms of sp10 utterance, "The sky that morning was clear and bright blue", by a male speaker from the Noizeus speech corpus: (a) clean speech (PESQ: 4.50); (b) speech degraded by AWGN at 5 dB SNR (PESQ: 1.80); as well as the noisy speech enhanced using: (c) acoustic spectral subtraction (SpecSub) (Berouti et al., 1979) (PESQ: 2.07); (d) the MMSE method (Ephraim and Malah, 1984) (PESQ: 2.26); and (e) modulation spectral subtraction (ModSpecSub) (PESQ: 2.42).

3.3.3. Objective experiment

The objective experiment was carried out over the Noizeus corpus for AWGN at 0, 5, 10 and 15 dB SNR. Perceptual evaluation of speech quality (PESQ) (Rix et al., 2001) was used to predict mean opinion scores for the stimuli types outlined in Section 3.3.2.

3.3.4. Subjective experiment

The subjective evaluation was in the form of AB listening tests that determine method preference. Two Noizeus sentences (sp10 and sp27) belonging to male and female speakers were included. AWGN at 5 dB SNR was investigated. The stimuli types detailed in Section 3.3.2 were included. Fourteen English speaking listeners participated in this experiment. None of the participants reported any hearing defects. The listening tests were conducted in a quiet room. The participants were familiarised with the task during a short practice session. The actual test consisted of 40 stimuli pairs played back in randomised order over closed circumaural headphones at a comfortable listening level. For each stimuli pair, the listeners were presented with three labeled options on a digital computer and asked to make a subjective preference. The first and second options were used to indicate a preference for the corresponding stimuli, while the third option was used to indicate a similar preference for both stimuli. The listeners were instructed to use the third option only when they did not prefer one stimulus over the other. Pairwise scoring was employed, with a score of 1 awarded to the preferred method and 0 to the other. For a similar preference response each method was awarded a score of 0.5. The participants were allowed to re-listen to stimuli if required. The responses were collected via keyboard. No feedback was given.

⁶ Please note that in the decision-directed approach for the a priori SNR estimation, the smoothing parameter α has a significant effect on the type and intensity of the residual noise present in the enhanced speech (Cappe, 1994). While the MMSE stimuli used in the experiments presented in the main body of this paper were constructed with α set to 0.98, a supplementary examination of the effect of α on speech quality of the MMSE stimuli is provided in Appendix D.
⁷ Note that all spectrograms presented in this study have the dynamic range set to 60 dB. The highest spectral peaks are shown in black, while the lowest spectral valleys (60 dB below the highest peaks) are shown in white. Shades of gray are used in-between.
⁸ The audio stimuli files are available on-line from the following url: b/.
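As an aside on the MMSE configuration, the decision-directed a priori SNR estimator referred to above has the standard form of Ephraim and Malah (1984); a sketch with our own variable names follows. This is the textbook rule, not a transcription of the reference implementation in (Loizou, 2007).

```python
import numpy as np

def decision_directed_xi(S_prev_mag, noise_psd, X_mag, alpha=0.98):
    """Decision-directed a priori SNR estimate (Ephraim and Malah, 1984).

    S_prev_mag : enhanced magnitude spectrum of the previous frame
    noise_psd  : current noise power spectrum estimate
    X_mag      : noisy magnitude spectrum of the current frame
    alpha      : smoothing factor (0.98 in this paper's experiments)
    """
    gamma_k = (X_mag ** 2) / noise_psd  # a posteriori SNR
    xi = (alpha * (S_prev_mag ** 2) / noise_psd
          + (1.0 - alpha) * np.maximum(gamma_k - 1.0, 0.0))
    return xi
```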

Fig. 4: Speech enhancement results for the objective experiment detailed in Section 3.3.3. The results are in terms of mean PESQ scores as a function of input SNR (dB) for AWGN over the Noizeus corpus.

Fig. 5: Speech enhancement results for the subjective experiment detailed in Section 3.3.4. The results are in terms of mean preference scores for AWGN at 5 dB SNR for two Noizeus utterances (sp10 and sp17).

3.4. Results and discussion

The results of the objective experiment, in terms of mean PESQ scores, are shown in Fig. 4. The proposed method performs consistently well across the SNR range, with particular improvements shown for stimuli with lower input SNRs. The MMSE method showed the next best performance, with all enhancement methods achieving comparable results at 15 dB SNR.

The results of the subjective experiment are shown in Fig. 5.

As noted in Section 3.1, acoustic spectral subtraction produces perceptually annoying sounds referred to as musical noise. This is clearly visible in the SpecSub spectrogram of Fig. 3(c). On the other hand, the proposed method subtracts the modulation magnitude spectrum estimate of the noise from the modulation magnitude spectrum of the noisy speech along each acoustic frequency bin. While some spectral magnitude variation is still present in the resulting acoustic spectrum, the residual peaks have much smaller magnitudes. As a result, ModSpecSub stimuli do not suffer from the musical noise audible in SpecSub stimuli (given a proper selection of modulation frame duration, as discussed in Appendix A). This can be seen by comparing the spectrograms in Fig. 3(c) and Fig. 3(e).

The MMSE method does not suffer from the problem of musical noise (Cappe, 1994; Loizou, 2007); however, it does not suppress background noise as effectively as the proposed method. This can be seen by comparing the spectrograms in Fig. 3(d) and Fig. 3(e). In addition, listeners found the residual noise present after MMSE enhancement to be perceptually distracting. On the other hand, the proposed method uses larger frame durations in order to avoid musical noise (see Appendix A). As a result, stationarity has to be assumed over a larger duration. This causes temporal slurring distortion. This kind of distortion is mostly absent in the MMSE stimuli constructed with smoothing factor α set to 0.98. The need for longer frame durations in the ModSpecSub method also means that larger non-speech durations are required to update noise estimates. This makes the proposed method less adaptive to rapidly changing noise conditions. Finally, the additional processing involved in the computation of the modulation spectrum for each acoustic frequency bin adds to the computational expense of the ModSpecSub method.

In the next section, we propose to combine the ModSpecSub and MMSE algorithms in the acoustic STFT domain in order to reduce some of their unwanted effects and to achieve further improvements in speech quality. We would also like to emphasise that the phase spectrum

