
IOSR Journal of Engineering, Mar. 2012, Vol. 2(3), pp. 376-381, ISSN: 2250-3021

Speech Enhancement using Spectral Subtraction

N. Siddiah(1), T. Srikanth(2) and M. Venkatesh Varma(3)
(1) Dept. of E.C.E., Mekapati Raja Mohana Reddy Institute of Technology & Science, Udyagiri, A.P., India.
(2) Dept. of E.C.E., Kallam Haranadha Reddy Institute of Technology, Guntur, A.P., India.
(3) Dept. of E.C.E., Chalapathi Institute of Technology, Guntur, A.P., India.

Abstract

This paper addresses several problems associated with Automatic Speech Recognition (ASR) systems and studies a speech enhancement technique that could reduce the inefficiencies ASR systems encounter. Spectral Subtraction (SS) is an algorithm designed to reduce the degrading effects of noise acoustically added to speech signals. Our goal is to implement the SS algorithm to provide speech enhancement and to discover whether SS can improve the efficiency of ASR systems. This paper focuses on the removal of white noise from speech signals and attempts to explain how SS can improve ASR systems; the advantages of the SS method over other methods are also explored. As our day-to-day lives become more complicated, ASR provides a hands-free way to complete a variety of tasks by simply speaking. Used effectively, ASR systems can streamline most tasks and allow a user to complete them substantially faster. In addition, these systems can enhance the way the hearing impaired communicate, improve security, and provide authentication for many applications. For these and many more reasons, there is a clear need for adequate ASR systems and their integration into our everyday life.

Keywords: Speech Enhancement, Speech Recognition, Spectral Subtraction, Windowing techniques, Noise reduction.

I. INTRODUCTION

Many systems rely on automatic speech recognition (ASR) to carry out their required tasks.
When speech is used as the input to perform a task, it is important to ensure that background noise does not degrade, or ultimately completely inhibit, system performance. Spectral Subtraction (SS) is an algorithm designed to reduce the degrading effects of noise acoustically added to speech signals. With applications ranging from speech and language development in young children to aiding individuals with hearing impairments, ASR is becoming increasingly popular and the demand for efficient systems is evident. While humans are the best examples of ASR, the term usually refers to the process by which a computer recognizes and/or identifies spoken words. Although any task that involves interfacing with a computer can potentially use ASR, the most common applications at present are dictation, command and control, mobile and personal accessories, and medical or disability aids.

The spectral subtraction algorithm is historically one of the first algorithms proposed for noise reduction [1, 2], and is perhaps still one of the most popular. It is based on a simple principle: assuming additive noise, one can obtain an estimate of the clean signal spectrum by subtracting an estimate of the noise spectrum from the noisy speech spectrum. The noise spectrum can be estimated, and updated, during periods when speech is absent. The enhanced signal is obtained by computing the inverse discrete Fourier transform of the estimated clean spectrum combined with the phase of the noisy signal. The algorithm is computationally simple, since it involves only a forward and an inverse Fourier transform.

This simple subtraction comes at a price. The subtraction must be done carefully to avoid speech distortion: if too much is subtracted, some speech information may be removed, while if too little is subtracted, much of the interfering noise remains. Many methods have been proposed to alleviate, and in some cases eliminate, the speech distortion introduced by the spectral subtraction process [3]. Some suggested over-subtracting estimates of the noise spectrum and spectrally flooring (rather than zeroing) negative values [4]. Others suggested dividing the spectrum into a few contiguous frequency bands and applying different non-linear rules in each band [5, 6]. Yet others suggested using a psychoacoustic model to adjust the over-subtraction parameters so as to render the residual noise inaudible [7].

The derivation of the spectral subtraction equations is based on the assumption that the cross terms involving the phase difference between the clean signal and the noise are zero. The cross terms are assumed to be zero because the speech signal is uncorrelated with the interfering noise. Several attempts have been made to take into account, or otherwise compensate for, the cross terms in spectral subtraction [8, 9, 10]. The study in [10] evaluated the effect of neglecting the cross terms on speech recognition performance.

This paper focuses on present-day problems associated with speech recognition, especially the removal of white noise from speech signals, and attempts to explain how SS can improve ASR systems. White noise is a type of noise produced by combining sounds of all frequencies. Because it contains all frequencies, white noise can drown out or mask other sounds that may carry information needed as input to an ASR system. If a reasonable estimate of the white noise contained in a given speech signal can be obtained and removed, we should see an improvement in speech quality and in the efficiency of most ASR systems.
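As a sketch of this principle, the following minimal example (ours, not the authors' code; it assumes numpy) subtracts a noise magnitude estimate from a single frame's spectrum, reuses the noisy phase, and inverts the transform:

```python
import numpy as np

def spectral_subtract_frame(noisy, noise_mag):
    """Subtract a noise magnitude estimate from one noisy frame's spectrum.

    noisy     : 1-D array, one time-domain frame of noisy speech
    noise_mag : estimated magnitude spectrum of the noise (rfft bins)
    Returns the enhanced time-domain frame.
    """
    spectrum = np.fft.rfft(noisy)
    mag = np.abs(spectrum)
    phase = np.angle(spectrum)                    # noisy phase is reused unchanged
    clean_mag = np.maximum(mag - noise_mag, 0.0)  # negative values zeroed out
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=len(noisy))
```

With a zero noise estimate the frame passes through unchanged, and subtracting the frame's own magnitude spectrum removes it entirely, which is the behaviour the principle predicts at the two extremes.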

Other methods have been used to reduce the amount of noise in speech signals. Noise-cancelling microphones, although essential in extremely high-noise environments such as a helicopter cockpit, offer little or no noise reduction above 1 kHz. Another of the most effective techniques for improving the robustness of speech recognition systems to additive noise is to train the acoustic models with data corrupted by noise at different signal-to-noise ratios (SNR). However, this method requires training data recorded in different environments, which may or may not be available in every situation.

II. SPEECH ENHANCEMENT TECHNIQUES

Speech enhancement (SE) covers the ways in which a degraded speech signal (e.g., corrupted by noise, interfering talkers, or band-limiting) can be processed to increase its intelligibility (the likelihood of being correctly understood) and/or its quality. There are three classes of SE methods, each with its own advantages and limitations:

1. Harmonic Filtering
2. Parametric Resynthesis
3. Spectral Subtraction

A. Harmonic Filtering

This method works only for voiced speech, requires an F0 estimate, and suppresses spectral energy between the desired harmonics. The harmonic SE method attempts to identify the F0 (and hence the harmonics) either of the desired speech or of the interfering sources. If the desired sound is the strongest component in the signal, its frequencies can be identified and the other frequencies suppressed; otherwise, a strong interfering sound's frequencies can be identified and suppressed, with the remaining frequencies presumably retaining some of the desired speech. Such simple Wiener filtering (suppressing wideband noise between harmonics) improves SNR but has little effect on intelligibility.

B. Parametric Resynthesis

This method adopts a specific speech production model (e.g., from low-rate coding) and reconstructs a clean speech signal based on the model, using parameter estimates from the noisy speech. The parametric resynthesis SE method improves speech signals by parametric estimation and speech resynthesis. Speech synthesizers generate noise-free speech from parametric representations of either a vocal tract model or previously analysed speech. Most synthesizers employ separate representations for vocal tract shape and excitation, coding the former with about 10 spectral parameters and the latter with estimates of intensity and periodicity (e.g., F0). Such synthesis suffers from the same mechanical quality found in low-rate speech coding, and from degraded parameter estimates (due to noise).

C. Spectral Subtraction (SS)

Spectral subtraction (SS) is an algorithm used to reduce the amount of noise acoustically added to a speech signal. In this method the noise power spectrum is subtracted from the noisy signal power spectrum. Even at negative signal-to-noise ratios (SNR) (i.e., more energy in the interference than in the desired speech), this method works well for both general noise and interfering speakers, although musical-tone or noise artifacts often occur at frame boundaries in the reconstructed speech. SS generally reduces noise power (improving quality), but often reduces intelligibility (especially at low SNR), due to suppression of weak portions of speech (e.g., high-frequency formants and unvoiced speech).

Segmenting the Data

The data from the signal are segmented and windowed such that, if the sequence is separated into half-overlapped data buffers, the sum of the windowed sequences adds back up to the original sequence. Windows of 10 ms of data were used in this analysis. Windowing is the multiplication of a speech signal S(n) by a window W(n), which yields a set of speech samples X(n) weighted by the shape of the window:

X(n) = S(n) W(n)

where
S(n) is the speech signal,
W(n) is the windowing function, and
X(n) is the windowed noisy speech signal.

W(n) may in principle have infinite duration, but most practical windows have finite length to simplify computation. Many applications prefer some speech averaging, to yield an output parameter contour (vs. time) that represents slowly varying physiological aspects of vocal tract movements.

Types of Windows:

1. Hamming window
2. Hanning window
3. Kaiser window
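The segmentation step above can be sketched as follows. This is an illustrative numpy implementation (function names are ours), using a periodic Hann window, for which half-overlapped windowed frames sum back to the original sequence away from the edges:

```python
import numpy as np

def segment(x, frame_len):
    """Split x into half-overlapped frames, each weighted by a Hann window.

    With a 50% hop, a periodic Hann window satisfies the constant
    overlap-add property, so summing the windowed frames back at their
    original offsets reproduces x (except near the signal edges).
    """
    hop = frame_len // 2
    win = np.hanning(frame_len + 1)[:-1]   # "periodic" Hann: COLA at 50% overlap
    n_frames = (len(x) - frame_len) // hop + 1
    return np.stack([x[i*hop : i*hop + frame_len] * win for i in range(n_frames)])

def overlap_add(frames):
    """Reassemble half-overlapped frames by summing them at their offsets."""
    n_frames, frame_len = frames.shape
    hop = frame_len // 2
    out = np.zeros((n_frames - 1) * hop + frame_len)
    for i, f in enumerate(frames):
        out[i*hop : i*hop + frame_len] += f
    return out
```

At a 16 kHz sampling rate, the paper's 10 ms windows would correspond to frame_len = 160; any even frame length preserves the add-back property.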

Fig. 1: Flow chart for spectral subtraction: X(k) -> FT -> compute magnitude -> subtract bias -> half-wave rectify -> residual noise reduction -> speech activity detection -> attenuate signal during non-speech activity -> IFFT -> S^.

Hamming Window

For the Hamming window the attenuation coefficient (alpha) is 0.54. At low frequencies the stop-band attenuation is high, so the ripples present in the stop band are larger than those of the Hanning window. The Hamming window produces ripples in both the pass band and the stop band of the filter.

Whm(n) = 0.54 + 0.46 cos(2*pi*n/(N-1)),  for -(N-1)/2 <= n <= (N-1)/2
       = 0,  otherwise

Fig. 2: Shape of Hamming Window

Hanning Window

For the Hanning window the attenuation coefficient (alpha) is 0.5. At high frequencies the stop-band attenuation is high and at low frequencies it is low, so the stop-band ripples are easier to eliminate than with the Hamming and Kaiser windows.

Whn(n) = 0.5 + 0.5 cos(2*pi*n/(N-1)),  for -(N-1)/2 <= n <= (N-1)/2
       = 0,  otherwise

Fig. 3: Shape of Hanning Window

Kaiser Window

For the Kaiser window the attenuation coefficient alpha takes values such as 0, 5.4414 and 8.885. At alpha = 0 the Kaiser window becomes the rectangular window; at alpha = 5.4414 it approximates the Hamming window; and at alpha = 8.885 it approximates the Blackman window.

Fig. 4: Shape of Kaiser Window

Fourier Transform

Let a windowed speech signal and a noise signal be represented by s(k) and n(k) respectively. Their sum is denoted by x(k):

x(k) = s(k) + n(k)        (1)

Taking the Fourier transform of both sides gives
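The window formulas above can be checked numerically. This sketch (ours) implements the centered Hamming and Hanning definitions and compares them with numpy's equivalents, which are the same shapes indexed from 0 to N-1; it also checks that a Kaiser window with alpha = 0 reduces to the rectangular window:

```python
import numpy as np

def hamming_centered(N):
    """Whm(n) = 0.54 + 0.46*cos(2*pi*n/(N-1)), -(N-1)/2 <= n <= (N-1)/2."""
    n = np.arange(N) - (N - 1) / 2.0       # centered index as in the paper
    return 0.54 + 0.46 * np.cos(2 * np.pi * n / (N - 1))

def hanning_centered(N):
    """Whn(n) = 0.5 + 0.5*cos(2*pi*n/(N-1)) on the same centered support."""
    n = np.arange(N) - (N - 1) / 2.0
    return 0.5 + 0.5 * np.cos(2 * np.pi * n / (N - 1))
```

Shifting the index by (N-1)/2 turns the +cos form into the familiar 0.54 - 0.46*cos(2*pi*k/(N-1)) form, so both definitions describe the same window.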

X(e^jw) = S(e^jw) + N(e^jw)        (2)

where X(e^jw) is the Fourier transform of x(k):

X(e^jw) = sum_{k=0}^{L-1} x(k) e^{-jwk}        (3)

Compute Noise Spectrum Magnitude

To obtain an estimate of the noise spectrum, the magnitude |N(e^jw)| of N(e^jw) is replaced by its average value mu(e^jw), taken over the regions estimated as "noise only". For this analysis the first 50 ms were used as the noise-only region. The phase theta_N(e^jw) of N(e^jw) is replaced by the phase theta_x(e^jw) of X(e^jw), since the two signals are assumed to have the same delay. Through manipulation and substitution in equation (2) we obtain the spectral subtraction estimator S^(e^jw):

S^(e^jw) = [ |X(e^jw)| - mu(e^jw) ] e^{j theta_x(e^jw)}        (4)

The error that results from this estimator is given by

eps(e^jw) = S^(e^jw) - S(e^jw) = N(e^jw) - mu(e^jw) e^{j theta_x(e^jw)}        (5)

To reduce this error, local averaging is used, because eps(e^jw) is simply the difference between N(e^jw) and its mean mu. Therefore X(e^jw) is replaced with the average

Xbar(e^jw) = (1/M) sum_{i=0}^{M-1} X_i(e^jw)

where X_i(e^jw) is the i-th time-windowed transform of x(k). By substitution in equation (4) we have

S^_A(e^jw) = [ |Xbar(e^jw)| - mu(e^jw) ] e^{j theta_x(e^jw)}        (6)

The spectral error is now approximately

eps(e^jw) = S^_A(e^jw) - S(e^jw) ~= Nbar(e^jw) - mu(e^jw)        (7)

where

Nbar(e^jw) = (1/M) sum_{i=0}^{M-1} N_i(e^jw)

Thus the sample mean of N(e^jw) will converge to mu(e^jw) as a longer average is taken. It has also been noted, however, that averaging over more than three half-overlapped frames will weaken intelligibility. The reason is that the noise magnitude estimate has been assumed to stay constant throughout, and by under-averaging we take less risk of removing important speech information.

Half-Wave Rectification

For frequencies where |Xbar(e^jw)| is less than mu(e^jw), the estimator S^(e^jw) becomes negative; the output at these frequencies is therefore set to zero. This is half-wave rectification. Its advantage is that the noise floor is reduced by mu(e^jw). However, when the speech-plus-noise magnitude is less than mu(e^jw), rectification leads to an incorrect removal of speech information and a possible decrease in intelligibility.

Residual Noise Reduction

While half-wave rectification zeroes out the speech-plus-noise components below mu(e^jw), components above mu(e^jw) still remain. When no speech is present in a given frame, the difference between N and mu e^{j theta_n}, called the noise residual, manifests itself as randomly spaced narrow bands of magnitude spikes. Once the signal is transformed back into the time domain, these spikes sound like the sum of tone generators with random frequencies, a phenomenon known as the "musical noise" effect. Because the magnitude spikes fluctuate from frame to frame, the audible effects of the noise residual can be reduced by replacing the current value in each frame with the minimum value chosen from the adjacent frames.

The motivation behind this replacement scheme is threefold. First, if the amplitude of S^(e^jw) lies below the maximum noise residual and varies radically from frame to frame, there is a high probability that the spectrum at that frequency is due to noise; it is therefore suppressed by taking the minimum. Second, if S^(e^jw) lies below the maximum but has a nearly constant value, there is a high probability that the spectrum at that frequency is due to low-energy speech, and taking the minimum retains the information. Third, if S^(e^jw) is greater than the maximum, speech is present at that frequency, and removing the bias is sufficient. Residual noise reduction is implemented as:

|S^_i(e^jw)| = min { |S^_j(e^jw)| : j = i-1, i, i+1 }   for |S^_i(e^jw)| < max |N_R(e^jw)|

where max |N_R(e^jw)| is the maximum noise residual measured during non-speech activity.

Attenuate Signal During Non-Speech Activity

The amount of energy in S^(e^jw) relative to mu(e^jw) gives an indication of the presence of speech activity within a given analysis frame. Empirically, it was determined that the average (before versus after) power ratio was down at least 12 dB, which offered an estimate for detecting the absence of speech:

T = 10 log10 [ (1/2pi) integral_{-pi}^{pi} |S^(e^jw)| / mu(e^jw) dw ]

If T was less than -12 dB for a particular frame, the frame was classified as containing no speech and attenuated by a factor c, where 20 log10(c) = -30 dB. An attenuation of -30 dB was found to be a reasonable, though not optimal, amount.
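The magnitude-domain steps derived above (bias removal, half-wave rectification, residual noise reduction and non-speech attenuation) can be sketched together as follows. This is an illustrative numpy implementation under our own naming and simplifications, not the authors' code; the noise mean mu is taken from the leading noise-only frames, as in the paper:

```python
import numpy as np

def boll_enhance(frames_mag, noise_frames=5, atten_db=-30.0, thresh_db=-12.0):
    """Spectral-subtraction magnitude processing, Boll-style.

    frames_mag   : (n_frames, n_bins) magnitude spectra |X_i(e^jw)| of the
                   noisy speech
    noise_frames : number of leading frames assumed noise-only (the paper
                   uses the first 50 ms)
    Returns enhanced magnitude spectra; the noisy phase would be reused
    for resynthesis via the IFFT.
    """
    mu = frames_mag[:noise_frames].mean(axis=0)       # noise mean mu(e^jw)

    # Bias removal + half-wave rectification (Eq. 4, negatives -> 0).
    s_hat = np.maximum(frames_mag - mu, 0.0)

    # Maximum noise residual, measured on the noise-only frames.
    nr_max = np.max(np.abs(frames_mag[:noise_frames] - mu), axis=0)

    # Residual noise reduction: where S^_i stays below the maximum residual,
    # replace it by the minimum over the adjacent frames (i-1, i, i+1).
    out = s_hat.copy()
    for i in range(1, len(s_hat) - 1):
        mins = np.min(s_hat[i-1:i+2], axis=0)
        out[i] = np.where(s_hat[i] < nr_max, mins, s_hat[i])

    # Non-speech attenuation: T = 10*log10(mean(S^/mu)); frames with
    # T < -12 dB are scaled by c, where 20*log10(c) = -30 dB.
    c = 10.0 ** (atten_db / 20.0)
    ratio = np.mean(out / np.maximum(mu, 1e-12), axis=1)
    T = 10.0 * np.log10(np.maximum(ratio, 1e-12))
    out[T < thresh_db] *= c
    return out
```

A frame whose magnitude sits well above mu keeps its bias-removed value, while a frame at the noise level is driven to zero and then attenuated by the non-speech rule.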

The output of the spectral estimator, including signal attenuation, is given by:

S^(e^jw) = S^(e^jw),     if T >= -12 dB
         = c Xbar(e^jw), if T < -12 dB

III. RESULTS

Fig. 5: Time waveform of the speech utterance. This is the input signal to be enhanced. Its frequency range is typically 300 Hz to 4000 Hz, which lies within the audio range, and it contains five speech segments within a 50 ms time duration. Time (ms) is plotted on the X-axis and amplitude (mV) on the Y-axis.

Fig. 6: Average noise magnitude of the speech utterance.

Fig. 7: Time waveform of a single noisy speech segment. This segment, of 10 ms duration, is taken from the noisy input signal as an example of the enhancement process, before the full input signal is applied to spectral subtraction.

Fig. 8: Time waveform of the enhanced speech segment. This is the output of spectral subtraction when the single speech segment is applied as the input. Time (ms) is plotted on the X-axis and amplitude (mV) on the Y-axis; the duration of the signal is 10 ms.

Fig. 9: Spectrum of the noisy speech signal |Xbar(e^jw)|. A spectrum is the waveform of signal magnitude with respect to frequency; this one is obtained after the Fourier transform of the noisy input. Frequency (Hz) is plotted on the X-axis and magnitude (dB) on the Y-axis.

Fig. 10: Average noise magnitude mu(e^jw). This is the spectrum of the noise estimate obtained from the noisy speech signal. Frequency (Hz) is plotted on the X-axis and magnitude (dB) on the Y-axis.
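The noisy test inputs described in this section (speech plus additive white noise) can be generated as in the following sketch. This is our illustration, not the authors' setup: it assumes Gaussian white noise and scales it to a chosen SNR so that the improvement from enhancement can be measured against a known reference:

```python
import numpy as np

def add_white_noise(speech, snr_db_target, seed=0):
    """Corrupt a clean signal with additive white Gaussian noise.

    White noise spreads its energy over all frequencies, which is the
    noise model this paper removes; snr_db_target fixes the ratio of
    speech power to noise power in dB.
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(len(speech))
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale the noise so the realised SNR equals the target exactly.
    noise *= np.sqrt(p_speech / (p_noise * 10 ** (snr_db_target / 10)))
    return speech + noise

def snr_db(clean, noisy):
    """Measured SNR in dB between a clean reference and its noisy version."""
    err = noisy - clean
    return 10 * np.log10(np.mean(clean ** 2) / np.mean(err ** 2))
```

Comparing snr_db before and after enhancement gives a simple objective counterpart to the waveform comparisons shown in the figures.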

Fig. 11: Spectrum of the enhanced speech S^(e^jw).

Fig. 12: Time waveform of the speech utterance after SS. This is the speech utterance after bias removal, half-wave rectification, frame averaging, residual noise reduction, non-speech attenuation and signal reconstruction. Time (ms) is plotted on the X-axis and amplitude (mV) on the Y-axis.

Fig. 13: Waveforms for the Hamming and rectangular window techniques.

The enhanced speech signals were played back and demonstrated considerable improvement over the original signals. Some problems were encountered during the speech activity detection step: the algorithm detected only the first five and the last two frames as containing no speech, and all other frames were found to contain speech information. This is an extremely low number of frames classified as no-speech and was quite unexpected. In addition, due to the randomly spaced narrow bands of noise residual, the final results exhibited the phenomenon known as the musical noise effect.

IV. CONCLUSION

SS, a noise removal algorithm, has been successfully implemented and tested. Sufficient estimates of the noise spectra were determined from the initial noise-only portion of the noisy speech signals and effectively removed throughout the signal to produce enhanced speech. Overall, the results display a considerable improvement in the quality of the speech signals, which should increase the performance of ASR systems.

REFERENCES

[1] Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. ASSP-27 (2), 113–120.
[2] Study and the development of the INTEL technique for improving speech intelligibility. Technical Report NSC-FR/4023, Nicolet Scientific Corporation.
[3] Speech Enhancement: Theory and Practice. CRC Press LLC, Boca Raton, FL.
[4] Enhancement of speech corrupted by acoustic noise. In: Proc. IEEE Internat. Conf. on Acoustics, Speech, and Signal Processing, pp. 208–211.
[5] A multi-band spectral subtraction method for enhancing speech corrupted by colored noise. In: Proc. IEEE Internat. Conf. on Acoustics, Speech, and Signal Processing.
[6] Experiments with a Non-linear Spectral Subtractor (NSS), Hidden Markov Models and the projection, for robust speech recognition in cars. Speech Commun. 11 (2–3), 215–228.
[7] Single channel speech enhancement based on masking properties of the human auditory system. IEEE Trans. Speech Audio Process. 7 (3), 126–137.
[8] Improving performance of spectral subtraction in speech recognition using a model for additive noise. IEEE Trans. Speech Audio Process. 6 (6), 579–582.
[9] Evaluation of spectral subtraction with smoothing of time direction on the AURORA 2 task. In: Proc. Internat. Conf. Spoken Language Processing, pp. 477–480.
[10] An assessment of the fundamental limitations of spectral subtraction. In: Proc. IEEE Internat. Conf. on Acoustics, Speech, and Signal Processing, Vol. I, pp. 145–148.
