A Fully Convolutional Neural Network Approach to End-to-End Speech Enhancement


THE COOPER UNION
FOR THE ADVANCEMENT OF SCIENCE AND ART
ALBERT NERKEN SCHOOL OF ENGINEERING

A Fully Convolutional Neural Network Approach to End-to-End Speech Enhancement

by Frank Longueira

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Engineering

April 16, 2018

Professor Sam Keene, Advisor

THE COOPER UNION
FOR THE ADVANCEMENT OF SCIENCE AND ART
ALBERT NERKEN SCHOOL OF ENGINEERING

This thesis was prepared under the direction of the Candidate's Thesis Advisor and has received approval. It was submitted to the Dean of the School of Engineering and the full Faculty, and was approved as partial fulfillment of the requirements for the degree of Master of Engineering.

Dean, School of Engineering                                   Date

Professor Sam Keene, Thesis Advisor                           Date

Acknowledgments

Thank you to Professor Sam Keene, for his inspiration, guidance, and support as advisor to this endeavor.

Thank you to Matthew Smarsch, for being a steadfast partner throughout my academic years and now co-worker. See you at work on Monday.

Thank you to Christopher Curro, for his inspiration, support, and vast knowledge in the field of deep learning.

Thank you to The Cooper Union's Electrical Engineering & Mathematics Departments, for providing me with a logical framework for maneuvering through life and the desire to teach others what has been taught to me.

Thank you to my family, for being a constant source of support and encouragement throughout my life.

Thank you to Starbucks, for their coffee, Wi-Fi, and unlimited refills.

Thank you to Peter Cooper, for his open mind, practicality, and generosity that has given myself and many others the opportunity to study free of financial burden. His life has provided me with a model for rising to intellectual, financial, and social prominence from humble means.

Abstract

Speech enhancement seeks to improve the quality of speech degraded by noise. Its importance can be found in applications such as mobile phone communication, speech recognition, and hearing aids. An example of speech enhancement relates to the famous cocktail party problem. This problem deals with extracting a target speaker's voice from a mixture of background conversations. In such a situation, the human brain tends to do a good job of focusing in on the target speech while blocking out the noisy environment surrounding it. The goal of solving the cocktail party problem is to find a computer algorithm that functionally mimics how the brain extracts the target speaker's voice. In this master's thesis, a novel approach to solving the cocktail party problem is presented that relies on a fully convolutional neural network (FCN) architecture. The FCN takes noisy, raw audio data as input and performs nonlinear filtering operations to produce clean, raw audio data of the target speech at the output. Results from experimentation indicate the ability to generalize to new speakers and robustness to new noise environments of varying signal-to-noise ratios.

Contents

1 Introduction
2 Background
  2.1 Speech & Signal Processing Fundamentals
    2.1.1 Basics of Speech
    2.1.2 Time-Dependent Fourier Analysis
    2.1.3 Signal-to-Noise Ratio (SNR)
    2.1.4 Filtering
    2.1.5 Overlap-Add Method of Reconstruction
  2.2 Traditional Speech Enhancement Methods
    2.2.1 Spectral Subtraction
    2.2.2 Wiener Filter
    2.2.3 Ideal Binary Mask Estimation
    2.2.4 Performance Evaluation Measures (PESQ, STOI, WER)
  2.3 Machine Learning
    2.3.1 Definition
    2.3.2 Example: Linear Regression
    2.3.3 Unsupervised vs. Supervised Learning
    2.3.4 Overfitting vs. Underfitting
    2.3.5 Regularization
    2.3.6 Cross-Validation
    2.3.7 Principle of Maximum Likelihood
    2.3.8 Bias-Variance Tradeoff
    2.3.9 Bayesian Inference
  2.4 Deep Learning
    2.4.1 Motivation
    2.4.2 Deep Feedforward Networks
    2.4.3 Convolutional Neural Networks
    2.4.4 Gradient-based Optimization
    2.4.5 Regularization & Early Stopping
    2.4.6 Batch Normalization
3 A Fully Convolutional Neural Network Approach
  3.1 Motivation
  3.2 System Design
  3.3 Testing Generalization on the Same Speaker
  3.4 Testing Generalization on a New Speaker
4 Conclusions & Future Work
References
A System Design: Top 13 FCN Architectures
B Python Code
  B.1 audio_preprocessing.py
  B.2 cnn_model.py
  B.3 main.py

List of Figures

2.1 Spectrogram of the spoken words "nineteenth century" [19]
2.2 A diagram of a general LTI system [22]
2.3 A diagram depicting the PESQ model [25]
2.4 A diagram depicting the STOI model [26]
2.5 An example WER alignment and calculation [27]
2.6 An example of underfitting/overfitting [28]
2.7 Varying the regularization parameter and its effect on the model that is fit [28]
2.8 Graph of the rectified linear unit (ReLU) [28]
2.9 An example of a convolutional layer [28]
2.10 Plot of learning curves showing early stopping [28]
3.1 Relationship between network depth and validation loss
3.2 Clean (LEFT), noisy (CENTER), and filtered (RIGHT) spectrograms of 10 seconds of the new speaker's speech at 0 dB

List of Tables

3.1 Data collection & splitting
3.2 Relationship between number of filters and validation loss
3.3 PESQ & WER for top 13 FCN architectures based on validation loss
3.4 Performance of Models #53 and #71 across 0 dB and -5 dB
3.5 A layer-by-layer description of Model #53's FCN architecture
3.6 PESQ of speech enhancement system tested on the same speaker across multiple SNRs
3.7 WER of speech enhancement system tested on the same speaker across multiple SNRs
3.8 PESQ of speech enhancement system trained on one speaker and tested on a new speaker across multiple SNRs
3.9 WER of speech enhancement system trained on one speaker and tested on a new speaker across multiple SNRs
3.10 PESQ of speech enhancement system trained on one speaker, fine-tuned on a new speaker, and tested on that new speaker across multiple SNRs
3.11 WER of speech enhancement system trained on one speaker, fine-tuned on a new speaker, and tested on that new speaker across multiple SNRs

Chapter 1

Introduction

One of the largest issues facing hearing impaired individuals in their day-to-day lives is accurately recognizing speech in the presence of background noise [1]. While modern hearing aids do a good job of amplifying sound, they do not do enough to increase speech quality and intelligibility. This is not a problem in quiet environments, but a standard hearing aid that simply amplifies audio will fail to provide the user with a signal they can easily understand when the user is in a noisy environment [2]. The problem of speech intelligibility is even more difficult if the background noise is also speech, such as in a bar or restaurant with many patrons.

While people without hearing impairments usually have no trouble focusing on a single speaker out of multiple, it is a much more difficult task for people with a hearing impairment [3]. The problem of picking out one person's speech in an environment with many speakers was dubbed the cocktail party problem in a paper by Colin Cherry, published in 1953 [4]. The paper asserts that humans are normally capable of separating multiple speakers and focusing on a single

one. However, hearing impaired individuals may have issues when it comes to performing this same task. A solution to the cocktail party problem would be an algorithm that a computer can employ in real-time to enhance speech corrupted by babble (background noise from other speakers). Traditionally, the cocktail party problem has been approached using several different techniques, such as microphone arrays, monaural algorithms involving signal processing techniques, and Computational Auditory Scene Analysis (CASA) [1].

Modern hearing aids incorporate the microphone array strategy. They use beamforming to amplify sound coming from a specific direction (the simplest algorithms assume directly in front of the user) and attenuate the sound coming from elsewhere [5]. This technique comes with several drawbacks. In order for it to work, the speech the user is trying to focus on must come from a different direction than the noise. Difficulty will also arise when the source of the speech changes location.

Monaural algorithms use a single microphone and so are not dependent on the location of the speech source and the noise. These algorithms attempt to estimate the clean speech signal after a statistical analysis of the speech and noise. Traditional monaural algorithms include spectral subtraction and Wiener filtering [8] - [9]. Spectral subtraction removes the estimated power spectral density of the noise signal from the power spectral density of the noisy speech. Wiener filtering estimates the clean speech signal by employing an optimal LTI filter in the mean-squared-error sense based on stochastic process assumptions on the noisy input signal. If the background noise is also speech, as in the cocktail party problem, these types of filtering techniques have difficulty extracting the target speech. This

difficulty arises due to the speech of different human speakers occupying overlapping frequency ranges in the frequency domain. While traditional monaural strategies have been shown to improve speech quality, they have struggled with improving speech intelligibility for human listeners [6].

Computational Auditory Scene Analysis (CASA) has shown some promising results using ideal binary time-frequency masks to hide regions of the speech mixture where the SNR is below a certain threshold [7]. However, this method of separating speech from noise requires prior knowledge of both, as the mask is created based off of the relative strengths of the speech signal and the noise. This strategy also faces difficulty if the noise and target speech occupy similar frequency ranges, as is the case with babble noise.

More recent studies in speech enhancement related to the cocktail party problem fall in the domain of deep learning. With the advent of big data, more memory, and increased processing power, deep learning has completely revolutionized many domains such as speech recognition and object recognition. Deep neural networks are able to learn complex, nonlinear representations of data that tend to far exceed human-crafted features. Deep learning approaches to the cocktail party problem tend to take noisy spectrograms as input and transform them into clean spectrograms. The use of deep convolutional neural networks and deep denoising autoencoders on spectrograms has proven to be a powerful technique in practice [10]. One drawback to the use of spectrograms as input is that computing them is relatively expensive, since the short-time Fourier transform has to be applied to the raw audio data. This prior computation before inputting into the network takes time and hence makes use in real-time applications more difficult. In addition, phase

information of the input speech tends to be lost in many of these approaches, since only the magnitude spectrum is used. This can cause degradation in quality at the output of the system [11].

This master's thesis is motivated by the deep learning community's recently focused efforts on end-to-end speech enhancement systems that take the raw time-domain audio signal as input instead of frequency-domain features [12] - [14]. The approach described in this thesis involves the use of a fully convolutional neural network (FCN) applied to raw audio data and is motivated by prior work in the area [15]. The approach builds upon the work of [16], which shows that pooling layers may not be necessary for audio processing tasks. The proposed FCN-based algorithm is advantageous for several reasons when it comes to a solution to the cocktail party problem. One reason is that an FCN can be viewed as performing filtering directly in the time domain, and the key idea is that the FCN can learn optimal, nonlinear filters for the given task. In addition, an FCN by definition has no fully connected layers and generally does a better job of maintaining local temporal correlations in the audio signal from input to output [15]. Lastly, an FCN will generally have far fewer parameters than other, correspondingly similar deep neural networks due to parameter sharing. This allows for less memory usage and quicker computation, which is ideal for real-time applications. Before reviewing the results of this approach, the pages that follow review the necessary background and present an overview of the system.

Chapter 2

Background

2.1 Speech & Signal Processing Fundamentals

This section will go over some fundamental information related to speech and signal processing. First, the basics of speech will be reviewed and a simple way of modeling speech is presented. Next, a discussion of time-dependent Fourier analysis will take place. Time-dependent Fourier analysis is used in many practical speech enhancement applications. After this, a measurement of speech degradation by background noise called the signal-to-noise ratio (SNR) will be discussed. Finally, this background section ends with an introduction to filtering signals and a robust algorithm for perfectly reconstructing a time-domain signal after it has been processed by a system.

2.1.1 Basics of Speech

Speech is produced by excitation of an acoustic tube called the vocal tract. There are three basic classes of speech sounds:

- Voiced sounds: periodic pulses of airflow excite the vocal tract
- Fricative sounds: produced by constricting the vocal tract somewhat and forcing air through
- Plosive sounds: pressure is built up by completely closing off the vocal tract and is then released

Speech can be modeled as the response of an LTI system, namely the vocal tract [17]. The vocal tract transmits excitations (vibrations) generated in the larynx to the mouth. In normal speech, the vocal tract tends to change shape slowly with time and imposes its characteristic frequencies, called formants, on the excitation traveling through it. Through this view, the vocal tract is a slow, time-varying filter and a speech signal can be expressed mathematically as

s(t) = e(t) * v(t)    (2.1)

where s(t) is the speech signal, e(t) is the excitation signal, v(t) is the impulse response corresponding to the vocal tract, and * denotes convolution.

In a statistical sense, speech is a non-stationary signal. This means that the statistics of speech generally change over time. When speech is viewed on the time-scale of 10 - 40 ms, the statistics can be assumed to be relatively constant and Fourier analysis can be applied [18]. The frequency content of speech is generally below 8 kHz, which implies that the sampling rate used in speech applications does not need to be higher than 16 kHz. In fact, digital telephone communication systems have used sampling rates of 8 kHz without loss of intelligibility [18].
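As a toy illustration of the source-filter model in Equation 2.1 (this sketch is not part of the original work; the pitch, resonance frequency, and decay rate are arbitrary values chosen purely for demonstration), a crude voiced sound can be synthesized in Python by convolving a periodic pulse-train excitation with a short decaying resonance standing in for the vocal tract:

    import numpy as np

    fs = 16000                                   # sampling rate in Hz (an assumed value)

    def voiced_excitation(f0_hz, dur_s, fs):
        """Toy excitation e(t): a train of unit impulses, one per pitch period."""
        e = np.zeros(int(dur_s * fs))
        e[::int(fs / f0_hz)] = 1.0
        return e

    # Toy "vocal tract" impulse response v(t): a single decaying resonance.
    t = np.arange(0, 0.02, 1.0 / fs)
    v = np.exp(-200.0 * t) * np.sin(2.0 * np.pi * 700.0 * t)

    # Source-filter model of Equation 2.1: s(t) = e(t) * v(t), with * denoting convolution.
    s = np.convolve(voiced_excitation(120.0, 0.5, fs), v)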

2.1.2 Time-Dependent Fourier Analysis

Non-stationary signals, such as speech, have statistics (i.e. properties such as amplitude and frequency) that change over time. A useful representation of these types of signals is called the spectrogram [18]. A spectrogram provides a time-frequency representation of a signal by using a mathematical tool called the short-time Fourier transform:

X[n, \omega] = \sum_{m=-\infty}^{\infty} x[n + m] \, w[m] \, e^{-j \frac{2\pi}{N} \omega m}    (2.2)

where x[n] is a discrete signal with N points, w[m] is a windowing sequence generally of shorter length than x[n], n is a discrete-valued variable representing time, and \omega is a discrete variable representing frequency.

For discrete signals, the short-time Fourier transform (STFT) can be interpreted as a sliding (through time) discrete Fourier transform (DFT) applied to windowed chunks of the signal. For each windowed chunk of the signal, the DFT extracts frequency information. A windowing sequence is used to break the signal up into "pieces" and ensure smooth transitions in frequency information through time. A popular windowing sequence used in practice is called the Hanning window, defined as:

w[n] = \begin{cases} 0.5 - 0.5 \cos\left(\frac{2\pi n}{M}\right), & 0 \le n \le M \\ 0, & \text{otherwise} \end{cases}    (2.3)

A spectrogram plots the magnitude of X[n, \omega] across time and across frequency in a 2-D representation. The value of the magnitude response is represented by various colors in this 2-D representation (white generally representing higher magnitudes, black representing lower magnitudes). For DFT application on real-valued finite discrete signals, the discrete-valued frequency variable, \omega, uniquely and exhaustively describes all frequency content of the input signal when viewed on the domain {0, 1, 2, ..., N/2}. The reason for this is found in the study of discrete sampling theory, including the Nyquist-Shannon Sampling Theorem [18]. Conceptually, the idea is to treat finite real-valued signals as cyclical in time, and in order to represent the information present in the signal the sampling rate must be at least twice the maximum frequency present in the signal. These facts allow a spectrogram plot to have a finite frequency axis, as seen in the figure below.

Figure 2.1: Spectrogram of the spoken words "nineteenth century" [19]
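As a rough sketch of Equations 2.2 and 2.3 in Python (not code from this thesis; the function names, 400-sample window, and 200-sample hop are arbitrary choices corresponding to 25 ms and 12.5 ms at 16 kHz), a magnitude spectrogram can be built by sliding a Hanning-windowed DFT over the signal and keeping only the non-negative frequencies discussed above:

    import numpy as np

    def hanning_window(M):
        """Hanning window of Equation 2.3: w[n] = 0.5 - 0.5*cos(2*pi*n/M), 0 <= n <= M."""
        n = np.arange(M + 1)
        return 0.5 - 0.5 * np.cos(2.0 * np.pi * n / M)

    def magnitude_spectrogram(x, win_len=400, hop=200):
        """Sliding windowed DFT of Equation 2.2; one column of |X[n, w]| per frame."""
        w = hanning_window(win_len - 1)                  # window of length win_len
        frames = []
        for start in range(0, len(x) - win_len + 1, hop):
            segment = x[start:start + win_len] * w       # window one chunk of the signal
            frames.append(np.abs(np.fft.rfft(segment)))  # keep frequencies 0 .. N/2 only
        return np.stack(frames, axis=1)                  # shape: (frequency bins, time frames)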

2.1.3 Signal-to-Noise Ratio (SNR)

A common measure for quantifying the amount a signal has been degraded by the presence of background noise is called the signal-to-noise ratio (SNR) [18]:

\mathrm{SNR} = 10 \log_{10}\left(\frac{\sigma_x^2}{\sigma_e^2}\right)    (2.4)

where \sigma_x^2 represents the variance of the signal and \sigma_e^2 represents the variance of the background noise. The units of SNR are named decibels (dB). If a signal is in the presence of background noise such that the SNR is equal to 0 dB, this implies that the relative power of each is about equal. A positive SNR indicates the signal power is stronger than the noise power, while a negative SNR indicates the noise power is stronger than the signal power.
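The short Python helpers below are a minimal sketch of Equation 2.4 (not code from this thesis; the function names are invented here, sample variances stand in for the signal and noise powers, and equal-length arrays are assumed). The second helper scales a noise signal so that a speech-plus-noise mixture sits at a requested SNR, which is how mixtures at levels such as 0 dB or -5 dB can be constructed:

    import numpy as np

    def snr_db(signal, noise):
        """SNR in dB per Equation 2.4, using sample variances as power estimates."""
        return 10.0 * np.log10(np.var(signal) / np.var(noise))

    def mix_at_snr(speech, noise, target_snr_db):
        """Scale the noise so that speech + noise sits at the requested SNR (arrays of equal length)."""
        scale = np.sqrt(np.var(speech) / (np.var(noise) * 10.0 ** (target_snr_db / 10.0)))
        return speech + scale * noise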

2.1.4 Filtering

The concept of filtering in signal processing refers to the removal of unwanted frequency components from a signal. Commonly used filters in signal processing are found inside the class of linear time-invariant (LTI) systems. These filters are characterized entirely by their impulse response [18]. Specifically, the output signal can be expressed as a convolution of the filter's impulse response with the input signal. Many types of LTI filters exist, two popular ones being the ideal low-pass and ideal high-pass filters. The ideal low-pass filter is a system designed for removing frequency components above a specified cutoff frequency, while the ideal high-pass filter is a system designed for removing frequency components below a specified cutoff frequency. In practice, ideal filters are not realizable, but many approximations exist, such as Butterworth filters and Chebyshev filters [18].
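As a small example of such an approximation (again a sketch rather than anything prescribed above; the filter order, cutoff, and sampling rate are arbitrary), a Butterworth low-pass filter can be designed and applied with SciPy:

    import numpy as np
    from scipy.signal import butter, lfilter

    def butterworth_lowpass(x, cutoff_hz, fs=16000, order=4):
        """Approximate an ideal low-pass filter with an order-4 Butterworth design."""
        b, a = butter(order, cutoff_hz / (fs / 2.0), btype='low')  # cutoff normalized to Nyquist
        return lfilter(b, a, x)                                    # apply the LTI filter to the signal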

2.1.5 Overlap-Add Method of Reconstruction

In applications, such as speech enhancement and audio coding, where the input signal's time-dependent Fourier transform is modified, the overlap-add method of reconstruction provides a robust algorithm for perfectly reconstructing the output time-domain signal [18].

Suppose that R \le L \le N. The following decomposition can be expressed:

x_r[m] = x[rR + m] \, w[m] = \frac{1}{N} \sum_{k=0}^{N-1} X_r[k] \, e^{j \frac{2\pi}{N} k m}, \quad 0 \le m \le L - 1    (2.5)

where x[n] is an N-point signal, w[n] is an L-point windowing sequence, R represents the spacing between successive DFTs, and x_r[n] represents the r-th recovered windowed slice of the signal x[n]. If the following condition is assumed about the windowing sequence:

\sum_{r=-\infty}^{\infty} w[n - rR] = 1    (2.6)

then x[n] can be perfectly reconstructed by shifting the recovered segments to their original time locations and summing:

x[n] = \sum_{r=-\infty}^{\infty} x_r[n - rR]    (2.7)

An example of a windowing sequence that satisfies the above criterion is the Hanning window (discussed in Section 2.1.2) with length L = M + 1 and R = M/2.
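A minimal NumPy sketch of this procedure follows (illustrative only; the helper names and framing choices are not from this thesis). The signal is split into windowed slices as in Equation 2.5 and, assuming the window satisfies the condition of Equation 2.6, shifting the slices back to their original locations and summing as in Equation 2.7 recovers the original signal away from the edges:

    import numpy as np

    def windowed_segments(x, w, R):
        """Split x into windowed slices x_r[m] = x[rR + m] * w[m], as in Equation 2.5."""
        L = len(w)
        n_frames = 1 + (len(x) - L) // R
        return np.stack([x[r * R:r * R + L] * w for r in range(n_frames)])

    def overlap_add(segments, R):
        """Shift each slice back to its original time location and sum (Equation 2.7)."""
        n_frames, L = segments.shape
        x = np.zeros((n_frames - 1) * R + L)
        for r in range(n_frames):
            x[r * R:r * R + L] += segments[r]
        return x

    # With a Hanning window of length M + 1 and hop R = M/2 (the constant-overlap-add
    # condition of Equation 2.6), overlap_add(windowed_segments(x, w, R), R) matches x
    # except for the first and last half-frames.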

2.2 Traditional Speech Enhancement Methods

To get a better sense of the history of speech enhancement, this section will review a few traditional methods for removing background noise from a corrupted speech signal. These methods include spectral subtraction, Wiener filtering, and Ideal Binary Mask (IBM) estimation. In addition, a brief overview of popular evaluation metrics for speech enhancement systems will be presented. The metrics to be presented are perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and word error rate (WER).

2.2.1 Spectral Subtraction

One of the first techniques introduced in the field of speech enhancement is called spectral subtraction [21]. The main idea of spectral subtraction is to obtain an estimate of the magnitude spectrum of the background noise and subtract this estimate from the magnitude spectrum of the combined target speech and background noise. The final result of this computation is an estimate of the target speech's magnitude spectrum, which can be used to invert back into the time domain.

Suppose a target speech signal x[k] and statistically independent additive noise n[k]. Then speech corrupted by background noise, y[k], can be represented as follows:

y[k] = x[k] + n[k]    (2.8)

This implies the following in the short-time Fourier domain:

X[k, \omega] = Y[k, \omega] - N[k, \omega]    (2.9)

where X[k, \omega] is the STFT of x[k], Y[k, \omega] is the STFT of y[k], and N[k, \omega] is the STFT of n[k]. This can be equivalently expressed in polar form:

X[k, \omega] = |Y[k, \omega]| \, e^{j \phi_y(k, \omega)} - |N[k, \omega]| \, e^{j \phi_n(k, \omega)}    (2.10)

where \phi_y(k, \omega) is the phase of Y[k, \omega] and \phi_n(k, \omega) is the phase of N[k, \omega].

In practice, it can be shown that the noise-free phase can be estimated by the noisy phase, which implies:

\phi_y(k, \omega) \approx \phi_n(k, \omega)    (2.11)

This assumption leads to the following:

X[k, \omega] \approx (|Y[k, \omega]| - |N[k, \omega]|) \, e^{j \phi_y(k, \omega)}    (2.12)

Therefore, to obtain an estimate of the STFT of the target speech, \hat{X}[k, \omega], an estimate of the magnitude of the STFT of the noise, |\hat{N}[k, \omega]|, is required:

\hat{X}[k, \omega] = (|Y[k, \omega]| - |\hat{N}[k, \omega]|) \, e^{j \phi_y(k, \omega)}    (2.13)

\hat{X}[k, \omega] can finally be inverted back to the time domain with the help of the overlap-add method of reconstruction to recover an estimate of the target speech, \hat{x}[k].

In practice, |\hat{N}[k, \omega]| can be obtained by sampling the noise during pauses in the speech, computing the STFT of these samples, and then averaging the magnitude spectra across these sampled STFTs to obtain an estimate of |N[k, \omega]|. The main drawback of the spectral subtraction algorithm is the limited ability to obtain a precise estimate of |N[k, \omega]|. This is especially a problem for background noise that is non-stationary, such as the babble noise of the cocktail party problem. A poor estimate of |N[k, \omega]| will tend to cause errors in the subtraction step, which can result in remnant noise and speech distortion in the target speech estimate, \hat{x}[k].
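A compact Python sketch of this procedure is shown below (not code from this thesis; the function name, FFT size, and use of SciPy's STFT/ISTFT are choices made here for illustration). It averages the magnitude spectra of a noise-only segment to form |N̂[k, ω]|, subtracts it from the noisy magnitude spectrum, clips negative values to zero, and reattaches the noisy phase before inverting as in Equation 2.13:

    import numpy as np
    from scipy.signal import stft, istft

    def spectral_subtraction(noisy, noise_only, fs=16000, nfft=512):
        """Basic magnitude spectral subtraction following Equation 2.13 (a rough sketch)."""
        # Estimate |N[k, w]| by averaging magnitude spectra of a noise-only segment.
        _, _, N = stft(noise_only, fs=fs, nperseg=nfft)
        noise_mag = np.mean(np.abs(N), axis=1, keepdims=True)

        # STFT of the noisy speech: magnitude and phase.
        _, _, Y = stft(noisy, fs=fs, nperseg=nfft)
        mag, phase = np.abs(Y), np.angle(Y)

        # Subtract the noise estimate and clip negative magnitudes to zero.
        clean_mag = np.maximum(mag - noise_mag, 0.0)

        # Reattach the noisy phase and invert; istft performs the overlap-add step.
        _, x_hat = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=nfft)
        return x_hat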

2.2.2 Wiener Filter

In the study of LTI systems and filtering, a natural question arises pertaining to finding the minimum-mean-square-error (MMSE) filter of a wide-sense stationary (WSS) input process. This optimal MMSE filter is called the Wiener filter. The derivation characterizing the Wiener filter (in discrete time) is given below [22].

Suppose a WSS random process, x[n]. The goal is to determine the frequency response characterizing an LTI system, h[n], that outputs a WSS process \hat{y}[n] that is the minimum-mean-square-error (MMSE) estimate of some target process y[n] that is jointly WSS with x[n].

Figure 2.2: A diagram representing an input process, x[n], passing through an LTI system, h[n], that outputs an estimate \hat{y}[n] of the target process y[n] [22].

The error, e[n], between the filter's output, \hat{y}[n], and the target process, y[n], is defined as follows:

e[n] \triangleq \hat{y}[n] - y[n]    (2.14)

An optimization problem can be written down that is solved by finding the LTI filter's impulse response, h[n] (the Wiener filter), that satisfies the following criterion:

\underset{h[\cdot]}{\text{minimize}} \; E\{e^2[n]\}    (2.15)

First, the error criterion is expanded using the fact that the output of an LTI filter can be expressed as a convolution of its impulse response with the input signal:

\epsilon = E\left\{\left(\sum_{k=-\infty}^{\infty} h[k] \, x[n-k] - y[n]\right)^{2}\right\}    (2.16)

The goal is to choose the values of h[m] for all m that minimize this error criterion, \epsilon. Multivariate optimization is applied to minimize \epsilon by taking the partial derivative of \epsilon with respect to h[m] for each m and setting each of these expressions equal to zero:

\frac{\partial \epsilon}{\partial h[m]} = E\left\{2\left(\sum_{k} h[k] \, x[n-k] - y[n]\right) x[n-m]\right\} = 0    (2.17)

This implies the following:

R_{ex}[m] = E\{e[n] \, x[n-m]\} = 0 \quad \text{for all } m    (2.18)

By Equation 2.18 and the definition of orthogonality, it can be concluded that the error signal and the input signal are mutually orthogonal. This orthogonality condition can be equivalently re-written as follows:

R_{ex}[m] = E\{e[n] \, x[n-m]\} = E\{(\hat{y}[n] - y[n]) \, x[n-m]\} = R_{\hat{y}x}[m] - R_{yx}[m]    (2.19)

Combining the orthogonality condition stated in Equation 2.18 with Equation 2.19, the following statement is true:

R_{\hat{y}x}[m] = R_{yx}[m] \quad \text{for all } m    (2.20)

Equation 2.20 says that the optimal filter's estimate of the target process has a

cross-correlation with the input process that is equal to the cross-correlation of the target process with the input process. Since the estimate \hat{y}[n] is obtained by passing the input process x[n] through an LTI filter, the following convolution relationship applies:

R_{\hat{y}x}[m] = h[m] * R_{xx}[m]    (2.21)

Combining Equation 2.20 and Equation 2.21 implies:

R_{yx}[m] = h[m] * R_{xx}[m]    (2.22)

Taking the z-transform of both sides of Equation 2.22:

S_{yx}(z) = H(z) \, S_{xx}(z)    (2.23)

where S_{yx}(z) is the cross-spectral density of y[n] and x[n] and S_{xx}(z) is the power spectral density of x[n]. Therefore, the optimal LTI filter in the MMSE sense (the Wiener filter) is characterized by the following equation:

H(z) = \frac{S_{yx}(z)}{S_{xx}(z)}    (2.24)

The Wiener filter tends to perform better than spectral subtraction in practice, but it suffers from the fact that it is constrained to be a linear estimator. A linear estimator may not have enough complexity to remove highly non-stationary background noise.
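For speech enhancement, if the input is x[n] = s[n] + v[n] with clean speech s[n] as the target and noise v[n] uncorrelated with the speech, Equation 2.24 reduces to the gain S_ss/(S_ss + S_vv). The sketch below (not from this thesis; function names, the FFT size, and the oracle access to separate clean and noise signals are illustrative assumptions) estimates that gain with SciPy and applies it to every STFT frame of the noisy signal:

    import numpy as np
    from scipy.signal import welch, stft, istft

    def wiener_gain(clean, noise, fs=16000, nfft=512):
        """Oracle Wiener gain H(w) = S_ss / (S_ss + S_vv), a special case of Equation 2.24
        when the input is speech plus uncorrelated noise and the target is the clean speech."""
        _, S_ss = welch(clean, fs=fs, nperseg=nfft)
        _, S_vv = welch(noise, fs=fs, nperseg=nfft)
        return S_ss / (S_ss + S_vv + 1e-12)       # small constant avoids division by zero

    def wiener_enhance(noisy, gain, fs=16000, nfft=512):
        """Apply the frequency-domain gain to every STFT frame of the noisy signal."""
        _, _, Y = stft(noisy, fs=fs, nperseg=nfft)
        _, x_hat = istft(gain[:, None] * Y, fs=fs, nperseg=nfft)
        return x_hat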

2.2.3 Ideal Binary Mask Estimation

Another common technique in the field of speech enhancement is based on the concept of an Ideal Binary Mask (IBM) [21]. The idea of an IBM arises from a model of human auditory perception called Auditory Scene Analysis (ASA). ASA can be broken down into two stages. The first stage, called the segmentation stage, involves the decomposition of an input signal into time-frequency units (T-F units). An example of an input signal can be speech or any other type of sound that enters the human auditory system. After the segmentation stage is the second stage, called the grouping stage. The grouping stage involves grouping T-F units that are most likely to have been generated from the same source. This model, proposed by Albert Stanley Bregman in 1990, is theorized to model how the human auditory system separates sounds in an input signal mixture. ASA has inspired the field of Computational Auditory Scene Analysis (CASA). CASA's main focus is to find computational means of separating an input signal mixture similar to how a human does so [23]. In a typical CASA system, an input signal is first passed through a gammatone filter bank to generate a T-F representation that mimics the human auditory system. This T-F representation is called a cochleagram. The next goal in a typical CASA system is to use the cochleagram to separate an input signal mixture into groups. For speech enhancement, this process of separation brings up the concept of an Ideal Binary Mask. Put simply, an Ideal Binary Mask is a decision rule that determines whether a T-F unit in the T-F representation is dominated by the noise source or by the target speech. The IBM, H[n, \omega], is

defined as:

H[n, \omega] = \begin{cases} 1, & \text{if } \dfrac{\|X[n, \omega]\|^2}{\|N[n, \omega]\|^2} > \theta \\ 0, & \text{otherwise} \end{cases}    (2.25)

where \|X[n, \omega]\|^2 represents the energy in a speech T-F unit at position [n, \omega], \|N[n, \omega]\|^2 represents the energy in a noise T-F unit at position [n, \omega], and \theta is a threshold value. Conceptually, the IBM attempts to remove T-F units in which the noise signal's energy is higher than the speech signal's energy according to the threshold \theta. In theory, an IBM will preserve the T-F units that correspond to the target speech. In practice, however, one will not have direct access to both the target speech and noise sources, and therefore an IBM will need to be estimated. Machine learning techniques, such as support vector machines, can be used for this estimation.
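When the clean speech and noise are available separately, the oracle mask of Equation 2.25 is straightforward to compute, as in the sketch below (not from this thesis; it uses an STFT grid for the T-F units, whereas a typical CASA system would use a gammatone-based cochleagram, and the threshold is expressed in dB):

    import numpy as np
    from scipy.signal import stft

    def ideal_binary_mask(clean, noise, fs=16000, nfft=512, threshold_db=0.0):
        """Oracle IBM of Equation 2.25 computed on an STFT time-frequency grid."""
        _, _, X = stft(clean, fs=fs, nperseg=nfft)
        _, _, N = stft(noise, fs=fs, nperseg=nfft)
        # Local SNR per T-F unit; small constants guard against log of zero.
        local_snr_db = 10.0 * np.log10((np.abs(X) ** 2 + 1e-12) / (np.abs(N) ** 2 + 1e-12))
        return (local_snr_db > threshold_db).astype(np.float32)  # 1 keeps the unit, 0 removes it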
