Evaluation of a Method for Separating Digitized Duet Signals*

ROBERT C. MAHER
Department of Electrical Engineering, University of Nebraska-Lincoln, Lincoln, NE 68588-0511, USA

A new digital signal-processing method is presented for separating two monophonic musical voices in a digital recording of a duet. The problem involves time-variant spectral analysis, duet frequency tracking, and composite signal separation. Analysis is performed using a quasi-harmonic sinusoidal representation based on short-time Fourier transform techniques. The performance of this approach is evaluated using real and artificial test signals. Applications include background noise reduction in live recordings, signal restoration, musicology, musique concrete, and digital editing, splicing, or other manipulations.

* Manuscript received 1990 January 29; revised 1990 June 21.

0 INTRODUCTION

Separation of superimposed signals is a problem of interest in audio engineering. For example, it would often be useful to identify and remove undesired interference (such as audience or traffic noise) present during a live recording. Other examples include separation and replacement of errors in a recorded musical performance, separation of two simultaneous talkers in a single communications channel, or even adjustment of the level imbalance occurring when one musician in an ensemble briefly turns away from the microphone.

Considered in this paper is a digital signal-processing approach to one aspect of the ensemble signal separation problem: separation of musical duet recordings. The primary goal of this project was to develop and evaluate an automatic signal separation system based primarily on physical measurements rather than psychoacoustic models of human behavior.

In order to separate the desired and undesired signals we must resort to prior knowledge of some aspect of the superimposed signals, whereby a set of separation criteria may be identified. If two interfering signals occupy nonoverlapping frequency bands, for example, the separation problem can be solved by using frequency-selective filters. In other cases the competing signals may be described in a statistical sense, allowing separation using correlation or a nonlinear detection method. However, most superimposed signals, such as two musical instruments playing simultaneously, do not allow for such elementary decomposition methods, and other strategies applicable for signal separation must be discovered.

In the case of ensemble music, sounds emanating from different musical instruments are combined in an acoustic signal, which may be recorded via a transducer of some kind. Despite the typical complexity of the recorded ensemble signal, a human listener can usually identify the instruments playing at a given point in time. Further, a listener with some musical training or experience can often reliably transcribe each musical voice in terms of standard musical pitch and rhythm. Unfortunately the methods and strategies used by human observers are not introspectable and thus cannot serve easily as models for automatic musical transcription or signal separation systems.

In order to put the current work into perspective, the narrative portion of this paper begins with a review of the approach and methods used in this investigation. The separation procedure is described next, followed by a critical evaluation of the results and a concluding section concerning the successes, failures, and future prospects of this research.

1 REVIEW OF APPROACH AND METHODS

Previous work related to the signal separation problem considered here has been primarily in two areas: 1) separating the speech of two people talking simultaneously in a monaural channel (also called cochannel speech separation) [1]-[6], and 2) segmentation and/or transcription of musical signals [7]-[10]. The goal of the speech separation task is to improve the intelligibility of one talker by selectively reducing the speech of the other talker, while for music separation the goal is to extract the signal of a single instrumental line (or to produce printed musical notation) directly from a recording.

1.1 Separation of Speech and Music

The cochannel speech and musical signal separation problems share some common approaches. Both tasks have typically been formulated in terms of the time-variant spectrum of the individual signal sources. This is appropriate because one possible basis for separating additively combined signals is to distinguish the frequency content of the individual signals, assuming a linear system.

1.1.1 Cochannel Speech Separation

The cochannel speech separation work reported to date typically relies on some assumptions about the spectral properties of speech. For voiced speech (such as vowel sounds), the short-time magnitude spectrum contains a series of nearly harmonic peaks corresponding to the fundamental frequency and overtones of the speech signal. With two talkers, the composite spectrum contains the overlapping series of peaks for both voices. The common approach has been somehow to identify which peaks go with which talker and to isolate them. The separation itself has been attempted using comb filters to pass only the spectral energy belonging to one of the talkers (or notch filters to reject one of the talkers), identification and separation of spectral features (peaks) belonging to one of the talkers, and even extraction of speech parameters appropriate for use in regenerating the desired speech using a synthesis algorithm. No separation process specifically for unvoiced speech (such as fricatives or noise) has been reported in the literature. A brief comb-filtering sketch appears after Sec. 1.1.2.

1.1.2 Segregation of Voices in Ensemble Music Recordings

Identification of pitches, rhythms, and timbres from a musical recording is not a trivial task in general. The difficulties are formidable. The ensemble voices may occur simultaneously or separately (and no voices may occur during shared rests); level imbalances between voices may be present; noise of various kinds may hinder the detection process, and so forth. In the frequency domain the partials (overtones) of the various voices will often overlap, preventing simple identification of the spectra of each voice. Further, the fundamental basis of most music is time, so some means to segment the recording into time intervals and to correlate the parameters from instant to instant must be concocted.
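Neither this paper nor the cited references give an implementation, but the comb-filtering idea of Sec. 1.1.1 can be illustrated with a minimal sketch. Averaging several copies of the signal delayed by whole pitch periods reinforces the harmonics of the selected talker and attenuates energy between them. The routine name, the rounding of the pitch period to an integer number of samples, and the choice of four taps are illustrative assumptions, not part of any method described here.

    import numpy as np

    def harmonic_comb(x, fs, f0, num_taps=4):
        """Pass spectral energy near harmonics of f0 by averaging
        copies of x delayed by whole pitch periods (an FIR comb)."""
        period = int(round(fs / f0))          # pitch period in samples (rounded)
        y = np.zeros(len(x))
        for i in range(num_taps):
            d = i * period
            y[d:] += x[:len(x) - d]           # add the i-th delayed copy
        return y / num_taps                   # unity gain at the harmonics

    # Toy example: a 200-Hz "talker" plus a 310-Hz interfering tone
    fs = 8000
    t = np.arange(fs) / fs
    x = np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 310 * t)
    y = harmonic_comb(x, fs, 200.0)           # 200-Hz components survive

A comb this simple smears note transitions and fails when the two pitch tracks share harmonics, which is one motivation for the spectral-domain approach pursued in this paper.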
1.2 Research Limitations on the Scope of the Separation Problem

Because the musical signal separation problem is so complex, the initial need for this investigation was to simplify the conditions. Thus the range of input possibilities was limited by the following restrictions:

1) The recordings to be processed may contain only two separate, monophonic voices (musical duets).

2) Each voice of the duet must be harmonic, or nearly so, and contain a sufficient number of partials so that a meaningful fundamental frequency can be determined.

3) The range of fundamental frequencies for each voice must be restricted to nonoverlapping ranges, that is, the lowest musical pitch of the upper voice must be greater than the highest pitch of the lower voice. Note that a duet that does not meet this requirement in toto may still be processed if it can be divided manually into segments obeying this restriction.

4) Reverberation, echoes, and other correlated noise sources are discouraged since, in effect, they represent additional "background voices" in the recording and violate the duet assumption.

Despite these seemingly severe restrictions, the remaining difficulties are still nontrivial: how to separate the partials of the two voices when the spectra overlap; how to determine whether zero, one, or both voices are present at a given point in time; how to track each voice reliably when one is louder than the other; and so on. Moreover, success with a particular duet does not automatically guarantee success on every other duet example. In fact, projects of this sort can rapidly fall into the trap of ad hoc, special-purpose techniques to solve a particular problem, only to find another problem created.

The system developed during this research project was not necessarily intended for real-time operation. Thus the algorithms were implemented in software on a general-purpose computer. This approach has the advantages of extensive software support, relative ease of testing, and rapid debugging cycles. A sketch of how restriction 3) might be checked in software follows.
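Restriction 3) lends itself to a simple mechanical check once candidate fundamental-frequency tracks exist. The fragment below is a hypothetical illustration (the paper defines no such routine); upper_f0 and lower_f0 are assumed to be arrays of per-frame fundamental frequency estimates in hertz with unvoiced frames already removed.

    import numpy as np

    def violates_restriction_3(upper_f0, lower_f0):
        """True if the lowest pitch of the upper voice fails to stay
        strictly above the highest pitch of the lower voice."""
        return np.min(upper_f0) <= np.max(lower_f0)

    # Example: an upper voice spanning A4-A5 and a lower voice spanning A3-A4
    upper_f0 = np.array([440.0, 523.3, 880.0])
    lower_f0 = np.array([220.0, 329.6, 440.0])
    print(violates_restriction_3(upper_f0, lower_f0))  # True: 440 Hz is shared,
                                                       # so the duet needs manual
                                                       # segmentation first

Such a check only flags the problem; per the restriction, the recording would then be divided manually into segments in which the ranges do not overlap.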

1.3 Fundamental Research Questions

The major goal of this investigation was to demonstrate the feasibility of automatic composite signal decomposition using a time-frequency analysis procedure. This problem can be stated as two fundamental questions.

1) How may we automatically obtain accurate estimates of the time-variant fundamental frequency of each musical voice from a digital recording of a duet?

2) Given time-variant fundamental frequency estimates of each voice in a duet, how may we identify and separate the interfering partials (overtones) of each voice?

Question 1) treats the problem of estimating the time-variant frequencies of each partial for each voice. Assuming nearly harmonic input signals, specification of a fundamental frequency identifies the partial component frequencies of that voice. Conflicting (coincident) partial frequencies between the two voices can then be identified by comparing the predicted harmonic series of the two duet voices.

Question 2) involves the fundamental constraints on simultaneous time and frequency resolution. The desire for high-resolution frequency-domain information requires observation of the input signal over a long time span. However, long observation spans often result in an unacceptable loss of time resolution by averaging out any spectral changes during the observation interval. Thus the analysis system must somehow cope with this inherent uncertainty in determining the best time-versus-frequency representation for the input duet signal.

Note that question 2) can be treated separately from question 1) if the time-variant fundamental frequency pair for the duet can be obtained by some manual means. For example, a duet synthesized with known fundamental frequencies (a priori frequency information) can be used to evaluate a preliminary separation algorithm. Thus the two fundamental questions can be treated initially as separate problems if desired.

1.4 Research Question 1: Duet Frequency Tracking

The duet separation methods considered in this paper require good estimates of the fundamental frequency of each voice at all times. This information could come from an accurate musical score, some manual means of tabulation, or an automatic frequency tracking system. However, even if a musical score is available, musicians seldom play music with an exact, one-to-one correspondence with the printed information. Manual methods can be quite reliable, but are extremely tedious and time consuming. Thus automatic methods are of primary interest in this paper.

1.4.1 Common Methods for Pitch Detection

Fundamental frequency tracking is often called pitch detection or pitch extraction. Numerous reports describing algorithms for monophonic pitch detection have been published, including the cepstrum method [11], autocorrelation [12], the period histogram and other harmonic-based methods [13], [14], the optimum comb and average magnitude difference function (AMDF) [15], [16], and methods based on linear prediction [17], [18]. Also, time-domain methods to determine pitch periods by zero crossings, peak detection, or clipped waveform analysis have been developed. Unfortunately no single method for pitch detection has been found to be reliable for arbitrary input signals.

1.4.2 Application of Monophonic Methods to Duets

For the duet separation task, the difficulties of monophonic pitch detection are compounded by the presence of two competing signal sources. There is no certainty that a monophonic pitch detection scheme can handle multiple simultaneous signals. For example, the autocorrelation and optimum comb methods are used to identify periodicities in the input signal by searching for a delay lag T0 that maximizes the integrated product (autocorrelation) or minimizes the summed absolute value of the difference (optimum comb and AMDF). The fundamental frequency estimate is given by f0 = 1/T0, as in the sketch below.
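As a concrete illustration of the autocorrelation variant (added here for clarity, not the tracker developed in this paper), the following sketch searches a lag range for the maximum of the autocorrelation function and converts the winning lag to a frequency estimate. The search bounds f_min and f_max are assumptions for the example.

    import numpy as np

    def autocorr_f0(frame, fs, f_min=80.0, f_max=1000.0):
        """Estimate f0 of one analysis frame by locating the lag T0 that
        maximizes the autocorrelation; returns f0 = fs / T0 in hertz."""
        lag_min = int(fs / f_max)             # shortest period searched
        lag_max = int(fs / f_min)             # longest period searched
        r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
        lag = lag_min + np.argmax(r[lag_min:lag_max + 1])
        return fs / lag                       # period in samples -> hertz

    # Toy test: a 220-Hz sawtooth-like signal built from ten harmonics
    fs = 16000
    t = np.arange(1024) / fs
    frame = sum(np.sin(2 * np.pi * 220 * k * t) / k for k in range(1, 11))
    print(autocorr_f0(frame, fs))             # close to 220 Hz

With a single periodic source and a bounded search range this works well; the discussion below explains why it degrades in the duet case.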
However, identification of the extremum corresponding to the "best" T0 is not trivial because the search functions contain many subextrema, that is, the functions are not unimodal. Also, delay lags of an integral number of waveform periods will show similar extrema in the autocorrelation or AMDF, leading to possible octave errors.

The problem of octave errors is a common obstacle to many pitch detection algorithms, including harmonic-based methods. The difficulties are particularly noticeable for instruments with strong resonances such that certain upper partials (or ranges of partials) contain much more energy than the lower partials. In situations where the search range is known to be limited to less than an octave (often the case with speech) the octave error problem can be reduced. Musical melodies, on the other hand, often span a larger fundamental frequency range. Moreover, when two sources are present in the input signal, interactions between the numerous pairs of partials cause additional difficulties, which make most monophonic pitch detection methods impractical for direct application to the duet case. For this reason, a new scheme for duet frequency tracking was developed for this project, as described in Sec. 2.

1.5 Research Question 2: Time-Frequency Analysis

The second research question treats the general problem of identification and separation of the spectral components in a musical duet. Specifically, some useful representation of the duet signal simultaneously in the frequency and time domains must be obtained. Useful parametric models of musical instruments are usually not known, so any parametric spectral analysis method will require estimation of an unwieldy number of parameters. Thus the approach for this investigation was to use a standard nonparametric spectral estimation method, the short-time Fourier transform (STFT). An illustration of the time-versus-frequency tradeoff noted in question 2) follows.
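Before the formal review, the resolution constraint can be made tangible with a short numerical experiment (an illustration added here, not drawn from the paper): two closely spaced tones are resolvable with a long analysis window but merge into a single spectral peak when the window is short, while the short window localizes events better in time.

    import numpy as np

    fs = 8000
    t = np.arange(fs) / fs                      # 1 s of signal
    x = np.sin(2 * np.pi * 440 * t) + np.sin(2 * np.pi * 460 * t)

    def count_peaks(signal, n_window):
        """Windowed magnitude spectrum; count local maxima above half the peak."""
        w = np.hanning(n_window)
        spec = np.abs(np.fft.rfft(signal[:n_window] * w))
        thresh = spec.max() / 2
        return sum(spec[i] > thresh and spec[i - 1] < spec[i] > spec[i + 1]
                   for i in range(1, len(spec) - 1))

    print(count_peaks(x, 2048))   # long window (256 ms): 2 peaks, tones resolved
    print(count_peaks(x, 256))    # short window (32 ms): 1 peak, tones merged

The duet analyzer must balance exactly this tradeoff when its window length and hop size are chosen.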

1.5.1 Review of the Short-Time Fourier Transform (STFT) Analysis

The STFT has been used widely in the analysis of time-varying signals, such as speech and music [19]-[24]. The STFT takes a one-dimensional time-domain signal (amplitude versus time) and produces a two-dimensional representation (amplitude versus frequency versus time). This can be expressed for time-sampled signals in discrete form [25]:

    X(n, k) = \sum_{m=-\infty}^{\infty} w(n - m) \, x(m) \, e^{-j 2\pi m k / L}    (1)

in which x(m) is a signal defined for any sample time m, w(m) is a low-pass impulse response (window) function defined for any m, L is the number of equally spaced frequency samples between 0 Hz and the sample rate (or 0 to 2\pi normalized radian frequency), and X(n, k) is the discrete STFT of x(m) at every sample time n at normalized radian frequency 2\pi k / L.

This equation is called the STFT analysis equation because it describes the STFT in terms of the input signal x(m). With time-variant input the STFT can be thought of as providing a series of "snapshots" of the signal spectrum obtained over some chosen time interval.

The infinite sum in the STFT analysis equation is actually finite in practice because the window function w(m) is typically chosen to be real, with even symmetry about the origin (noncausal, zero phase), and nonzero only for a finite range of points centered about the origin (see Harris [26] for a description of various window functions).

For computational efficiency it is often useful to express the STFT analysis equation in the form of the discrete Fourier transform (DFT) so that a fast Fourier transform (FFT) algorithm can be used to perform the summation calculations. Using a change of variables m = n + pL + r and exchanging the order of summation, the analysis equation can be written in terms of blocks L samples long:

    X(n, k) = e^{-j 2\pi n k / L} \sum_{r=0}^{L-1} \left[ \sum_{p=-\infty}^{\infty} w(-pL - r) \, x(n + pL + r) \, e^{-j 2\pi p k} \right] e^{-j 2\pi r k / L}    (2)

Defining

    \tilde{x}(r) = \sum_{p=-\infty}^{\infty} w(-pL - r) \, x(n + pL + r) \, e^{-j 2\pi p k}    (3a)

or, since e^{-j 2\pi p k} = 1 for p and k integers,

    \tilde{x}(r) = \sum_{p=-\infty}^{\infty} w(-pL - r) \, x(n + pL + r)    (3b)

Then Eq. (2) becomes

    X(n, k) = e^{-j 2\pi n k / L} \sum_{r=0}^{L-1} \tilde{x}(r) \, e^{-j 2\pi r k / L}    (4)

which can be recognized as the DFT of \tilde{x}(r) multiplied by a linear phase shift term. (Note that \tilde{x}(r) depends implicitly on the time index n.)

If we choose the window function w(q) to be zero for q >= L/2 and q < -L/2, and noting that 0 <= r < L, the expression for \tilde{x}(r) is nonzero only under the following conditions on p and r: p = 0 (with 0 <= r < L/2) or p = -1 (with L/2 <= r < L), giving

    \tilde{x}(r) = w(-r) \, x(n + r),          0 \le r < L/2
                 = w(L - r) \, x(n - L + r),   L/2 \le r < L    (5)

Thus we can compute X(n, k) by generating the intermediate signal \tilde{x}(r), performing the DFT (using an FFT algorithm, if desired), and compensating for the linear phase term e^{-j 2\pi n k / L}. This process is depicted in Fig. 1.

Fig. 1. The Fourier transform viewpoint of STFT. A segment of the digitized signal x(m) is multiplied by the reversed and shifted window function w(n - m). The resulting signal y(m) is processed by the discrete Fourier transform.
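A direct transcription of Eqs. (4) and (5) into code may make the procedure concrete. This is a sketch added for illustration (not the paper's implementation): it forms the circularly arranged intermediate signal, takes the FFT, and applies the linear phase compensation. The zero-phase Hann window here is an assumed choice.

    import numpy as np

    def stft_frame(x, n, L):
        """Compute X(n, k) for one time index n via Eqs. (4) and (5).
        w is a zero-phase (symmetric) Hann window, nonzero for |q| < L/2.
        Valid for L/2 <= n <= len(x) - L/2 (no boundary handling)."""
        w = lambda q: 0.5 + 0.5 * np.cos(2 * np.pi * q / L) if abs(q) < L / 2 else 0.0
        x_tilde = np.zeros(L)
        for r in range(L):
            if r < L // 2:                        # p = 0 branch of Eq. (5)
                x_tilde[r] = w(-r) * x[n + r]
            else:                                 # p = -1 branch of Eq. (5)
                x_tilde[r] = w(L - r) * x[n - L + r]
        k = np.arange(L)
        phase = np.exp(-2j * np.pi * n * k / L)   # linear phase term of Eq. (4)
        return phase * np.fft.fft(x_tilde)        # X(n, k)

    # Example: a 1000-Hz tone lands in bin 32 (fs/L = 31.25 Hz per bin)
    fs = 8000
    x = np.sin(2 * np.pi * 1000 * np.arange(fs) / fs)
    X = stft_frame(x, n=512, L=256)
    print(np.argmax(np.abs(X[:129])))             # prints 32

Arranging the frame as \tilde{x}(r) rather than as a plainly windowed segment changes only the phase bookkeeping, not the magnitude spectrum.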

In considering Eq. (4) we see that the formal definition of the STFT requires a series of overlapping DFTs to represent x(n) at every time n. This overlap may seem unnecessary, considering that the original signal can be reconstructed exactly from the inverse transforms of concatenated nonoverlapping segments, that is, the discrete Fourier transform is perfectly invertible. This observation would be useful and reasonable if the only interest was in obtaining an identity analysis/synthesis procedure. However, for the duet separation problem (and for other tasks) it may be useful to interpret and modify the frequency-domain representation of the signal, which generally requires knowledge of the signal for every frequency index k at every time n. Fortunately, in practice the STFT can be performed every R samples of the input signal instead of every sample because the output sequence for a particular frequency index k is low-pass band-limited by the window function used in the analysis. This allows the STFT to be resampled at a lower rate than the signal sample rate. The frame spacing R can be called the analysis hop and must correspond to a rate 1/R at least twice the bandwidth of the analysis window in order to meet the Nyquist criterion, in this case applied to resampling the STFT. Choosing a smaller analysis hop size has the desirable effect of improving the time resolution of the sampled STFT, while choosing a larger hop reduces the frame rate, and thus the storage requirements and computation load. This resolution/computation tradeoff must be addressed to fit the needs of a given situation.

The STFT analysis equation, including an analysis hop, is given by

    X(sR, k) = e^{-j 2\pi s R k / L} \, \mathrm{DFT}\{\tilde{x}(sR)\}    (6)

where s is an integer.

1.5.2 Review of the Short-Time Fourier Transform (STFT) Synthesis

The synthesis equation corresponding to the STFT analysis equation, Eq. (4), can be expressed as an overlap-add (OLA) procedure:

    \hat{x}(n) = \sum_{\mathrm{all}\; m} \sum_{k=0}^{L-1} X(m, k) \, e^{j 2\pi n k / L}    (7)

Expanding each term of Eq. (7) as an inverse DFT shows that the output consists of the original signal multiplied by copies of the window function w(n) reversed and shifted by multiples of the hop size R. If the summation is constant and exactly equal to L for all s and n, the OLA process can exactly invert the STFT. Fortunately it can be shown [27] that any low-pass window function w(n) which is band-limited to frequency B <= 1/(2R) satisfies the equation

    \sum_{\mathrm{all}\; s} w(sR - n) = \frac{1}{R} W(0) = \mathrm{constant}, \quad \mathrm{for\ all}\ n    (11)

where W(.) is the Fourier transform of w(.). Of course, any time-limited window cannot be completely band-limited, so the hop size R must often be chosen based on some performance criterion. The scaling factor L can be included implicitly by scaling the time-domain window function prior to analysis, if desired.

The identity property of the STFT (the original signal can be resynthesized perfectly) implies that the analysis data contain all the information present in the original signal. This attribute is important because it theoretically allows
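The OLA mechanics of Eqs. (7) and (11) can be checked numerically. The sketch below is an illustration under common conventions (it is not the paper's code): it overlap-adds shifted copies of a Hann window to verify that their summation is constant, then analyzes and resynthesizes a test signal at hop R. The choice R = L/4 is an assumption that approximately satisfies the band-limit condition for the Hann window.

    import numpy as np

    L, R = 256, 64                        # DFT length and analysis hop (R = L/4)
    w = np.hanning(L + 1)[:L]             # "periodic" Hann; shifted copies sum flat
    N = 5 * L

    # Eq. (11) check: window copies shifted by multiples of R sum to a constant
    cola = np.zeros(N)
    for start in range(0, N - L + 1, R):
        cola[start:start + L] += w
    interior = slice(L, 4 * L)            # ignore the partially covered edges
    print(np.ptp(cola[interior]))         # ~0: the summation is constant
    print(cola[L])                        # the constant (2.0 for Hann at R = L/4)

    # Hopped analysis, then overlap-add of inverse DFTs (one Eq. (7) term per frame)
    x = np.random.randn(N)
    y = np.zeros(N)
    for start in range(0, N - L + 1, R):
        X = np.fft.fft(w * x[start:start + L])          # analysis frame at hop R
        y[start:start + L] += np.real(np.fft.ifft(X))   # overlap-add synthesis
    print(np.max(np.abs(y[interior] / cola[L] - x[interior])))  # ~0: identity

This sketch uses the conventional windowed-frame arrangement; the zero-phase \tilde{x}(r) form of Fig. 1 differs only by the linear phase term of Eq. (4), which cancels in analysis-synthesis. Dividing by the measured constant plays the role of the window scaling mentioned in the text.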
