Voice Extraction by On-Line Signal Separation and Recovery


IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: ANALOG AND DIGITAL SIGNAL PROCESSING, VOL. 46, NO. 7, JULY 1999, p. 915

Voice Extraction by On-Line Signal Separation and Recovery

G. Erten, Senior Member, IEEE, and F. M. Salam, Fellow, IEEE

Abstract— The paper presents a formulation and an implementation of a system for voice output extraction (VOX) in real-time and near-real-time realistic real-world applications. A key component includes voice-signal separation and recovery from a mixture in practical environments. The signal separation and extraction component includes several algorithmic modules with a variety of sophistication levels, which include dynamic processing neural networks in tandem with (dynamic) adaptive methods. These adaptive methods make use of optimization theory subject to the dynamic network constraints to enable practical algorithms. The underlying technology platforms used in the compiled VOX software can significantly facilitate the embedding of speech recognition into many environments. Two demonstrations are described: one is PC-based and is near-real time, the second is digital signal processing based and is real time. Sample results are described to quantify the performance of the overall systems.

Index Terms— Adaptive networks, audio signal processing, DSP, gradient descent, independent component analysis, neural networks, nonlinear networks and systems, optimization, speech processing, state–space models, statistical independence criteria.

I. INTRODUCTION

THE ABILITY to selectively enhance audio signals of interest while suppressing spurious ones is an essential prerequisite to widespread practical use of far-field voice-activated systems. Such audio signal discrimination allows for selective amplification of a single source of speech within a mixture of two or more signals, including noise and other speakers' voices.
Although speech recognition systems have made significant progress in the last few years, their sensitive dependence on the quality of the voice signal still prevents them from being widely deployed in man–machine communication. In addition, the success rate for recognition, especially in the case of large vocabulary and/or continuous speech recognition, is simply unsatisfactory to the end user in many environments. Due to the need of having pure voice signals, the users of these systems need to wear headsets with microphone attachments. This is often quite restrictive and unnatural. Thus, freedom from headsets is one significant and driving concern.

Manuscript received November 1, 1998; revised March 9, 1999. This paper was recommended by Guest Editors F. Maloberti and R. Newcomb.
G. Erten is with IC Tech, Inc., Okemos, MI 48864 USA.
F. M. Salam is with the Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI 48824 USA.
Publisher Item Identifier S 1057-7130(99)05641-4.

For removal of noise from the audio signal transduced through a microphone, most traditional signal processing systems use (linear) frequency and filter-based techniques. This approach has obvious limitations, especially when the spectral content of the voice overlaps with other sounds (including those produced by other speakers) in the background. By contrast, the methods introduced here are directly nonlinear time-domain based, which extract and track voice signals of interest based on alternate signal. One of the primary components of our voice-signal enhancement module involves the extraction of a speech signal from transduced sound mixtures.
Techniques of this nature have widely been called independent signal separation and recovery in the literature [1]–[4].

The signal-extraction component may be intuitively described as follows: several unknown but independent temporal signals propagate through a mixing and/or filtering, natural or synthetic medium. By sensing the outputs of this medium, a network (e.g., a neural network, a system, or a device) is configured to counteract the effect of the medium and adaptively recover the original unmixed signals. The property of signal independence, with the possibility of minimizing signal dependency, is assumed for this processing. No additional a priori knowledge of the original signals is assumed. This processing represents a form of self (or unsupervised) learning. The weak assumptions and self-learning capability render such a network attractive from the viewpoint of real-world applications where (off-line) training may not be practical.

The blind-separation approach has great advantages over the existing adaptive filtering algorithms. For example, when the mixture of other signals is labeled as noise in this approach, no specific a priori knowledge about any of the signals is assumed; only that the processed signals are independent. This is in contrast to the noise-cancellation method proposed by Widrow et al. [5], which requires that a reference signal be correlated exclusively to the part of the waveform (i.e., noise) that needs to be filtered out. This latter requirement entails specific a priori knowledge about the noise, as well as the signal(s).

The separation of independent sources is valuable in numerous and major applications in areas as diverse as telecommunication systems, sonar and radar systems, audio and acoustics, image/information processing, and biomedical engineering. Consider, e.g., the scenario of audio and sonar signals, where the original signals are sounds, and the mixed signals are the output of several microphones or sensors placed at different vantage points.
A network will receive, via each microphone, a mixture of sounds that are usually delayed relative to one another and/or relative to the original sounds. The network's role is then to dynamically reproduce the original signals, where each separated signal can be subsequently channeled for further processing or transmission. Similar application scenarios can be described in situations involving heart-rate measurements, communication in noisy environments, engine diagnostics, and uncorrupted cellular phone communications.

The paper is organized as follows: Section II provides a brief description of the problem of speech signal separation in practical hands-free environments. Section III describes modeling of both the environment and, consequently, the processing network using the state–space approach. Feedforward and feedback structures are considered, as well as their generalization to nonlinear parameterized models. Section IV describes the framework and derivation of the update laws, expressed as optimization of a performance index subject to the dynamics of the processing network. This encompasses the most general and advanced update laws. In fact, families of update laws are analogously generated for a variety of mixing environments. Section V describes two example demonstrations, one focusing on a digital signal processing (DSP) system, and the other on a standard PC platform, both interfaced with microphones and speakers. These demos serve as testbeds for the modular codes and methodology which can be integrated as part of a system for voice extraction and tracking. In Section VI, we summarize our concluding remarks.

II. SEPARATION AND EXTRACTION OF SPEECH SIGNALS

The recovery and separation of independent signal sources is a classic but difficult problem in signal processing. The problem is complicated by the fact that in practical situations, many relevant characteristics of both the signal sources and the mixing medium are unknown. Two main categories of methods exist in the literature:
1) conventional discrete signal processing;
2) neurally inspired adaptive algorithms.

Conventional signal-processing approaches [1] to signal separation originate in the discrete domain in the spirit of traditional digital signal-processing methods that use statistical properties of signals. Such signal-separation methods employ discrete signal transforms and ad hoc filter/transform function inversion. Statistical properties of the signals in the form of a set of cumulants are used and these cross cumulants are mathematically forced to approach zero. This constitutes the crux of the family of algorithms that search for the parameters of, mostly, finite-impulse response (FIR), transfer functions that recover and separate the signals from one another. Calculating all possible cross cumulants, on the other hand, would be impractical and too time consuming for real-time implementation. In addition, the methods discussed in [1] do not provide an extension beyond the two mixture case, i.e., three or more signals mixed together cannot be separated in this fashion. However, a possible extension can be developed, but ends up being computationally too expensive.

Neurally inspired adaptive algorithms pursue an approach originally proposed by Herault and Jutten, now called the Herault–Jutten (HJ) algorithm [2], [3]. However, the HJ algorithm has been considered heuristic with suggested adaptation laws that have been shown to work mainly in special circumstances. The theory and analysis of prior work pertaining to the HJ algorithm are still not sufficient to support or guarantee the success encountered in experimental simulations. Their proposed algorithm assumes a linear static medium with no filtering or delays. Specifically, the original signals are assumed to be transferred by the medium via a matrix of unknown but constant coefficients. To summarize, the HJ method: 1) is restricted to the full rank and linear static mixing environments; 2) requires matrix-inversion operations; and 3) does not take into account the presence of signal delays. Many approaches in the current literature pursue a modeling of the mixing medium in the form of an unknown constant matrix. In many practical applications, however, delays do occur, and on many occasions, the medium mixing environment may exhibit nonlinear phenomena. Accordingly, the previous work fails to successfully separate signals in many practical situations and real-world applications.

III. DYNAMIC MODELS FOR THE ENVIRONMENT AND THE PROCESSING NETWORK

The static mixing case studied by many, e.g., [2], [3], [6], is limited to mixing by a constant matrix. If we define the source signal vector as $S(t)$, then the mixed-signal vector $M(t)$ is defined as

$$M(t) = A\,S(t) \tag{1}$$

Separation of statically mixed signals is of limited use because of additional factors involved in superposition of signals in real mixing environments. Some examples of additional factors to be considered include the: 1) propagation time delays between sources and receivers or sensors; 2) nonlinear nature of the mixing functions introduced by the mixing medium as well as the signal sensors or receivers; and 3) unknown number of source signals that are to be separated. The methodology developed by our team addresses all three by first extending the formulation of the problem to include dynamic modeling of the signal mixing/interference medium. The dynamic portion of the mixing model we present in this paper accounts for more realistic mixing environments, defines their dynamic models, and develops an update law to recover the original signals within a comprehensive framework. In the dynamic case, the mixing environment is no longer a constant matrix. In fact, the dynamic representation of the signal mixing and separation processes takes the problem out of the realm of algebraic equations to the realm of differential (or difference) equations. Several state–space formulations have been reported in [7]–[9] and the references therein.

A. A Simple Dynamic Realization

We now discuss one simple complete formulation as an example. Recall that a feedback separation structure for the static case of (1) is given as (see [2])

$$U(t) = M(t) - D\,U(t) \tag{2i}$$

which may be rewritten as

$$U(t) = [I + D]^{-1} M(t) \tag{2ii}$$

This yields output vector $U(t)$ which estimates the original signal sources $S(t)$ by adaptively updating the entries of a matrix $D$ in (2ii), so that

(2iii)

where […] and […] are permutation matrices of a (nonsingular) diagonal matrix. A special case of […] and […] would be permutation matrices of the identity.

As introduced in [4], we view (2i) as a limit of the dynamic equation

$$\tau\,\dot{U}(t) = -U(t) - D\,U(t) + M(t) \tag{3}$$

where $\tau$ is a small time constant. This facilitates the computation by initializing the differential equation in (3) from an arbitrary guess. It is important, however, to ensure the separation of time scales between (3) and the adopted update procedure of $D$ like the one defined below by (4). This may be ensured by making $\eta$ in (4) sufficiently small. (Note that $d_{ij}$ is the $ij$th component of the matrix $D$.)

$$\dot{d}_{ij} = \eta\, f(u_i)\, g(u_j) \tag{4}$$

where $\eta$ is sufficiently small, and $f$ and $g$ are a pair of a family of odd functions [2], [3]. In particular, we use a family of functions we developed in [7], which includes

[…] their inverses, or (5)
[…] their inverses, or (6)

When using (4) for the static case, one solves for $U(t)$ from (2ii). For the dynamic case, however, (3) is used instead. The procedure thus enumerates the differential equations of (3). In addition, the adaptation process for the entries of the matrix $D$ can be defined by multiple criteria, e.g., the selection of functions $f$ and $g$ in (4). The process facilitates the computation by initializing the differential equations from an arbitrary guess, and makes it possible to construct continuously adaptive algorithms [4], [7]. Many types of approaches to solving such differential equations exist. One can distinguish methods as continuous versus discrete as well as fixed versus variable step sizes.

B. The State–Space Approach

Let the $n$-dimensional source signal vector be $S$ and the $m$-dimensional measurement vector be $M$. Let the mixing environment be described by the linear time-invariant (LTI) state–space [4], [8]

$$\dot{\bar{X}} = \bar{A}\,\bar{X} + \bar{B}\,S \tag{7i}$$
$$M = \bar{C}\,\bar{X} + \bar{D}\,S \tag{7ii}$$

(Note that we have suppressed the dependence on time of the variables for simplicity of presentation.) Assume that the state $\bar{X}$ is of dimension $N$. The parameter matrices $\bar{A}$, $\bar{B}$, $\bar{C}$, and $\bar{D}$ are of compatible dimensions [8]. This formulation encompasses both continuous-time and discrete-time dynamics. The dot on the state means derivative for continuous-time dynamics, it however means "advance" for discrete-time dynamics. The mixing environment is assumed to be (asymptotically) stable, e.g., the matrix $\bar{A}$ has its eigenvalues in the left half complex plane in the continuous-time case, and analogously within the unit (complex) circle in the discrete-time case. We now consider two processing network structures.

1) The Feedforward Network Structure/Architecture: The (adaptive) feedforward network is proposed to be of the state–space form

$$\dot{X} = A\,X + B\,M \tag{8i}$$
$$y = C\,X + D\,M \tag{8ii}$$

where $y$ is the $n$-dimensional output, $X$ is the internal state, and the parameter matrices are of compatible dimensions [8]. For simplicity, let us assume that $X$ has the same dimensions as $\bar{X}$. Several adaptive laws have been reported in [7], [9], [10] and the references therein.

2) The Feedback Network Structure/Architecture: The (adaptive) feedback structure is defined to be of the form

(9i)
(9ii)

where […] is the […]-dimensional output, […] is the internal network state vector which is of dimension higher than or equal to […] and the parameter matrices […] and […] are of compatible dimensions. For simplicity, we assume that […] has the same dimensions as […]. Several adaptive laws have been reported in our previous work in [7], [9], [10] and the references therein.

In a more general network framework, the dynamic network can be represented by the nonlinear time-varying state representation with parameters as

$$\dot{X} = f(t, X, M, w_1) \tag{10i}$$
$$y = g(t, X, M, w_2) \tag{10ii}$$

where $f$ and $g$ are differentiable functions which permit existence and uniqueness of solutions of the system of equations. Such a nonlinear network may be used to counteract the nonlinearity and dynamics of the environment which may be a generalization of the LTI system in (7i) and (7ii). We note also that $w_1$ and $w_2$ represent the parameters in the state and output equations, respectively.

IV. FORMULATION OF THE UPDATE LAWS

For the dynamic environment and processing networks, the original update laws used for the static case or the simple dynamic case can not be expected to work for general cases. The appropriate formulation is to consider an optimization process of the dependence criterion under dynamic network constraints.

The mutual information of a random vector $y$ is a measure of dependence among its components and is defined as ([6], [7], [9], [10], and the references therein)

$$L(y) = \int p_y(y)\,\ln \frac{p_y(y)}{\prod_{i=1}^{n} p_{y_i}(y_i)}\,dy \tag{11}$$

where $p_y(y)$ is the probability density function (pdf) of the random vector $y$. The functional $L(y)$ is always nonnegative and is zero if and only if the components of the random vector $y$ are statistically independent. This important measure defines the degree of dependence among the components of the signal vector. Therefore, it represents an appropriate functional for characterizing (the degree of) statistical independence. $L(y)$ can be expressed in terms of the entropy

$$L(y) = -H(y) + \sum_{i=1}^{n} H(y_i) \tag{12}$$

where $H(y) = -E[\ln p_y(y)]$ is the entropy of $y$, $H(y_i) = -E[\ln p_{y_i}(y_i)]$ is the marginal entropy of the component signal $y_i$, and $E[\cdot]$ denotes the expected value. We also define the measure of dependence to be proportional to […]

The update law is now developed for dynamic environments to recover the original signals following the procedures in [7], [9], [10]. Let the network be a continuous-time dynamical system. One defines the performance index as in optimization theory [11] as

(13)

where […] is the Lagrangian and is defined as

(14)

where […] is the "adjoint" state [11]. In general, the dynamics are described by the Euler–Lagrange variational equations

(15i)
(15ii)

with boundary conditions […] and […]. The parameter update of […] according to the general gradient (instantaneous) descent form, is

(16i)

A variant of (16i) used in practice includes a sufficiently small leakage (i.e., damping) term as

(16ii)

As an example, consider the specific case when the environment is considered a linear dynamical system. The network, consequently, may be modeled as a linear (feedforward) dynamical system. The adjoint state equation in that case is given by

(17)

The functional […] represents a scaled version of our measure of dependence […] and […] is a vector constructed of the rows of the parameter matrices […] and […]. Note that a canonical realization [8] may be used so that $B$ is constant. The matrix $A$ in the canonical representation may have only $N$-parameters, where $N$ is the dimension of the state vector $X$. The parameters […] and […] represented generically by […] will be updated using the general gradient descent form. Consequently, using the performance index defined in (12), the matrices […] and […] are thus updated according to [12]

(18)
(19)

where […] may be represented by a scaled version of the identity matrix, […] and […] are update rate parameters or functions, and […] is given by a variety of nonlinear expansive or compressive odd-functions, which include hyperbolic sine and tangent and their inverses. Or, in general, sigmoidal or inverse of a sigmoidal functions. In the specific computation/approximation performed in [7], [9], [10], one function used is given as

(20)

The essential features in using (20) are summarized as follows:
1) it is analytically derived and justified;
2) it includes a linear term in $y$ and thus enables the performance of second-order statistics necessary for signal whitening;
3) it contains higher order terms which emanate from the fourth-order cumulant statistics in the output signal $y$;
4) it does not make the assumption that the output signal has unity covariance.

To our knowledge, the function of (20) represents the only analytically derived function in the literature with the above characteristics, to date. This function, therefore, avoids the limitations of another analytically derived function reported in [6]. Computer simulations confirm that the algorithm converges if the function defined in (20) is used. Examples of computer simulations were reported in [7], [9], [10] and the references therein.

V. DEMONSTRATION ENVIRONMENTS

In this work, the voice-extraction system is demonstrated under both simulated and real dynamic multimicrophone mixing conditions in two environments: 1) a near-real-time PC environment and 2) a real-time DSP environment. In the demonstration, a tradeoff between efficiency of computation and acceptable performance is judiciously used to render a real-world

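
The dependence measure used in Section IV, mutual information, can be illustrated on a small discrete example. The sketch below is a simplification under stated assumptions: it replaces the continuous pdfs with a finite joint probability table for a two-component vector, and the function names are hypothetical. It computes the mutual information as the sum of the marginal entropies minus the joint entropy, which is zero when the table factors into its marginals and positive otherwise.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution; zero entries are skipped."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mutual_information(joint):
    """Mutual information of a two-component discrete random vector.

    joint[i][j] is P(y1 = i, y2 = j). Returns H(y1) + H(y2) - H(y1, y2),
    a discrete analogue of the dependence measure discussed in the text.
    """
    p1 = [sum(row) for row in joint]          # marginal distribution of y1
    p2 = [sum(col) for col in zip(*joint)]    # marginal distribution of y2
    flat = [p for row in joint for p in row]  # joint distribution as a flat list
    return entropy(p1) + entropy(p2) - entropy(flat)


if __name__ == "__main__":
    # Hypothetical tables: one that factors into its marginals, one that does not.
    independent = [[0.3 * 0.4, 0.3 * 0.6], [0.7 * 0.4, 0.7 * 0.6]]
    dependent = [[0.5, 0.0], [0.0, 0.5]]
    print(mutual_information(independent))
    print(mutual_information(dependent))
```

An adaptive separation network drives a measure of this kind toward zero at its outputs; estimating it from samples of continuous signals is a harder problem that this sketch does not address.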