An Image-based Deep Spectrum Feature Representation for the Recognition of Emotional Speech


An Image-based Deep Spectrum Feature Representation for the Recognition of Emotional Speech

Nicholas Cummins, Chair of Complex & Intelligent Systems, University of Passau, Germany (nicholas.cummins@ieee.org)
Shahin Amiriparian, Chair of Complex & Intelligent Systems, University of Passau; Machine Intelligence & Signal Processing group, TUM, Germany (shahin.amiriparian@tum.de)
Gerhard Hagerer, Chair of Complex & Intelligent Systems, University of Passau; audEERING GmbH, Gilching, Germany (gh@audeering.com)
Anton Batliner, Chair of Complex & Intelligent Systems, University of Passau, Germany (anton.batliner@uni-passau.de)
Stefan Steidl, Pattern Recognition Lab, FAU Erlangen-Nuremberg, Germany (stefan.steidl@fau.de)
Björn W. Schuller, Chair of Complex & Intelligent Systems, University of Passau; Group on Language, Audio & Music, Imperial College London, UK (schuller@ieee.org)

ACM Reference format: Nicholas Cummins, Shahin Amiriparian, Gerhard Hagerer, Anton Batliner, Stefan Steidl, and Björn W. Schuller. 2017. An Image-based Deep Spectrum Feature Representation for the Recognition of Emotional Speech. In Proceedings of MM'17, October 23–27, 2017, Mountain View, CA, USA, 7 pages. DOI: https://doi.org/10.1145/3123266.3123371

ABSTRACT
The outputs of the higher layers of deep pre-trained convolutional neural networks (CNNs) have consistently been shown to provide a rich representation of an image for use in recognition tasks. This study explores the suitability of such an approach for speech-based emotion recognition tasks. First, we detail a new acoustic feature representation, denoted as deep spectrum features, derived from feeding spectrograms through a very deep image classification CNN and forming a feature vector from the activations of the last fully connected layer. We then compare the performance of our novel features with standardised brute-force and bag-of-audio-words (BoAW) acoustic feature representations for 2- and 5-class speech-based emotion recognition in clean, noisy and denoised conditions. The presented results show that image-based approaches are a promising avenue of research for speech-based recognition tasks. Key results indicate that deep spectrum features are comparable in performance with the other tested acoustic feature representations in matched-noise-type train-test conditions; however, the BoAW paradigm is better suited to cross-noise-type train-test conditions.

CCS CONCEPTS
• Computing methodologies → Neural networks; Instance-based learning; • Applied computing → Psychology

KEYWORDS
convolutional neural networks, image recognition, spectral features, computational paralinguistics, emotions, realism

1 INTRODUCTION
Convolutional neural networks (CNNs) have become increasingly popular in machine learning research. Due to their high accuracy, they are arguably the most dominant approach for large-scale image recognition tasks [17]. There currently exists a plethora of pre-trained and open-source deep CNN architectures, such as AlexNet [16] and VGG19 [30], which have been trained on over a million images for image classification. AlexNet, in particular, has been revolutionary within computer vision. Consisting of 60 million parameters, 500 000 neurons and 5 convolutional layers, it achieved a previously unseen level of performance in the 2012 ImageNet competition [16, 17]. Most major technology companies now use CNNs for image understanding and search tasks [14, 17].

These pre-trained CNNs are also gaining considerable research interest as feature extractors for a task of interest, e.g. object or scene recognition [6, 29]. It is argued that CNNs, through their layered combination of convolutional and pooling layers, capture a robust mid-level representation of a given image, as opposed to low-level features such as edges and corners [6, 17]. It has been shown that deep representation features extracted from the activations of the top layers of AlexNet have sufficient representational power and generalisability for image recognition tasks [6]. Indeed, state-of-the-art results for a range of vision-based classification tasks have been achieved with such deep representation features [5, 29].

The success of CNNs has not been limited to the image domain. In the audio domain, feeding spectrogram representations through CNNs has been shown to produce suitable salient features for acoustic event detection [3], music onset detection [23], automatic speech recognition [1, 22], and speech-based emotion recognition [12, 18].

These papers, however, trained their own CNN architectures, requiring substantial amounts of data, time and computational power. As a result, research efforts have begun into leveraging pre-trained image CNNs to learn suitable speech representations [2, 4, 9, 10]. In this regard, this paper explores the suitability of deep spectrum features for speech-based emotion recognition. Deep spectrum features are derived from forwarding spectrograms through AlexNet and using the activations from the second fully connected layer (fc7) [16] as a feature vector. This approach has been shown to be suitable in other computational paralinguistic tasks such as snore sound recognition [2, 9] and autism severity classification [4], but has yet to be explored for emotion classification.

We compare the efficacy of the deep spectrum features with two sorts of standard acoustic feature representations: the small, but tailor-made for emotion recognition, extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) [7]; and the large brute-force 2013 Interspeech Computational Paralinguistics Challenge feature set (ComParE), which can be considered an omnibus feature set for paralinguistic tasks [8]. We also compare with a bag-of-audio-words (BoAW) representation [25], which has produced state-of-the-art results for continuous emotion prediction [24]. Finally, as results presented in [12] indicate that (speech) CNN features are potentially more robust to the effects of environmental noise than more established speech features, we test all feature representations using three versions – clean, noisy and denoised – of the FAU-AIBO Emotion Corpus [31], in both a 2- and a 5-class set-up.

The rest of this paper is structured as follows: the deep spectrum feature extraction procedure is outlined in Section 2; the experimental settings, including a detailed database description, are given in Section 3; the results and corresponding discussion are presented in Section 4; finally, a brief conclusion and future work directions are given in Section 5.

2 DEEP SPECTRUM FEATURES
As already mentioned, deep spectrum features are derived from forwarding spectrograms through AlexNet and using the activations from the second fully connected layer (fc7) [16]; an overview of their extraction is provided in Figure 1. It is worth noting that spectral and cepstral features are widely used, not only in the speech-based emotion literature, but in speech processing in general [15, 20, 26, 28].

Figure 1: Overview of the deep spectrum feature extraction procedure. Spectrograms are generated from whole audio files and then fed through the pre-trained image classification CNN AlexNet. The activations of AlexNet's last fully connected layer, fc7, are used to form the 4 096-dimensional deep spectrum feature vectors. Abbreviations: conv denotes convolutional layers and ch denotes channels.

2.1 Spectrogram Creation
The first stage of the extraction procedure is to create spectrograms in a suitable format for processing by AlexNet. A spectrogram is a 2-dimensional visual representation of the time-varying spectral characteristics of an audio signal [20]. To create the plots, we use the Python package matplotlib [13] with the following settings: the Fast Fourier Transform (FFT) is computed using a window size of 256 samples with an overlap of 128 samples; we use a Hanning window function and compute the power spectral density on the dB power scale. The spectrograms are then plotted using a viridis colour mapping, which is a perceptually uniform sequential colour map varying from blue (low range) to green (mid range) to yellow (upper range). Results presented in [2] demonstrate the suitability of this colour mapping for extracting deep spectrum features over other candidates such as jet or greyscale. Finally, the plots are scaled and cropped to square images without axes and margins to comply with the input needs of AlexNet. Our spectrograms have a size of 227 × 227 pixels.

2.2 Deep Feature Extraction
Having created the spectrogram plots, the next step is to create the feature representation. For this we use the publicly available toolkit Caffe [14] to obtain the models and weights for AlexNet [16].
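
As a concrete illustration of Sections 2.1 and 2.2, the sketch below shows one possible way to render a 227 × 227 viridis spectrogram with matplotlib and to read the fc7 activations out of a pre-trained AlexNet with the Caffe Python bindings. The file names (utterance.wav, deploy.prototxt, bvlc_alexnet.caffemodel) are placeholders, and preprocessing details such as mean subtraction are omitted; this is a minimal sketch of the described pipeline under those assumptions, not the authors' actual extraction scripts.

```python
import matplotlib
matplotlib.use('Agg')                               # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np
from scipy.io import wavfile
import caffe                                        # pycaffe bindings for the Caffe toolkit

def save_spectrogram(wav_path, png_path):
    """Plot a whole-file spectrogram with the Section 2.1 settings: 256-sample
    FFT window, 128-sample overlap, Hanning window, power spectral density in
    dB, viridis colour map, no axes or margins, 227 x 227 pixels."""
    sample_rate, signal = wavfile.read(wav_path)
    signal = signal.astype(np.float64)
    if signal.ndim > 1:                             # down-mix stereo to mono
        signal = signal.mean(axis=1)
    fig = plt.figure(frameon=False)
    fig.set_size_inches(1, 1)                       # 1 inch at 227 dpi -> 227 pixels
    ax = plt.Axes(fig, [0.0, 0.0, 1.0, 1.0])        # axes fill the whole canvas
    ax.set_axis_off()
    fig.add_axes(ax)
    ax.specgram(signal, NFFT=256, Fs=sample_rate, noverlap=128,
                window=np.hanning(256), mode='psd', scale='dB', cmap='viridis')
    fig.savefig(png_path, dpi=227)
    plt.close(fig)

def extract_fc7(png_path, deploy='deploy.prototxt',
                weights='bvlc_alexnet.caffemodel'):
    """Forward a spectrogram image through pre-trained AlexNet and return
    the 4 096-dimensional fc7 activations (Section 2.2)."""
    net = caffe.Net(deploy, weights, caffe.TEST)
    # Standard Caffe image preprocessing: HWC -> CHW, [0, 1] -> [0, 255], RGB -> BGR.
    transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
    transformer.set_transpose('data', (2, 0, 1))
    transformer.set_raw_scale('data', 255)
    transformer.set_channel_swap('data', (2, 1, 0))
    net.blobs['data'].reshape(1, 3, 227, 227)
    image = caffe.io.load_image(png_path)[:, :, :3]  # float RGB in [0, 1], drop alpha if present
    net.blobs['data'].data[...] = transformer.preprocess('data', image)
    net.forward()
    return net.blobs['fc7'].data[0].copy()          # the deep spectrum feature vector

save_spectrogram('utterance.wav', 'utterance.png')
deep_spectrum_features = extract_fc7('utterance.png')   # shape: (4096,)
```

In this reading of the pipeline, one such 4 096-dimensional vector is produced per utterance and used as its deep spectrum feature representation.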

AlexNet was the first large, deep CNN to be successfully applied to the ImageNet task, in 2012; in both the classification and localisation tasks, it secured first place with almost half the error rate of the best conventional image analysis approach [17]. AlexNet consists of five convolutional layers of varying kernel sizes, followed by three fully connected layers, the last of which is used to perform the 1 000-way classification required for the ImageNet tasks by applying a softmax function.

For the deep spectrum feature extraction, the spectrogram plots are forwarded through the pre-trained network and the activations from the neurons on the second fully connected layer, fc7, are extracted as feature vectors (cf. Figure 1). The resulting feature set has 4 096 attributes, one for every neuron in this fully connected layer of AlexNet. Results presented in [2] demonstrate that AlexNet is better suited for deep spectrum feature generation than VGG19 [30].

3 EXPERIMENTAL SETTINGS
This section outlines the key experimental settings – the feature representations (Section 3.1), the FAU-AIBO Emotion Corpus (Section 3.2), the denoising solution (Section 3.3) and the classification set-up (Section 3.4) – used to generate the presented results.

3.1 Feature Representations
All results are presented on four different utterance-level acoustic feature representations. In addition to the deep spectrum features previously outlined (cf. Section 2), we also test the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) [7], the 2013 Interspeech Computational Paralinguistics Challenge feature set (ComParE) [8], and the bag-of-audio-words (BoAW) paradigm [25]. All three conventional acoustic representations tested have been shown to be suitable for emotion recognition tasks [5, 8, 11, 19, 21, 24].

eGeMAPS is a small (low-dimensional) knowledge-based acoustic feature set purposely designed to have a high level of robustness for capturing emotion from speech [7]. It consists of 2 functional descriptors, the arithmetic mean and the coefficient of variation, of a set of 42 low-level descriptors (LLDs) as described in Table 1. For full details, the reader is referred to [7].

Table 1: The 42 low-level descriptors (LLDs) provided in the eGeMAPS acoustic feature set.

  LLD                                               Group
  1 energy-related LLD
    Sum of auditory spectrum (loudness)             Prosodic
  25 spectral LLDs
    Alpha ratio (50–1 000 Hz / 1–5 kHz)             Spectral
    Energy slope (0–500 Hz, 0.5–1.5 kHz)            Spectral
    Hammarberg index                                Spectral
    MFCC 1–4                                        Cepstral
    Spectral flux                                   Spectral
  6 voicing-related LLDs
    F0 (linear & semi-tone)                         Prosodic
    Formants 1, 2, 3 (freq., bandwidth, ampl.)      Voice quality
    Harmonic difference H1–H2, H1–A3                Voice quality
    log. HNR, jitter (local), shimmer (local)       Voice quality

ComParE is a large (high-dimensional) brute-forced acoustic feature set containing 6 373 static features (i.e. functionals) of low-level descriptor (LLD) contours. An overview of the prosodic, spectral, cepstral, and voice quality LLDs is given in Table 2. The functionals applied to the LLD contours include the mean, standard deviation, percentiles and quartiles, linear regression functionals, and local minima/maxima related functionals. For full details, the reader is referred to [8].

Table 2: The 65 low-level descriptors (LLDs) provided in the ComParE acoustic feature set.

  LLD                                                      Group
  4 energy-related LLDs
    Sum of auditory spectrum (loudness)                    Prosodic
    Sum of RASTA-filtered auditory spectrum                Prosodic
    RMS energy, zero-crossing rate                         Prosodic
  55 spectral LLDs
    RASTA-filtered auditory spectrum bands 1–26 (0–8 kHz)  Spectral
    MFCC 1–14                                              Cepstral
    Spectral energy 250–650 Hz, 1 k–4 kHz                  Spectral
    Spectral roll-off points 0.25, 0.5, 0.75, 0.9          Spectral
    Spectral flux, centroid, entropy, slope                Spectral
    Psychoacoustic sharpness, harmonicity                  Spectral
    Spectral variance, skewness, kurtosis                  Spectral
  6 voicing-related LLDs
    F0 (SHS & Viterbi smoothing)                           Prosodic
    Probability of voicing                                 Voice quality
    log. HNR, jitter (local & DDP), shimmer (local)        Voice quality

BoAW is a sparse audio representation formed by the quantisation (bagging) of acoustic LLDs; each frame-level LLD vector is assigned to an audio word from a codebook learnt from some training data. Counting the number of assignments for each audio word, a fixed-length histogram (bag) representation of an audio clip is generated. The histogram represents the frequency of each identified audio word in a given audio instance [25]. Due to the quantisation step, BoAW representations can be considered more robust than LLDs. The sparsity of the final feature representation can be controlled by two parameters: the codebook size (Cs), which determines the dimensionality of the final feature vectors, and the number of assignments (Na), which determines the number of words assigned to an audio instance. For further details on BoAW formation, the reader is referred to both [24, 25].
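
To make the BoAW formation in Section 3.1 more tangible, the following is a minimal NumPy sketch of random codebook generation, multiple assignment (Na) and histogram building. It is only a conceptual illustration, not the openXBOW toolkit actually used in the experiments (cf. Section 3.4); all function names and the toy data are invented for the example.

```python
import numpy as np

def random_codebook(train_llds, codebook_size, rng):
    """Build a codebook by randomly sampling frame-level LLD vectors
    from training data (random codebook generation, cf. Section 3.4)."""
    idx = rng.choice(len(train_llds), size=codebook_size, replace=False)
    return train_llds[idx]

def bag_of_audio_words(llds, codebook, num_assignments):
    """Quantise each frame-level LLD vector against the codebook, count the
    Na closest audio words, and return a fixed-length histogram (the bag)."""
    histogram = np.zeros(len(codebook))
    for frame in llds:
        distances = np.linalg.norm(codebook - frame, axis=1)
        closest = np.argsort(distances)[:num_assignments]
        histogram[closest] += 1
    return histogram

rng = np.random.default_rng(0)
# Toy stand-ins for real frame-level LLDs (e.g. 65-dimensional ComParE LLD frames).
train_llds = rng.normal(size=(5000, 65))
clip_llds = rng.normal(size=(300, 65))

codebook = random_codebook(train_llds, codebook_size=500, rng=rng)   # Cs = 500
bag = bag_of_audio_words(clip_llds, codebook, num_assignments=10)    # Na = 10
```

In the experiments, Cs and Na are tuned over the grids listed in Section 3.4, and the resulting bag replaces the frame-level LLDs as the utterance-level representation.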

3.2 Emotional Speech Database
Despite being a well-known challenge for speech-based emotion recognition [28], there is still a comparative lack of studies which address this task in realistic data conditions. In this regard, we test all feature representations using three versions of the popular FAU-AIBO Emotion Corpus (FAU-AIBO). This database is a corpus of German children communicating with Sony's AIBO pet robot [31]. The speech is spontaneous, as the children were instructed to talk to AIBO as they would to a friend. The robot was controlled in a wizard-of-oz scenario, and the human operator would sometimes make AIBO deliberately misbehave in order to provoke an emotional reaction from the child participant. The data was recorded with both a close-talk (clean) microphone and a room (noisy) microphone from a video camera at approximately 3 m distance from the participant. The noisy recordings contain a range of reverberation and background noises; we therefore also test all features on a denoised (densd) version of these recordings, cleaned with a state-of-the-art recurrent neural network speech enhancement system (cf. Section 3.3).

The corpus can be divided into speaker-independent training and test partitions of either 2 or 5 emotional classes (cf. Tables 3 and 4). Due to the presence of reverberation and background noises rendering some of the noisy speech samples inaudible, there is a greater number of clean utterances. To ensure a matched number of utterances in each condition, we only used clean recordings for which there was a matched noisy recording. The number of utterances per emotion in the train and test partitions is given for the 2-class problem in Table 3, and for the 5-class problem in Table 4. Note that the 2009 Interspeech Computational Paralinguistics Challenge [27] used the complete set of clean utterances (a total of 18 216 utterances); the results presented in this paper are therefore not directly comparable with those found using the 2009-challenge data.

Table 3: The two different emotion categories – Idle (IDL) and Negative (NEG) – and the number of training and test utterances in each for the FAU-AIBO Emotion Corpus.

  Class    Train    Test     Total
  IDL      5 966    5 468    11 434
  NEG      3 224    2 418     5 642
  Total    9 190    7 886    17 076

Table 4: The five different emotion categories and the number of training and test utterances in each for the FAU-AIBO Emotion Corpus.

  Emotion     Train    Test     Total
  Angry         839      600     1 439
  Emphatic    2 013    1 481     3 494
  Neutral     5 026    5 082    10 108
  Positive      633      206       839
  Rest          679      517     1 196
  Total       9 190    7 886    17 076

3.3 Speech Enhancement
To test the effect of denoising on the different feature representations, the noisy data is filtered with a long short-term memory (LSTM) deep recurrent neural network (DRNN) architecture proposed in [34, 35]. This network has 100 input neurons, matching the input feature dimensionality of 100 Mel spectra extracted from the noisy speech data. This is followed by three LSTM-RNN layers of 256 neurons, interspaced by feed-forward layers of 64 neurons with hyperbolic tangent activations. The output is a 100-dimensional mask, which indicates which frequency bands should be suppressed and which should be enhanced. The network was trained on several noisy and reverberated versions of the Audio-Visual Interest Corpus; for full details, the reader is referred to [34].
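
For orientation, the block below sketches in PyTorch how a mask-estimating network with the shape described in Section 3.3 could look. The exact layer ordering, output non-linearity and training objective of the model in [34, 35] are not given here, so every such detail in this sketch is an assumption made purely for illustration.

```python
import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    """Assumed sketch of the Section 3.3 enhancement network: 100 Mel-spectrum
    inputs, three 256-unit LSTM layers interspaced with 64-unit tanh
    feed-forward layers, and a 100-dimensional per-frame mask output."""

    def __init__(self, n_mels=100, lstm_units=256, ff_units=64):
        super().__init__()
        layers = []
        in_dim = n_mels
        for _ in range(3):
            layers.append(nn.LSTM(in_dim, lstm_units, batch_first=True))
            layers.append(nn.Sequential(nn.Linear(lstm_units, ff_units), nn.Tanh()))
            in_dim = ff_units
        self.layers = nn.ModuleList(layers)
        # A sigmoid keeps the mask in [0, 1]; this choice is an assumption.
        self.mask_head = nn.Sequential(nn.Linear(ff_units, n_mels), nn.Sigmoid())

    def forward(self, noisy_mels):                  # (batch, frames, 100)
        x = noisy_mels
        for layer in self.layers:
            if isinstance(layer, nn.LSTM):
                x, _ = layer(x)                     # keep the full output sequence
            else:
                x = layer(x)
        return self.mask_head(x)                    # per-frame mask, (batch, frames, 100)

model = MaskEstimator()
noisy = torch.randn(1, 50, 100)                     # one clip: 50 frames of 100 Mel bands
enhanced = model(noisy) * noisy                     # apply the estimated mask
```
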
3.4 Classification Set-up
The BoAW representations were generated using our open-source openXBOW toolkit [25]. An extensive iterative search was performed to identify the codebook size (Cs ∈ {10, 20, 50, 100, 200, 500, 1 k, 2 k, 5 k}) and the number of assignments (Na ∈ {10, 20, 50, 100, 200, 500}), with random assignments being used to generate all codebooks. The deep spectrum features were extracted as per Section 2.

All feature representations were fed into a linear support vector machine (SVM) implemented using the scikit-learn toolbox. The SVMs were trained using stochastic gradient descent, with the gradient of the loss being estimated per sample and the model being sequentially updated. The regularisation term (α) was optimised on a scale from {1, 2, 5} · 10^-6 to {1, 2, 5} · 10^1 using a speaker-independent 2-fold cross-validation procedure on the training set. As in [27], all results reported are for the FAU-AIBO test set, with the corresponding models trained on the full training set (cf. Table 3 and Table 4). Results are given in terms of Unweighted Average Recall (UAR); this is the standard measure of the Interspeech Computational Paralinguistics Challenges and is suitable for use when the distribution among classes is not balanced. We also investigate the effect of upsampling the minority class(es) to overcome potential effects of the class imbalances. All minority class(es) are randomly upsampled to 0.75 times the size of the majority class; this factor was determined empirically in preliminary investigations. A code sketch of this set-up is provided below.

4 RESULTS
When using the clean and unbalanced training data, the eGeMAPS features achieved (clean) test set UARs of 0.630 and 0.268 for the 2- and 5-class set-ups, respectively (cf. Table 5). Interestingly, in the other matched-noise-type systems we observe a slight increase in the (unbalanced) 2-class UARs; these conditions achieved the strongest 2-class UARs of 0.655 for this feature set. Random oversampling appears to be more beneficial in the 5-class set-up than in the 2-class set-up when using eGeMAPS.
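
The sketch below illustrates the classification set-up of Section 3.4 with scikit-learn: random oversampling of the minority classes to 0.75 times the majority-class count, a linear SVM trained with per-sample stochastic gradient descent (SGDClassifier with hinge loss), a sweep over the stated α grid, and Unweighted Average Recall as the metric. The toy data, the direct evaluation on a held-out split, and all names are placeholders; the paper tuned α with a speaker-independent 2-fold cross-validation on the training set, which is not reproduced here.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import recall_score

def upsample_minorities(X, y, factor=0.75, seed=0):
    """Randomly upsample every minority class to `factor` times the
    size of the majority class (cf. Section 3.4)."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = int(factor * counts.max())
    X_parts, y_parts = [X], [y]
    for cls, count in zip(classes, counts):
        if count < target:
            idx = np.flatnonzero(y == cls)
            extra = rng.choice(idx, size=target - count, replace=True)
            X_parts.append(X[extra])
            y_parts.append(y[extra])
    return np.concatenate(X_parts), np.concatenate(y_parts)

def uar_for_alpha(X_train, y_train, X_eval, y_eval, alpha):
    """Linear SVM via per-sample SGD (hinge loss); UAR is macro-averaged recall."""
    clf = SGDClassifier(loss='hinge', alpha=alpha, random_state=0)
    clf.fit(X_train, y_train)
    return recall_score(y_eval, clf.predict(X_eval), average='macro')

# Toy stand-ins for real utterance-level features (e.g. 4 096-dim deep spectrum vectors).
rng = np.random.default_rng(0)
X_tr, y_tr = rng.normal(size=(400, 64)), rng.integers(0, 2, size=400)
X_te, y_te = rng.normal(size=(150, 64)), rng.integers(0, 2, size=150)

X_bal, y_bal = upsample_minorities(X_tr, y_tr)
alpha_grid = [c * 10.0 ** e for e in range(-6, 2) for c in (1, 2, 5)]   # {1,2,5}*10^-6 .. 10^1
uars = {alpha: uar_for_alpha(X_bal, y_bal, X_te, y_te, alpha) for alpha in alpha_grid}
best_alpha = max(uars, key=uars.get)
print(f'best alpha: {best_alpha:g}, UAR: {uars[best_alpha]:.3f}')
```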

