Multi-task Multi-modal Models for Collective Anomaly Detection

Tsuyoshi Idé, Dzung T. Phan, Jayant Kalagnanam
IBM Research, T. J. Watson Research Center
Email: tide@us.ibm.com, phandu@us.ibm.com, jayant@us.ibm.com

Abstract—This paper proposes a new framework for anomaly detection when collectively monitoring many complex systems. The prerequisite for condition-based monitoring in industrial applications is the capability of (1) capturing multiple operational states, (2) managing many similar but different assets, and (3) providing insights into the internal relationship of the variables. To meet these criteria, we propose a multi-task learning approach based on a sparse mixture of sparse Gaussian graphical models (GGMs). Unlike existing fused- and group-lasso-based approaches, each task is represented by a sparse mixture of sparse GGMs, and can handle multi-modalities. We develop a variational inference algorithm combined with a novel sparse mixture weight selection algorithm. To handle issues in the conventional automatic relevance determination (ARD) approach, we propose a new ℓ0-regularized formulation that has guaranteed sparsity in mixture weights. We show that our framework eliminates well-known issues of numerical instability in the iterative procedure of mixture model learning. We also show better performance in anomaly detection tasks on real-world data sets. To the best of our knowledge, this is the first proposal of multi-task GGM learning allowing multi-modal distributions.

1. Introduction

Keeping good operational conditions of industrial equipment is a major business interest across many industries. Although detecting indications of system malfunctions from noisy sensor data is sometimes challenging even to seasoned engineers, statistical machine learning has a lot of potential to automatically capture major patterns of normal operating conditions for condition-based monitoring (CbM). In a typical setting, physical sensor data from multiple sensors are taken as the input, and anomaly scores, numerical values representing the degree of anomalousness of the operational state, are computed. Then human administrators decide to take actions to mitigate the risk of e.g. service interruptions.

We are interested in the scenario where there is a collection of many assets that are similar but not identical, and we wish to develop a comprehensive monitoring system by leveraging the commonality of those assets while paying attention to the individuality of each. This is a frequently encountered problem in Internet-of-Things (IoT) applications. For example, a car company may wish to build a fault detection model for thousands of electric vehicles in a certain area. Since the occurrence of vehicle malfunction is quite rare, it is tempting to combine information from individual vehicles to get some common insights. On the other hand, since driving conditions can be significantly different for each driver, the model should capture the individuality of the individual vehicles.

[Figure 1. Overall model structure of the multi-task multi-modal model for collective condition-based monitoring: systems 1, ..., S share a meta-model of K components.]

To formalize the task of anomaly detection for a collection of systems, we leverage the framework of multi-task learning (MTL) [1]. In practical CbM scenarios, firstly, anomaly detection models must not be a black box.
A detection model has to provide quantitative insights into the individual role of each variable, based on which human operators can see what is really happening in the system. Secondly, a detection model must handle a variety of normal operating conditions, i.e., multi-modality. For example, sensor data from an electric vehicle may have drastically different statistical natures between when starting the engine and when cruising on highways.

To this end, we focus on multi-task learning of Gaussian graphical models (GGMs). Thanks to sparsity-enforcing regularization techniques [2], [3], GGMs are known to be a powerful tool in anomaly detection from the viewpoint of interpretability and robustness to noise [4], [5]. To learn GGMs in the multi-task setting, mainly three types of techniques have been proposed so far: group-lasso-based [6], [7],

[8], fused-lasso-based [9], [10], and Bayesian methods [11]. However, most of the existing studies aim at learning a single common graph across the tasks and are unable to handle the multi-modal nature of the real world.

The main motivation of this paper is to extend existing work to be able to handle multi-modalities and to propose a practical framework for collective CbM. As illustrated in Fig. 1, our model lets all the S tasks (or systems) share the K sparse GGMs as a "pattern dictionary." The individuality of each task is represented by the mixture weights over those K patterns. The mixture weights and the K GGMs are learned from data based on a Bayesian formulation.

The contribution of this paper is threefold:
- The first proposal of a multi-task multi-modal GGM learning model.
- The first derivation of a variational Bayes algorithm having guaranteed sparsity in both the variable relationships and the mixture weights.
- The first proposal of a practical CbM framework for a fleet of assets.

Regarding the second point, we propose a novel ℓ0-regularized formulation for mixture weight determination. This indeed sheds a new mathematical light on the traditional notion of automatic relevance determination (ARD) [12], [13], [14] for Bayesian mixture models.

2. Problem setting

2.1. Data and notations

We are given a training data set \mathcal{D} = \mathcal{D}^1 \cup \cdots \cup \mathcal{D}^S, where \mathcal{D}^s is the data set for the s-th system or task (the term task is used interchangeably with system in this paper). \mathcal{D} is assumed to be collected under the normal conditions of the systems. Each \mathcal{D}^s is a set of N^s samples:

    \mathcal{D}^s = \{ x^{s(n)} \in \mathbb{R}^M \mid n = 1, \ldots, N^s \},    (1)

where M is the dimensionality of the samples (or the number of sensors), which is assumed to be the same across the tasks. We let S be the total number of tasks and N \equiv \sum_{s=1}^S N^s be the total number of samples. We use superscripts to represent the sample and task indexes. Vectors are represented in bold face, e.g. x^{s(n)} = (x_1^{s(n)}, \ldots, x_i^{s(n)}, \ldots, x_M^{s(n)})^\top, and matrices are represented in sans-serif face, e.g. \Lambda^k = (\Lambda^k_{i,j}). The elements of vectors and matrices are denoted with subscripts. As outlined in the Introduction, we use a mixture model to capture multi-modalities. Each mixture component is indexed typically by k (and sometimes l), which appears either as a super- or subscript (see Sec. 3 for the details).

2.2. Anomaly score

Our goal is to compute the anomaly score for a (set of) new sample(s) observed in an arbitrary task. For a new sample x in the s-th task, following [15], we define the overall anomaly score as

    a^s(x) = -\ln p^s(x \mid \mathcal{D}),    (2)

up to unimportant additive and multiplicative constants, where p^s(\cdot \mid \mathcal{D}) is the predictive distribution of the s-th task, which is to be learned based on the training data \mathcal{D} (eventually given by Eq. (34)).

In addition to the overall anomaly score, we also define the variable-wise anomaly scores using the negative log conditional predictive distribution as

    a_i^s(x) = -\ln p^s(x_i \mid x_{-i}, \mathcal{D}),    (3)

where a_i^s denotes the anomaly score for the i-th variable at the s-th task, and x_{-i} \equiv (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_M)^\top. To compute this, we need detailed information on the variable dependency. Unlike other outlier detection methods such as one-class support vector machines [16], GGMs provide a clear-cut way of computing the predictive conditional distribution. This is a major reason why we focus on GGM-based anomaly detection methods in this paper. As long as a GGM is used as the basic building block of the model, the conditional distribution can easily be obtained via the standard partitioning formula of Gaussians [13].

Since sensor data in industrial applications are very noisy in general, we are often interested in anomaly scores averaged over a sliding window. If we denote the window by \mathcal{D}^s_{\mathrm{test}}, the averaged versions of the anomaly scores are defined as

    a^s(\mathcal{D}^s_{\mathrm{test}}) = -\frac{1}{|\mathcal{D}^s_{\mathrm{test}}|} \sum_{x^s \in \mathcal{D}^s_{\mathrm{test}}} \ln p^s(x^s \mid \mathcal{D}),    (4)

    a_i^s(\mathcal{D}^s_{\mathrm{test}}) = -\frac{1}{|\mathcal{D}^s_{\mathrm{test}}|} \sum_{x^s \in \mathcal{D}^s_{\mathrm{test}}} \ln p^s(x_i^s \mid x_{-i}^s, \mathcal{D}),    (5)

where |\mathcal{D}^s_{\mathrm{test}}| is the size of the set \mathcal{D}^s_{\mathrm{test}}.
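To make Eqs. (2)-(5) concrete, here is a minimal sketch of how the scores can be computed for a single Gaussian component with a given precision matrix, using the standard partitioning formula for the conditional density. This is our illustration (function names such as score_overall are ours), not the full predictive distribution of Eq. (34), which is a task-specific mixture.

```python
import numpy as np
from scipy.stats import multivariate_normal

def score_overall(x, mean, precision):
    """Overall anomaly score, Eq. (2): negative log density under N(mean, precision^-1)."""
    cov = np.linalg.inv(precision)
    return -multivariate_normal(mean=mean, cov=cov).logpdf(x)

def score_variablewise(x, mean, precision):
    """Variable-wise scores, Eq. (3). For a Gaussian with precision L, x_i | x_{-i} is Gaussian
    with variance 1/L_ii and mean m_i - (1/L_ii) * sum_{j != i} L_ij (x_j - m_j)."""
    M = len(x)
    scores = np.empty(M)
    for i in range(M):
        l_ii = precision[i, i]
        cond_mean = mean[i] - (precision[i] @ (x - mean) - l_ii * (x[i] - mean[i])) / l_ii
        cond_var = 1.0 / l_ii
        scores[i] = 0.5 * np.log(2 * np.pi * cond_var) + 0.5 * (x[i] - cond_mean) ** 2 / cond_var
    return scores

def score_window(X_test, mean, precision):
    """Sliding-window average of the overall score, Eq. (4)."""
    return np.mean([score_overall(x, mean, precision) for x in X_test])
```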
2.3. Motivating example

To fully understand the need for multi-task multi-modal models in real applications, consider a one-dimensional (1D) two-task example. To be specific, imagine we are monitoring two vehicles (two tasks) through the temperature (single variable) of a wheel axle of each car. In Fig. 2, the top row is for Vehicle 1 and the bottom row is for Vehicle 2. The histograms in the same row are all the same, showing the empirical distribution (i.e. ground truth) of the temperature. Since driving conditions should be different between Vehicles 1 and 2, the histogram for Vehicle 1 is different from that of Vehicle 2. Possibly due to weather conditions (rainy or not), it is likely for the temperature to have a bi-modal distribution. This is a simple example of a multi-task and multi-modal situation.

To fit the empirical distribution, the figure compares three different approaches, corresponding to the columns: the multi-task multi-modal model (MTL-MM), a non-MTL Gaussian mixture (GMM), and a single-modal MTL model.
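The situation in Fig. 2 is easy to reproduce with a toy script (ours, with arbitrary numbers): a Gaussian mixture fit on the pooled data ignores the task-specific mode weights, while a single Gaussian per task cannot express the bimodality at all.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Two tasks, one variable, two operating modes with task-dependent mode weights.
task1 = np.concatenate([rng.normal(0.0, 0.3, 700), rng.normal(3.0, 0.3, 300)]).reshape(-1, 1)
task2 = np.concatenate([rng.normal(0.0, 0.3, 200), rng.normal(3.0, 0.3, 800)]).reshape(-1, 1)

# Pooled GMM: shared mixture weights, so the per-task mode proportions are lost.
pooled_gmm = GaussianMixture(n_components=2, random_state=0).fit(np.vstack([task1, task2]))
# Single Gaussian per task: task-specific, but unimodal.
single_gauss = [GaussianMixture(n_components=1, random_state=0).fit(X) for X in (task1, task2)]

for s, X in enumerate((task1, task2), start=1):
    print(f"task {s}: pooled GMM avg log-lik = {pooled_gmm.score(X):.2f}, "
          f"single Gaussian avg log-lik = {single_gauss[s - 1].score(X):.2f}")
```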

[Figure 2. Example of multi-modal distributions with task-dependence. The one-variable (M = 1), two-task (S = 2) case is shown. The histograms show the empirical distribution (ground truth) and are fit by three different models: multi-task multi-modal (MTL-MM), standard Gaussian mixture (GMM), and multi-task learning (MTL) models. Only MTL-MM can capture the multi-modality in a task-dependent fashion.]

The curves illustrate typical results of fitting. As shown in the figure, traditional non-MTL Gaussian mixture models (GMM; second column) disregard the individuality of the tasks, and existing GGM-based MTL models (third column) cannot handle the multi-modality, since their goal is to find a single precision matrix on a task-wise basis. It is clear that these models lead to significant error in anomaly detection. Our goal is to develop a GGM-based MTL model that is capable of handling the multi-modality while taking advantage of task-relatedness. Although this is an illustration with a 1D model, we are interested in modeling multivariate systems, i.e., M > 1.

3. Multi-task sparse GGM mixture

To capture multiple operational states of the systems, we employ a novel probabilistic Gaussian mixture model featuring double sparsity: sparsity in the dependency structure of the GGMs and sparsity over the mixture components. This section focuses mainly on the former along with the overall framework.

3.1. Observation model and priors

We employ a Bayesian Gaussian mixture model having K mixture components. First, we define the observation model of the s-th task by

    p(x^s \mid z^s, \mu, \Lambda) = \prod_{k=1}^{K} \mathcal{N}(x^s \mid \mu^k, (\Lambda^k)^{-1})^{z_k^s},    (6)

where \mu and \Lambda are collective notations representing {\mu^k} and {\Lambda^k}, respectively. Also, z^s is the indicator variable of cluster assignment. As usual, z_k^s \in \{0, 1\} for all s, and \sum_{k=1}^K z_k^s = 1.

We place the Gauss-Laplace prior on (\mu^k, \Lambda^k) and the categorical distribution on z^s:

    p(\mu^k, \Lambda^k) = \mathcal{N}(\mu^k \mid m_0, (\lambda_0 \Lambda^k)^{-1})\, \mathrm{Lap}(\Lambda^k \mid \rho),    (7)

    \mathrm{Lap}(\Lambda^k \mid \rho) = \left(\frac{\rho}{4}\right)^{M^2} \exp\left( -\frac{\rho}{2} \|\Lambda^k\|_1 \right),    (8)

    p(z^s \mid \pi^s) = \prod_{k=1}^{K} (\pi_k^s)^{z_k^s} \quad \text{s.t.} \quad \sum_{k=1}^{K} \pi_k^s = 1, \ \pi_k^s \ge 0,    (9)

where \|\Lambda^k\|_1 = \sum_{i,j} |\Lambda^k_{i,j}|. The parameter \pi^s is determined as a part of the model, while \rho, \lambda_0, m_0 are given constants. From these equations, we can write down the complete likelihood as

    P(\mathcal{D}, Z, \Lambda, \mu \mid \pi) = \prod_{k=1}^{K} p(\mu^k, \Lambda^k) \prod_{s=1}^{S} \prod_{n=1}^{N^s} p(z^{s(n)} \mid \pi^s)\, p(x^{s(n)} \mid z^{s(n)}, \mu, \Lambda),    (10)

where z^{s(n)} is the cluster assignment variable for the n-th sample in the s-th task. \pi and Z are collective notations for {\pi^s} and {z^{s(n)}}, respectively.

Note that {\mu^k} and {\Lambda^k} are not task-specific and are shared by all the tasks. It is the cluster assignment probability \pi^s that reflects the individuality of the tasks. Thus \pi^s can be used as the signature of the s-th task.
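The generative story of Eqs. (6)-(10) is compact when written as code: all tasks draw from the same K Gaussian "patterns", and only the mixture weights π^s differ per task. The sketch below (our illustration; the parameter values are arbitrary) samples data accordingly.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task(n, pi_s, means, precisions):
    """Draw n samples for one task: z ~ Cat(pi_s), x ~ N(mu^k, (Lambda^k)^-1)  (Eqs. (6), (9))."""
    K = len(pi_s)
    z = rng.choice(K, size=n, p=pi_s)                  # cluster assignments z^{s(n)}
    covs = [np.linalg.inv(P) for P in precisions]      # (Lambda^k)^-1
    X = np.array([rng.multivariate_normal(means[k], covs[k]) for k in z])
    return X, z

# Shared "pattern dictionary": K = 2 components in M = 2 dimensions (illustrative values).
means = [np.zeros(2), np.array([3.0, -3.0])]
precisions = [np.eye(2), np.array([[2.0, 0.5], [0.5, 2.0]])]

# Task individuality lives only in the mixture weights pi^s.
X1, _ = sample_task(500, np.array([0.8, 0.2]), means, precisions)   # task 1
X2, _ = sample_task(500, np.array([0.3, 0.7]), means, precisions)   # task 2
```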

3.2. Variational Bayes inference

A general goal of Bayesian formulations is to find the posterior distributions. We leverage the variational Bayes (VB) approach [13] to get a tractable algorithm. The central assumption of VB is that the posterior distribution has a factorized form. In this case, we assume the categorical distribution for Z and the Gauss-delta distribution for (\mu, \Lambda):

    q(Z) = \prod_{s=1}^{S} \prod_{n=1}^{N^s} \prod_{k=1}^{K} (r_k^{s(n)})^{z_k^{s(n)}},    (11)

    q(\mu, \Lambda) = \prod_{k=1}^{K} \mathcal{N}(\mu^k \mid m^k, (\lambda^k \Lambda^k)^{-1})\, \delta(\Lambda^k - \bar{\Lambda}^k),    (12)

where \delta(\cdot) is Dirac's delta function and {r_k^{s(n)}, m^k, \lambda^k, \bar{\Lambda}^k} are model parameters to be learned. We combine the VB analysis for {Z, \mu, \Lambda} with point estimation for the mixture weights \pi^s.

In the VB formulation, the model parameters {m^k, \lambda^k, \bar{\Lambda}^k} are determined so that the Kullback-Leibler (KL) divergence between q(Z)q(\mu, \Lambda) and P(\mathcal{D}, Z, \Lambda, \mu \mid \pi) is minimized. It is well-known [13] that minimization of the KL divergence leads to extremely simple iterative equations:

    \ln q(Z) = \mathrm{c.} + \langle \ln P(\mathcal{D}, Z, \Lambda, \mu \mid \pi) \rangle_{\Lambda, \mu},    (13)

    \ln q(\Lambda, \mu) = \mathrm{c.} + \langle \ln P(\mathcal{D}, Z, \Lambda, \mu \mid \pi) \rangle_{Z},    (14)

where c. symbolically represents an unimportant constant, \langle \cdot \rangle_{\Lambda,\mu} is the expectation w.r.t. q(\mu, \Lambda), and \langle \cdot \rangle_{Z} is the expectation w.r.t. q(Z).

To compute these expectations, we need to know the value of \pi. In the proposed VB framework, an optimization problem to determine \pi is solved alternately with Eqs. (13) and (14) until convergence. We will discuss the details in Section 4.

3.3. VB iterative equations

Now let us find explicit expressions of the VB equations (13) and (14). Given {m^k, \lambda^k, \bar{\Lambda}^k} and an initialized \pi^s, the first VB equation (13) gives

    r_k^{s(n)} \leftarrow \exp\left\{ \ln \pi_k^s + \ln \mathcal{N}(x^{s(n)} \mid m^k, (\bar{\Lambda}^k)^{-1}) - \frac{M}{2\lambda^k} \right\},    (15)

    r_k^{s(n)} \leftarrow \frac{r_k^{s(n)}}{\sum_{l=1}^{K} r_l^{s(n)}}.    (16)

To get the first equation, we calculated the expectation w.r.t. \mu^k and \Lambda^k using the expression of Eq. (12). The second equation normalizes the responsibilities so that \sum_{k=1}^{K} r_k^{s(n)} = 1.

To solve the second VB equation (14), we first decompose the posterior as q(\mu, \Lambda) = q(\mu \mid \Lambda) q(\Lambda). For q(\mu^k \mid \Lambda^k), by arranging the terms of \langle \ln P \rangle_{Z} related to \mu^k, we readily get

    N^k = \sum_{s=1}^{S} \sum_{n=1}^{N^s} r_k^{s(n)},    (17)

    \bar{x}^k = \frac{1}{N^k} \sum_{s=1}^{S} \sum_{n=1}^{N^s} r_k^{s(n)} x^{s(n)},    (18)

    \lambda^k = \lambda_0 + N^k,    (19)

    m^k = \frac{1}{\lambda^k} (\lambda_0 m_0 + N^k \bar{x}^k),    (20)

given {\Lambda^k, r_k^{s(n)}}.

For q(\Lambda), the VB equation does not have an analytic solution. We instead find the mode of \ln q(\Lambda) by solving

    \bar{\Lambda}^k = \arg\max_{\Lambda^k} \left\{ \ln |\Lambda^k| - \mathrm{Tr}(\Lambda^k \mathsf{Q}^k) - \frac{\rho}{N^k} \|\Lambda^k\|_1 \right\},    (21)

with

    \Sigma^k = \frac{1}{N^k} \sum_{s=1}^{S} \sum_{n=1}^{N^s} r_k^{s(n)} x^{s(n)} (x^{s(n)})^\top - \bar{x}^k (\bar{x}^k)^\top,    (22)

    \mathsf{Q}^k = \Sigma^k + \frac{\lambda_0}{\lambda^k} (\bar{x}^k - m_0)(\bar{x}^k - m_0)^\top.    (23)

As shown in [2], the objective function in Eq. (21) is convex. This means that the posterior q(\Lambda) is guaranteed to be unimodal, and approximating q(\Lambda) by the delta function is reasonable.

As stated earlier, the VB iterative equations (15)-(23) are combined with point-estimation of \pi^s. The next section discusses the details of the approach.
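One sweep of the updates (15)-(23) can be sketched as follows. This is our paraphrase for fixed mixture weights; the ℓ1-regularized problem (21) is delegated to scikit-learn's graphical_lasso, one possible solver for that step.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.covariance import graphical_lasso

def vb_iteration(Xs, pis, m, lam, Lam_bar, m0, lam0, rho):
    """One sweep of Eqs. (15)-(23). Xs: list of (N^s, M) arrays, one per task; pis: list of
    mixture-weight vectors; m, Lam_bar: lists of component means/precisions; lam: list of scalars."""
    K, M = len(m), Xs[0].shape[1]
    # E-step, Eqs. (15)-(16): responsibilities r_k^{s(n)} for every task and sample.
    R = []
    for X, pi_s in zip(Xs, pis):
        logr = np.column_stack([
            np.log(pi_s[k] + 1e-300)
            + multivariate_normal(mean=m[k], cov=np.linalg.inv(Lam_bar[k])).logpdf(X)
            - M / (2.0 * lam[k])
            for k in range(K)
        ])
        logr -= logr.max(axis=1, keepdims=True)        # stabilize before exponentiation
        r = np.exp(logr)
        R.append(r / r.sum(axis=1, keepdims=True))     # Eq. (16)
    # M-step, Eqs. (17)-(23): update each shared component.
    for k in range(K):
        Nk = sum(r[:, k].sum() for r in R)                                    # Eq. (17)
        xbar = sum(r[:, k] @ X for X, r in zip(Xs, R)) / Nk                   # Eq. (18)
        lam[k] = lam0 + Nk                                                    # Eq. (19)
        m[k] = (lam0 * m0 + Nk * xbar) / lam[k]                               # Eq. (20)
        S_k = sum((X * r[:, [k]]).T @ X for X, r in zip(Xs, R)) / Nk - np.outer(xbar, xbar)  # Eq. (22)
        Q_k = S_k + (lam0 / lam[k]) * np.outer(xbar - m0, xbar - m0)          # Eq. (23)
        _, Lam_bar[k] = graphical_lasso(Q_k, alpha=rho / Nk)                  # Eq. (21)
    return R, m, lam, Lam_bar
```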

4. Sparse mixture weight selection

This section introduces a novel formulation to find a sparse solution for {\pi^s}.

4.1. Conventional ARD approach

To determine \pi, the conventional VB formulation [13] maximizes \langle \ln P(\mathcal{D}, Z, \Lambda, \mu \mid \pi) \rangle_{\Lambda,\mu,Z} under the normalization condition. With Eqs. (9) and (10), we readily have

    \langle \ln P(\mathcal{D}, Z, \Lambda, \mu \mid \pi) \rangle_{\Lambda,\mu,Z} = \mathrm{c.} + \sum_{n=1}^{N^s} \sum_{k=1}^{K} \langle z_k^{s(n)} \rangle_{Z} \ln \pi_k^s

as a function of \pi^s. The expectation is computed using Eq. (11) as \langle z_k^{s(n)} \rangle_{Z} = r_k^{s(n)}. Now the optimization problem we solve reads

    \max_{\pi^s} \sum_{k=1}^{K} c_k^s \ln \pi_k^s \quad \text{s.t.} \quad \|\pi^s\|_1 = 1,    (24)

where \|\pi^s\|_1 is the ℓ1 norm of \pi^s, and we defined

    c_k^s \equiv \frac{1}{N^s} \sum_{n=1}^{N^s} r_k^{s(n)},    (25)

so that \sum_{k=1}^{K} c_k^s = 1 holds.

By introducing a Lagrange multiplier for the constraint \|\pi^s\|_1 = 1, it is straightforward to show that the optimal solution is given by

    \pi_k^s = c_k^s.    (26)

As discussed in [14], when combined with a small threshold value below which \pi_k^s is regarded as zero, the problem (24) often gives a sparse solution, which is sometimes referred to as an instance of automatic relevance determination (ARD). However, \pi_k^s cannot be mathematically zero because of the logarithm function. This means that the sparsity is governed by the heuristically provided numerical threshold, and the convergence of the VB algorithm may depend on chance. In fact, the conventional VB iterative algorithm is known to be sometimes numerically unstable. This can be a serious issue especially in the multi-task anomaly detection scenario, since we need to manage S different anomaly detection models at once.

Keeping this fundamental limitation of the conventional VB formulation in mind, we introduce a new formulation for sparse mixture weight selection in the next subsection.

4.2. Convex mixed-integer programming approach

To achieve sparsity in a mathematically well-defined fashion in (24), first, we explicitly impose regularization on \pi^s. Similarly to the Laplace prior on \Lambda^k, let us formally assume that \pi^s has a prior in the form of p(\pi^s) \propto \exp(-\tau \|\pi^s\|_0 / N^s), where \|\cdot\|_0 denotes the ℓ0-norm (the number of nonzeros), and \tau > 0 is a constant assumed to be given. The optimization problem now looks like:

    \max_{\pi^s} \left\{ \sum_{k=1}^{K} c_k^s \ln \pi_k^s - \tau \|\pi^s\|_0 \right\} \quad \text{s.t.} \quad \|\pi^s\|_1 = 1.    (27)

Obviously, the solution (26) is recovered when \tau = 0. Note that we cannot use the ℓ1 norm here because of the constraint \|\pi^s\|_1 = 1.

Second, we formally define the notion of ε-sparsity:

Definition 1. For a given small \epsilon > 0, a vector x is called an \epsilon-sparse solution if many elements satisfy x_i \le \epsilon.

Third and finally, we modify the problem (27) into a convex mixed-integer programming (MIP) problem to get an \epsilon-sparse solution:

    \max_{\pi^s, y^s} \sum_{k=1}^{K} \left\{ c_k^s y_k^s \ln \pi_k^s - \tau y_k^s \right\} \quad \text{s.t.} \quad \sum_{k=1}^{K} \pi_k^s = 1, \ \pi_k^s \le \epsilon + y_k^s, \ y_k^s \in \{0, 1\} \ \text{for } k = 1, \ldots, K,    (28)

where 0 < \epsilon < 1 is another constant controlling the sparsity. y^s plays the role of an indicator variable of \pi^s. Notice that the inequality constraint \pi_k^s \le \epsilon + y_k^s guarantees that y_k^s = 1 when \pi_k^s > \epsilon and y_k^s = 0 when \pi_k^s \le \epsilon. The latter follows from the fact that y_k^s = 1 gives a smaller objective value and can be ignored when seeking an optimal solution. Thus we see that \|\pi^s\|_0 is equal to \sum_{k=1}^{K} y_k^s.

We also see that the problem (28) is convex. By directly calculating the second derivatives w.r.t. \pi^s and y^s, we see that the Hessian is a 2K \times 2K diagonal matrix whose diagonal elements are either -c_k^s / (\pi_k^s)^2 (k = 1, \ldots, K) or 0. Since c_k^s \ge 0, the Hessian is negative semi-definite. Also, all decision variables are bounded in [0, 1], and every constraint is linear. Thus the problem (28) is a convex mixed-integer programming problem with a bounded polyhedron feasible set.

Since \epsilon is just an explicit representation of the threshold value that has been used heuristically [13], and the objective function is dominated by the first term when \tau > 0 is small, we conclude that the problem (28) is a mathematically well-defined surrogate of the original (24).

Let us formally summarize the above discussion:

Theorem 1 (Convex MIP mixture weight selection).
(i) The problem (28) is a convex mixed-integer programming problem with a bounded polyhedron feasible set.
(ii) The problem (28) generates an \epsilon-sparse solution for a suitable selection of \tau.
(iii) There exist small enough positive numbers \tau and \epsilon such that (26) is a solution of (28).

4.3. Solving Eq. (28)

Although solving a MIP generally involves exhaustive combinatorial search and is thus computationally very expensive, we can derive an efficient algorithm for the problem (28). The strategy is simple. We find a solution of (28) for each value of \sum_k y_k^s, and pick the best one from them. This is a practical approach since K is on the order of 10 in most CbM scenarios. In this subsection, we illustrate the outline of the approach. For proofs and more detailed discussions, the reader can refer to our companion paper [17].

Without loss of generality, we can assume that {c_k^s} have been sorted in increasing order, c_1^s \le \cdots \le c_K^s. Since the objective is symmetric w.r.t. k, in order to remove duplicated solutions, we also assume \pi_i^s \le \pi_j^s when c_i^s = c_j^s and i < j. In that case, since we are solving a maximization problem, we intuitively expect that, for a given K_0 \equiv K - \sum_k y_k^s,

    y_1^s = \cdots = y_{K_0}^s = 0, \qquad y_{K_0+1}^s = \cdots = y_K^s = 1,    (29)

because this choice keeps as many of the larger c_k^s's as possible. Based on this, we can eliminate y^s from (28) to define a K_0-specific problem:

    \max_{\pi^s} \sum_{k=1}^{K} c_k^s \ln \pi_k^s \quad \text{s.t.} \quad \sum_{k=1}^{K} \pi_k^s = 1, \ \pi_k^s \le \epsilon \ \text{for } k = 1, \ldots, K_0.    (30)

To find the optimality condition, we define the Lagrange function as

    L(\pi^s, \alpha^s, \eta^s) = \sum_{k=1}^{K} c_k^s \ln \pi_k^s - \eta^s \sum_{k=1}^{K} \pi_k^s + \sum_{k=1}^{K_0} \alpha_k^s (\epsilon - \pi_k^s),

where {\alpha_k^s} and \eta^s are Lagrange multipliers. By differentiating L w.r.t. \pi^s, we have the Karush-Kuhn-Tucker (KKT) conditions for the problem (30):

    \frac{c_k^s}{\pi_k^s} = \begin{cases} \eta^s + \alpha_k^s, & k \le K_0 \\ \eta^s, & k > K_0, \end{cases}    (31)

    \alpha_k^s (\epsilon - \pi_k^s) = 0, \quad \alpha_k^s \ge 0 \ \text{for } k \le K_0.    (32)

This leads to the solution for the assumed K_0:

    \pi_k^s(K_0) = \begin{cases} \epsilon, & k \le K_0 \ \text{and} \ c_k^s \ge \epsilon \eta^s, \\ c_k^s / \eta^s, & \text{otherwise}, \end{cases}    (33)

where the condition c_k^s \ge \epsilon \eta^s comes from \alpha_k^s \ge 0. The multiplier \eta^s is determined so that \sum_{k=1}^{K} \pi_k^s = 1. It is easy to verify that Eq. (33) satisfies the KKT conditions. For more mathematical details, see our companion paper [17].

The solution \pi_k^s(K_0) is computed for different K_0's, and we pick the one which gives the maximum objective value of Eq. (28) (not (30)). The computational cost to find the solution is on the order of K^2 in the worst case.
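With {c_k^s} sorted in increasing order, the K_0 sweep admits a short implementation. The sketch below is our simplified reading of Eqs. (29)-(33): for each K_0 it pins the K_0 smallest weights at ε, spreads the remaining mass over the other components in proportion to c_k^s, and keeps the K_0 with the largest objective (28); the corner case c_k^s < ε η^s of Eq. (33) is glossed over.

```python
import numpy as np

def select_mixture_weights(c, tau, eps):
    """Epsilon-sparse mixture weight selection by sweeping K0 (Eqs. (28)-(33)), simplified sketch.
    c: per-component averaged responsibilities (Eq. (25)), assumed sorted in increasing order."""
    K = len(c)
    best_obj, best_pi = -np.inf, None
    for K0 in range(K):                        # K0 = number of components pinned at eps
        pi = np.full(K, eps)
        active = np.arange(K0, K)              # keep the K - K0 largest c_k (Eq. (29))
        # Spread the remaining probability mass over active components, pi_k proportional to c_k.
        pi[active] = (1.0 - K0 * eps) * c[active] / c[active].sum()
        obj = np.sum(c[active] * np.log(pi[active])) - tau * len(active)   # objective of Eq. (28)
        if obj > best_obj:
            best_obj, best_pi = obj, pi
    return best_pi

# Example: responsibilities concentrated on three of ten components.
c = np.array([0.001, 0.002, 0.003, 0.004, 0.01, 0.02, 0.03, 0.21, 0.32, 0.4])
pi = select_mixture_weights(c, tau=0.1, eps=1e-4)   # most entries end up pinned at eps
```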

4.4. Algorithm summary and remarks

Equations (15)-(23) and (28) are iteratively computed for all the components k and the tasks s until convergence. Notice that the equation for \bar{\Lambda}^k preserves the original ℓ1-regularized GGM formulation [3]; to solve Eq. (21), we can use, e.g., the graphical lasso algorithm [3]. We also see that the fewer samples a cluster has, the more ℓ1 regularization is applied, due to the \rho / N^k term. This means that we do not trust samples assigned to minor clusters too much.

Once all the model parameters are found, the predictive distribution is given, with \mathsf{A}^k \equiv \frac{\lambda^k}{1 + \lambda^k} \bar{\Lambda}^k, by

    p^s(x^s \mid \mathcal{D}) = \sum_{k=1}^{K} \pi_k^s \int d\mu^k\, d\Lambda^k\, \mathcal{N}(x^s \mid \mu^k, (\Lambda^k)^{-1})\, q(\mu^k, \Lambda^k)
                              = \sum_{k=1}^{K} \pi_k^s\, \mathcal{N}(x^s \mid m^k, (\mathsf{A}^k)^{-1}).    (34)

Algorithm 1 summarizes the proposed MTL-MM algorithm.

Algorithm 1 Multi-task multi-modal GGM (MTL-MM)
  procedure MTL-MM(D, λ_0, m_0, ρ, ε, τ)
    Initialize {(m^k, Λ̄^k)}; λ^k ← λ_0 for all k; π_k^s ← 1/K for all k, s
    repeat
      for s = 1, ..., S do
        for n = 1, ..., N^s do
          for k = 1, ..., K do
            r_k^{s(n)} ← Eq. (15)
          end for
          r_k^{s(n)} ← r_k^{s(n)} / Σ_{l=1}^K r_l^{s(n)}
        end for
      end for
      for k = 1, ..., K do
        for s = 1, ..., S do
          π_k^s ← Eq. (28)
        end for
        N^k ← Σ_{s=1}^S Σ_{n=1}^{N^s} r_k^{s(n)}
        λ^k ← λ_0 + N^k
        m^k ← Eq. (20)
        Λ̄^k ← Eq. (21)
      end for
    until convergence
    return {π^s} and {m^k, Λ̄^k, λ^k}
  end procedure

To initialize {m^k, \bar{\Lambda}^k}, in the context of industrial CbM, one reasonable approach is to disjointly partition each data set along the time axis as \mathcal{D}^s = \mathcal{D}_1^s \cup \mathcal{D}_2^s \cup \ldots, and apply the graphical lasso algorithm [3] to each block. For data sets of i.i.d. samples, on the other hand, k-means clustering [13] can be used to get {m^k}, followed by graphical lasso for {\Lambda^k}. The initial number of mixture components K should be large enough to automatically find an optimal number of non-empty clusters, K' \le K. For standardized data, \lambda_0 = 1 and m_0 = 0 are a reasonable choice. For the MIP parameters, \tau can be a value in (0, 1] such as 0.1. Since \epsilon has the meaning of the minimum resolution of a mixture weight (the probability of finding a sample in the cluster), a value such as 10^{-5} should be reasonable. Virtually the only parameter to be determined via cross-validation is \rho. In the context of anomaly detection, \rho is determined so that a performance metric such as the AUC (area under the curve) or the F-measure is maximized.
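After convergence, scoring a new sample in task s only requires Eq. (34) and the returned parameters. The helper below (ours) evaluates the predictive log density with A^k = (λ^k / (1 + λ^k)) Λ̄^k and the resulting anomaly score of Eq. (2).

```python
import numpy as np
from scipy.stats import multivariate_normal

def predictive_logpdf(x, pi_s, m, lam, Lam_bar):
    """log p^s(x | D) of Eq. (34): mixture of Gaussians with precisions A^k = lam^k/(1+lam^k) * Lam_bar^k."""
    log_terms = []
    for k, w in enumerate(pi_s):
        if w <= 0.0:
            continue                                   # pruned (empty) component
        A_k = (lam[k] / (1.0 + lam[k])) * Lam_bar[k]
        cov = np.linalg.inv(A_k)
        log_terms.append(np.log(w) + multivariate_normal(mean=m[k], cov=cov).logpdf(x))
    return np.logaddexp.reduce(log_terms)              # log-sum-exp over the components

def anomaly_score(x, pi_s, m, lam, Lam_bar):
    """Overall anomaly score of Eq. (2) for a new sample x observed in task s."""
    return -predictive_logpdf(x, pi_s, m, lam, Lam_bar)
```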
5. Related work

In the context of anomaly detection, there are three lines of research relevant to the present work: MTL for anomaly detection, Gaussian mixture models, and MTL for GGMs.

As explained in the Introduction, the original concept of MTL is highly tempting for anomaly detection, since anomaly samples are always limited. In fact, there are a number of studies [18], [19], [20] that attempt to pursue MTL-based anomaly detection. However, with these methods, it is not straightforward to compute variable-wise contributions and obtain insights into the internal dependency of multivariate systems, which are an integral part of the practical requirements.

Gaussian mixture models have been used in a wide variety of applications, and numerous prior studies exist, e.g. [10], [15], [21]. However, little is known about how to extend them to MTL in the context of anomaly detection.

Finally, MTL-based sparse GGM learning has been one of the recent hot topics in the machine learning and statistics communities [6], [7], [8], [9], [10], [11]. An MTL-like setting has also been discussed in the context of anomaly detection [5]. However, few of these studies focus on the multi-modality, which is critical in many real industrial applications, especially in anomaly detection.

[Figure 3. Ground-truth precision matrices. See Sec. 6.1.]

[Figure 4. Learned precision matrices: (a) conventional ARD approach, (b) proposed MIP approach. Mixture weights are also shown on the graphs. See Sec. 6.1.]

[Figure 5. Log likelihood towards convergence. The conventional approach ("regular") fails to find the ground truth. See Sec. 6.1.]

6. Experiments

This section shows the utility of the proposed multi-task multi-modal framework with the convex MIP-based mixture weight selection. We first demonstrate the better convergence of the convex MIP formulation for the mixture weights. We then test performance in anomaly detection using synthetic and real-world data.

6.1. Comparison with conventional ARD approach

To test the convex MIP formulation for the mixture weights, we generated a 4-variate (M = 4) synthetic data set. Since Eq. (28) is solved independently for each task s, we simply set S = 1 in this subsection (thus the superscript s will be dropped for now). We randomly generated N = 3,800 samples from a three-component Gaussian mixture with π = (0.4, 0.3, 0.3)^⊤. The first component has the mean (5, 0, 0, 5)^⊤, while both the second and third components share the same mean of (0, 5, 5, 0)^⊤. The precision matrices for these components are shown in Fig. 3.

For initialization, as mentioned in Sec. 4.4, we split the data set into K = 10 disjoint blocks and learned the precision matrix of each block using the graphical lasso algorithm. We chose the parameters as ρ = 0.01, τ = 0.25, ε = 10^{-4}. In the conventional ARD approach (Eq. (26)), we removed components once π_k < ε was satisfied during the iteration. In the proposed MIP method, all the K components are kept during the iteration, and those having π_k ≤ ε are removed from the model upon convergence.

Figure 4 shows the learned precision matrices (in terms of the partial correlation coefficients) and their mixture weights. We see that the proposed method precisely converged to the ground truth (K' = 3) in spite of the initial number of components, K = 10, but the conventional approach produced two spurious components. This is one manifestation of the numerical instabilities of the conventional ARD method.

To get further insights, we monitored the log likelihood as a function of the number of iterations, as shown in Fig. 5. We see that the proposed MIP formulation found the optimal solution much more quickly, while the conventional approach gets stuck at a local minimum. The smooth curve of the conventional approach suggests that the conventional algorithm strongly encourages convergence by forcing smaller components to be even smaller. Although Fig. 5 is just for one instance, in our repeated experiments with different random number seeds for the data, the conventional approach produced a noticeably worse solution in most cases.

In the proposed MTL framework, the most expensive step is to learn {Λ^k} (Eq. (21)). Although the MIP equation (28) incurs more computational cost than the conventional one per se, the total computational cost per iteration is dominated by Eq. (21). Thus, the smaller the number of iterations, the faster we reach the solution. We conclude that the proposed convex MIP-based mixture weight selection approach is faster and more stable than the conventional ARD approach.
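A sketch of this experimental setup, under our reading of the description: since the exact ground-truth precision matrices of Fig. 3 are not reproduced in the text, random sparse precisions stand in for them; the initialization follows Sec. 4.4 (disjoint blocks plus graphical lasso).

```python
import numpy as np
from sklearn.covariance import GraphicalLasso
from sklearn.datasets import make_sparse_spd_matrix

rng = np.random.default_rng(1)
M, K_init = 4, 10

# Stand-in ground truth: three sparse precision matrices (Fig. 3 is not reproduced here).
precisions = [make_sparse_spd_matrix(M, alpha=0.7, random_state=i) for i in range(3)]
covs = [np.linalg.inv(P) for P in precisions]
means = [np.array([5.0, 0.0, 0.0, 5.0]),
         np.array([0.0, 5.0, 5.0, 0.0]),
         np.array([0.0, 5.0, 5.0, 0.0])]
weights = [0.4, 0.3, 0.3]

# Sample the training set from the three-component mixture.
N = 3800
z = rng.choice(3, size=N, p=weights)
X = np.array([rng.multivariate_normal(means[k], covs[k]) for k in z])

# Initialization as in Sec. 4.4: split into K disjoint blocks, run graphical lasso on each block.
init_means, init_precisions = [], []
for block in np.array_split(X, K_init):
    gl = GraphicalLasso(alpha=0.01).fit(block)
    init_means.append(block.mean(axis=0))
    init_precisions.append(gl.precision_)
```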
6.2. Multi-modal graph learning: synthetic data

To illustrate how MTL-MM works, we compare the proposed method with two alternatives that can learn sparse and thus interpretable dependency structures in the MTL setting: the group graphical lasso (ggl) and fused graphical lasso (fgl) algorithms [9]. These methods find task-wise precision matrices {\Lambda^s} by maximizing

    \sum_{s=1}^{S} N^s \left\{ \ln |\Lambda^s| - \mathrm{Tr}(\hat{\mathsf{S}}^s \Lambda^s) \right\} - \eta_1 \sum_{s=1}^{S} \|\Lambda^s\|_1 - \eta_2 P(\{\Lambda^s\}),

where \hat{\mathsf{S}}^s is the sample covariance matrix of the s-th task, and

    P(\{\Lambda^s\}) = \begin{cases} \sum_{i \ne j} \sqrt{\sum_{s=1}^{S} (\Lambda_{i,j}^s)^2} & \text{(ggl)} \\ \sum_{i \ne j} \sum_{s' < s} |\Lambda_{i,j}^s - \Lambda_{i,j}^{s'}| & \text{(fgl)}. \end{cases}    (35)

Since the goal of these algorithms is to find a single Gaussian graphical model for each task, the anomaly score (2) is defined as -\ln \mathcal{N}(\cdot \mid \hat{\mu}^s, (\Lambda^s)^{-1}) for each task s, where \hat{\mu}^s is the sample mean of the s-th task.
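For reference, the two coupling penalties in Eq. (35) can be written in a few lines; the sketch below (our illustration) evaluates them for a list of task-wise precision matrices.

```python
import numpy as np
from itertools import combinations

def group_penalty(Lams):
    """Group graphical lasso penalty: sum over off-diagonal (i, j) of the l2 norm across tasks."""
    stack = np.stack(Lams)                      # shape (S, M, M)
    norms = np.sqrt((stack ** 2).sum(axis=0))   # element-wise l2 norm across tasks
    return norms.sum() - np.trace(norms)        # drop the diagonal (i = j) terms

def fused_penalty(Lams):
    """Fused graphical lasso penalty: sum of absolute off-diagonal differences over task pairs."""
    total = 0.0
    for A, B in combinations(Lams, 2):
        diff = np.abs(A - B)
        total += diff.sum() - np.trace(diff)    # off-diagonal entries only
    return total
```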

We generated a three-task (S = 3), four-variate (M = 4) synthetic data set. As shown in Fig. 6, the data were generated from three distinctive four-variate Gaussian distributions, say, A, B, and C. The first 2/3 and the remaining 1/3 of task 1 were generated by A and B, respectively. The first 1/3 and the remaining 2/3 of task 2 were generated by A and B, respectively. Task 3 was generated only with C. We also independently generated test data using the same pattern combinations.

[Figure 6. The generated data for each task and variable (panels Task 1: x1 through Task 3: x4).]

To train the MTL-MM model, we split each of the tasks into halves and used them to initialize {(m^k, Λ^k)}, resulting in K = 6 initial clusters. Upon convergence, MTL-MM gave K' = 3 non-empty clusters. We used ρ = 0.1, which was chosen as the minimizer of the overall anomaly score on the test data (see below).

Figures 7 and 8 show the learned precision matrices and the mixture weights, respectively. The three graphs in Fig. 7 precisely recover the patterns A, B, and C, from left to right, and the mixture weights are also consistent with the training data. This result confirms the capability of our algorithm to capture multi-modality.

[Figure 8. Learned mixture weights π^s for s = 1, 2, 3 (probability over the cluster index 1-6, one panel per task).]
