Variational Bayesian Sparse Additive Matrix Factorization


Machine Learning, vol. 92, no. 2–3, pp. 319–347, 2013.

Variational Bayesian Sparse Additive Matrix Factorization

Shinichi Nakajima
Nikon Corporation, Japan
nakajima.s@nikon.co.jp

Masashi Sugiyama
Tokyo Institute of Technology, Japan
sugi@cs.titech.ac.jp

S. Derin Babacan
University of Illinois at Urbana-Champaign, USA
dbabacan@illinois.edu

Abstract

Principal component analysis (PCA) approximates a data matrix with a low-rank one by imposing sparsity on its singular values. Its robust variant can cope with spiky noise by introducing an element-wise sparse term. In this paper, we extend such sparse matrix learning methods, and propose a novel framework called sparse additive matrix factorization (SAMF). SAMF systematically induces various types of sparsity by a Bayesian regularization effect called model-induced regularization. Although group LASSO also allows us to design arbitrary types of sparsity on a matrix, SAMF, which is based on the Bayesian framework, provides inference without any requirement for manual parameter tuning. We propose an efficient iterative algorithm called the mean update (MU) for the variational Bayesian approximation to SAMF, which gives the global optimal solution for a large subset of parameters in each step. We demonstrate the usefulness of our method on benchmark datasets and a foreground/background video separation problem.

Keywords: variational Bayes, robust PCA, matrix factorization, sparsity, model-induced regularization.

1 Introduction

Principal component analysis (PCA) (Hotelling, 1933) is a classical method for obtaining a low-dimensional expression of data. PCA can be regarded as approximating a data matrix with a low-rank one by imposing sparsity on its singular values. A robust variant of PCA further copes with sparse spiky noise included in observations (Candès et al., 2011; Ding et al., 2011; Babacan et al., 2012).

In this paper, we extend the idea of robust PCA and propose a more general framework called sparse additive matrix factorization (SAMF). The proposed SAMF can handle various types of sparse noise such as row-wise and column-wise sparsity, in addition to element-wise sparsity (spiky noise) and low-rank sparsity (low-dimensional expression); furthermore, their arbitrary additive combination is also allowed. In the context of robust PCA, row-wise and column-wise sparsity can capture noise observed when some sensors are broken and their outputs are always unreliable, or when some accident disturbs all sensor outputs at a time.

The flexibility of SAMF in sparsity design allows us to incorporate side information more efficiently. We show such an example in foreground/background video separation, where sparsity is induced based on image segmentation. Although group LASSO (Yuan and Lin, 2006; Raman et al., 2009) also allows arbitrary sparsity design on matrix entries, SAMF, which is based on the Bayesian framework, enables us to estimate all unknowns from observations and thus to perform inference without manual parameter tuning.

Technically, our approach induces sparsity by so-called model-induced regularization (MIR) (Nakajima and Sugiyama, 2011). MIR is an implicit regularization property of the Bayesian approach, which stems from the one-to-many (i.e., redundant) mapping between parameters and outcomes (Watanabe, 2009). In the case of matrix factorization, an observed matrix is decomposed into two redundant matrices, which was shown to induce sparsity in the singular values under the variational Bayesian approximation (Nakajima and Sugiyama, 2011).

We show that MIR in SAMF can be interpreted as automatic relevance determination (ARD) (Neal, 1996), which is a popular Bayesian approach to inducing sparsity. Nevertheless, we argue that the MIR formulation is preferable since it allows us to derive a practically useful algorithm called the mean update (MU) from a recent theoretical result (Nakajima et al., 2013): the MU algorithm is based on the variational Bayesian approximation, and gives the global optimal solution for a large subset of parameters in each step. Through experiments, we show that the MU algorithm compares favorably with a standard iterative algorithm for variational Bayesian inference.

2 Formulation

In this section, we formulate the sparse additive matrix factorization (SAMF) model.

2.1 Examples of Factorization

In standard MF, an observed matrix V ∈ R^{L×M} is modeled by a low-rank target matrix U ∈ R^{L×M} contaminated with a random noise matrix E ∈ R^{L×M}:

    V = U + E.

Then the target matrix U is decomposed into the product of two matrices A ∈ R^{M×H} and B ∈ R^{L×H}:

    U^{low-rank} = BA⊤ = Σ_{h=1}^{H} b_h a_h⊤,    (1)

where ⊤ denotes the transpose of a matrix or vector. Throughout the paper, we denote a column vector of a matrix by a bold small letter, and a row vector by a bold small letter with a tilde:

    A = (a_1, ..., a_H) = (ã_1, ..., ã_M)⊤,
    B = (b_1, ..., b_H) = (b̃_1, ..., b̃_L)⊤.

The last equation in Eq.(1) implies that the plain matrix product (i.e., BA⊤) is the sum of rank-1 components. It was elucidated that this product induces an implicit regularization effect called model-induced regularization (MIR), and that a low-rank (singular-component-wise sparse) solution is produced under the variational Bayesian approximation (Nakajima and Sugiyama, 2011).

Let us consider other types of factorization:

    U^{row} = Γ_E D = (γ_1^e d̃_1, ..., γ_L^e d̃_L)⊤,    (2)
    U^{column} = E Γ_D = (γ_1^d e_1, ..., γ_M^d e_M),    (3)

where Γ_D = diag(γ_1^d, ..., γ_M^d) ∈ R^{M×M} and Γ_E = diag(γ_1^e, ..., γ_L^e) ∈ R^{L×L} are diagonal matrices, and D, E ∈ R^{L×M}. These examples are also matrix products, but one of the factors is restricted to be diagonal. Because of this diagonal constraint, the l-th diagonal entry γ_l^e in Γ_E is shared by all the entries in the l-th row of U^{row} as a common factor. Similarly, the m-th diagonal entry γ_m^d in Γ_D is shared by all the entries in the m-th column of U^{column}.

Another example is the Hadamard (or element-wise) product:

    U^{element} = E ∗ D,  where (E ∗ D)_{l,m} = E_{l,m} D_{l,m}.    (4)

In this factorization form, no entry in E and D is shared by more than one entry in U^{element}.

In fact, the factorization forms (2)–(4) induce different types of sparsity through the MIR mechanism. In Section 2.2, they will be derived as row-wise, column-wise, and element-wise sparsity-inducing terms, respectively, within a unified framework.
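The four factorization forms (1)–(4) are easy to reproduce numerically. The following minimal NumPy sketch (ours, not from the paper; all variable names and sizes are illustrative) builds U^{low-rank}, U^{row}, U^{column}, and U^{element} from random factors, making explicit which entries share a common factor in each case.

```python
import numpy as np

rng = np.random.default_rng(0)
L, M, H = 5, 8, 2

# Low-rank term (Eq. 1): U = B A^T, a sum of H rank-1 components.
A = rng.standard_normal((M, H))
B = rng.standard_normal((L, H))
U_lowrank = B @ A.T                      # rank <= H

# Row-wise term (Eq. 2): the l-th row of D is scaled by gamma_e[l].
D = rng.standard_normal((L, M))
gamma_e = rng.standard_normal(L)
U_row = np.diag(gamma_e) @ D             # entries of a row share one factor

# Column-wise term (Eq. 3): the m-th column of E is scaled by gamma_d[m].
E = rng.standard_normal((L, M))
gamma_d = rng.standard_normal(M)
U_column = E @ np.diag(gamma_d)          # entries of a column share one factor

# Element-wise term (Eq. 4): Hadamard product, no sharing between entries.
U_element = E * D

print(np.linalg.matrix_rank(U_lowrank))  # -> H (almost surely)
```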

Figure 1: An example of SMF-term construction. G(·; X) with X: (k, l′, m′) ↦ (l, m) maps the set {U′^(k)}_{k=1}^{K} of the PR matrices to the target matrix U, so that U′^(k)_{l′,m′} = U_{X(k,l′,m′)} = U_{l,m}.

Figure 2: SMF-term construction for the row-wise (top), the column-wise (middle), and the element-wise (bottom) sparse terms.

2.2 A General Expression of Factorization

Our general expression consists of partitioning, rearrangement, and factorization. The following is the form of a sparse matrix factorization (SMF) term:

    U = G({U′^(k)}_{k=1}^{K}; X),  where  U′^(k) = B^(k) A^(k)⊤.    (5)

Here, {A^(k), B^(k)}_{k=1}^{K} are parameters to be estimated, and G(·; X): Π_{k=1}^{K} R^{L′^(k)×M′^(k)} → R^{L×M} is a designed function associated with an index mapping parameter X, which will be explained shortly.

Figure 1 shows how to construct an SMF term. First, we partition the entries of U into K parts. Then, by rearranging the entries in each part, we form partitioned-and-rearranged (PR) matrices U′^(k) ∈ R^{L′^(k)×M′^(k)} for k = 1, ..., K. Finally, each U′^(k) is decomposed into the product of A^(k) ∈ R^{M′^(k)×H′^(k)} and B^(k) ∈ R^{L′^(k)×H′^(k)}, where H′^(k) ≤ min(L′^(k), M′^(k)).

In Eq.(5), the function G(·; X) is responsible for partitioning and rearrangement: it maps the set {U′^(k)}_{k=1}^{K} of the PR matrices to the target matrix U ∈ R^{L×M}, based on the one-to-one map X: (k, l′, m′) ↦ (l, m) from the indices of the entries in {U′^(k)}_{k=1}^{K} to the indices of the entries in U, such that

    ( G({U′^(k)}_{k=1}^{K}; X) )_{X(k,l′,m′)} = U′^(k)_{l′,m′}.    (6)

Table 1: Examples of SMF terms. See the main text for details.

    Factorization    Induced sparsity    K     (L′^(k), M′^(k))    X: (k, l′, m′) ↦ (l, m)
    U = BA⊤          low-rank            1     (L, M)              X(1, l′, m′) = (l′, m′)
    U = Γ_E D        row-wise            L     (1, M)              X(k, 1, m′) = (k, m′)
    U = E Γ_D        column-wise         M     (L, 1)              X(k, l′, 1) = (l′, k)
    U = E ∗ D        element-wise        LM    (1, 1)              X(k, 1, 1) = vec-order(k)

As will be discussed in Section 4.1, the SMF-term expression (5) under the variational Bayesian approximation induces low-rank sparsity in each partition. This means that partition-wise sparsity is also induced. Accordingly, partitioning, rearrangement, and factorization should be designed in the following manner. Suppose that we are given a required sparsity structure on a matrix (examples of possible side information that suggests particular sparsity structures are given in Section 2.3). We first partition the matrix according to the required sparsity. Some partitions can be submatrices. We rearrange each of the submatrices on which we do not want to impose low-rank sparsity into a long vector (U′^(3) in the example in Figure 1). We leave the other submatrices which we want to be low-rank (U′^(2)), the vectors (U′^(1) and U′^(4)), and the scalars (U′^(5)) as they are. Finally, we factorize each of the PR matrices to induce sparsity through the MIR mechanism.

Let us, for example, assume that row-wise sparsity is required. We first make the row-wise partition, i.e., separate U ∈ R^{L×M} into L pieces of M-dimensional row vectors U′^(l) = ũ_l⊤ ∈ R^{1×M}. Then, we factorize each partition as U′^(l) = B^(l) A^(l)⊤ (see the top illustration in Figure 2). Thus, we obtain the row-wise sparse term (2). Here, X(k, 1, m′) = (k, m′) makes the following connection between Eqs.(2) and (5): γ_l^e = B^(k) ∈ R and d̃_l = A^(k) ∈ R^{M×1} for k = l. Similarly, requiring column-wise and element-wise sparsity leads to Eqs.(3) and (4), respectively (see the bottom two illustrations in Figure 2). Table 1 summarizes how to design these SMF terms, where vec-order(k) = (1 + ((k−1) mod L), ⌈k/L⌉) goes along the columns one after another in the same way as the vec operator forms a vector by stacking the columns of a matrix (in other words, (U′^(1), ..., U′^(K)) = vec(U)).
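To make the index map X of Eq.(6) and Table 1 concrete, the following sketch (our own illustration; the helper code is hypothetical and not part of the paper) partitions a small matrix U into PR matrices for the row-wise and element-wise cases and reassembles it with G, checking that the round trip reproduces U.

```python
import numpy as np

L, M = 2, 3
U = np.arange(L * M, dtype=float).reshape(L, M)

# Row-wise case (Table 1): K = L parts, each PR matrix is a 1 x M row,
# and X(k, 1, m') = (k, m').
pr_row = [U[k:k + 1, :] for k in range(L)]
G_row = np.vstack(pr_row)                      # reassemble the target matrix
assert np.array_equal(G_row, U)

# Element-wise case (Table 1): K = L*M scalar parts,
# X(k, 1, 1) = vec-order(k) = (1 + (k-1) mod L, ceil(k/L)) in 1-based indexing,
# i.e. the parts run down the columns as in the vec operator.
pr_elem = [U[(k - 1) % L, (k - 1) // L] for k in range(1, L * M + 1)]
G_elem = np.array(pr_elem).reshape(M, L).T     # undo the column stacking (vec)
assert np.array_equal(G_elem, U)
print(pr_elem)   # entries of U listed in vec (column-major) order
```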

2.3 Sparse Additive Matrix Factorization

We define a sparse additive matrix factorization (SAMF) model as the sum of SMF terms (5):

    V = Σ_{s=1}^{S} U^(s) + E,    (7)

where

    U^(s) = G({B^(k,s) A^(k,s)⊤}_{k=1}^{K^(s)}; X^(s)).    (8)

In practice, SMF terms should be designed based on side information. Suppose that V ∈ R^{L×M} consists of M samples of L-dimensional sensor outputs. In robust PCA (Candès et al., 2011; Ding et al., 2011; Babacan et al., 2012), an element-wise sparse term is added to the low-rank term, which is expected to be the clean signal, when sensor outputs are expected to contain spiky noise:

    V = U^{low-rank} + U^{element} + E.    (9)

Here, it can be said that the "expectation of spiky noise" is used as side information. Similarly, if we suspect that some sensors are broken and their outputs are unreliable over all M samples, we should prepare the row-wise sparse term to capture the expected row-wise noise, and try to keep the estimated clean signal U^{low-rank} uncontaminated by the row-wise noise:

    V = U^{low-rank} + U^{row} + E.

If we know that some accidental disturbances occurred during the observation, but do not know their exact locations (i.e., which samples are affected), the column-wise sparse term can effectively capture these disturbances.

The SMF expression (5) enables us to use side information in a more flexible way. In Section 5.4, we show that our method can be applied to a foreground/background video separation problem, where moving objects (such as a person in Figure 3) are considered to belong to the foreground. Previous approaches (Candès et al., 2011; Ding et al., 2011; Babacan et al., 2012) constructed the observation matrix V by stacking all pixels in each frame into each column (Figure 4), and fitted it by the model (9). Here, the low-rank term and the element-wise sparse term are expected to capture the static background and the moving foreground, respectively. However, we can also rely on the natural assumption that a pixel segment having similar intensity values in an image tends to belong to the same object. Based on this side information, we adopt a segment-wise sparse term, where the PR matrix is constructed using a precomputed over-segmented image (Figure 5). We will show in Section 5.4 that the segment-wise sparse term captures the foreground more accurately than the element-wise sparse term.

Let us summarize the parameters of the SAMF model (7) as follows:

    Θ = {Θ_A^(s), Θ_B^(s)}_{s=1}^{S},  where  Θ_A^(s) = {A^(k,s)}_{k=1}^{K^(s)},  Θ_B^(s) = {B^(k,s)}_{k=1}^{K^(s)}.
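As a concrete instance of the additive model (7), the sketch below (ours; the sizes, rank, and outlier fraction are arbitrary choices) generates an observation of the robust PCA form (9): a low-rank term plus an element-wise sparse term plus dense Gaussian noise.

```python
import numpy as np

rng = np.random.default_rng(1)
L, M, H, sigma = 40, 60, 3, 0.1

# Low-rank "clean signal" term U^{low-rank} = B A^T.
A = rng.standard_normal((M, H))
B = rng.standard_normal((L, H))
U_lowrank = B @ A.T

# Element-wise sparse "spiky noise" term: 5% of the entries are large outliers.
mask = rng.random((L, M)) < 0.05
U_element = mask * rng.normal(0.0, 10.0, size=(L, M))

# Observation (Eq. 9): V = U^{low-rank} + U^{element} + E.
V = U_lowrank + U_element + sigma * rng.standard_normal((L, M))
print(V.shape, mask.sum(), "outlier entries")
```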

Figure 3: Foreground/background video separation task.

Figure 4: The observation matrix V is constructed by stacking all pixels in each frame into each column.

Figure 5: Construction of a segment-wise sparse term. The original frame is pre-segmented, and the sparseness is induced in a segment-wise manner. Details are described in Section 5.4.

As in probabilistic MF (Salakhutdinov and Mnih, 2008), we assume independent Gaussian noise and priors. Thus, the likelihood and the priors are written as

    p(V | Θ) ∝ exp( −(1/(2σ²)) ‖V − Σ_{s=1}^{S} U^(s)‖²_Fro ),    (10)

    p({Θ_A^(s)}_{s=1}^{S}) ∝ exp( −(1/2) Σ_{s=1}^{S} Σ_{k=1}^{K^(s)} tr( A^(k,s) C_A^(k,s)⁻¹ A^(k,s)⊤ ) ),    (11)

    p({Θ_B^(s)}_{s=1}^{S}) ∝ exp( −(1/2) Σ_{s=1}^{S} Σ_{k=1}^{K^(s)} tr( B^(k,s) C_B^(k,s)⁻¹ B^(k,s)⊤ ) ),    (12)

where ‖·‖_Fro and tr(·) denote the Frobenius norm and the trace of a matrix, respectively. We assume that the prior covariances of A^(k,s) and B^(k,s) are diagonal and positive-definite:

    C_A^(k,s) = diag(c_{a_1}^(k,s)², ..., c_{a_H}^(k,s)²),
    C_B^(k,s) = diag(c_{b_1}^(k,s)², ..., c_{b_H}^(k,s)²).

Without loss of generality, we assume that the diagonal entries of C_A^(k,s) C_B^(k,s) are arranged in non-increasing order, i.e., c_{a_h}^(k,s) c_{b_h}^(k,s) ≥ c_{a_{h′}}^(k,s) c_{b_{h′}}^(k,s) for any pair h < h′.

2.4 Variational Bayesian Approximation

The Bayes posterior is written as

    p(Θ | V) = p(V | Θ) p(Θ) / p(V),    (13)

where p(V) = ⟨p(V | Θ)⟩_{p(Θ)} is the marginal likelihood. Here, ⟨·⟩_p denotes the expectation over the distribution p. Since the Bayes posterior (13) for matrix factorization is computationally intractable, the variational Bayesian (VB) approximation was proposed (Bishop, 1999; Lim and Teh, 2007; Ilin and Raiko, 2010; Babacan et al., 2012).

Let r(Θ), or r for short, be a trial distribution. The following functional with respect to r is called the free energy:

    F(r | V) = ⟨ log( r(Θ) / (p(V | Θ) p(Θ)) ) ⟩_{r(Θ)} = ⟨ log( r(Θ) / p(Θ | V) ) ⟩_{r(Θ)} − log p(V).    (14)

The first term is the Kullback–Leibler (KL) distance from the trial distribution to the Bayes posterior, and the second term is a constant. Therefore, minimizing the free energy (14) amounts to finding a distribution closest to the Bayes posterior in the sense of the KL distance. In the VB approximation, the free energy (14) is minimized over some restricted function space.
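For reference, here is a small sketch (ours) of the unnormalized log-likelihood (10) and log-priors (11)–(12) for a single SMF term with K^(s) = 1; the function names and the toy sizes are our own choices.

```python
import numpy as np

def log_likelihood(V, U_sum, sigma2):
    # Eq. (10), up to an additive constant: -||V - sum_s U^(s)||_Fro^2 / (2 sigma^2)
    return -np.sum((V - U_sum) ** 2) / (2.0 * sigma2)

def log_prior(F, c2):
    # Eqs. (11)/(12) for one factor matrix F (columns f_1, ..., f_H), up to a constant:
    # -tr(F C^{-1} F^T) / 2 with C = diag(c2)
    return -0.5 * np.sum(F ** 2 / c2[None, :])

rng = np.random.default_rng(2)
L, M, H = 10, 12, 3
A, B = rng.standard_normal((M, H)), rng.standard_normal((L, H))
V = B @ A.T + 0.1 * rng.standard_normal((L, M))
c2_a, c2_b = np.ones(H), np.ones(H)

lp = (log_likelihood(V, B @ A.T, sigma2=0.01)
      + log_prior(A, c2_a) + log_prior(B, c2_b))
print(lp)
```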

Following the standard VB procedure (Bishop, 1999; Lim and Teh, 2007; Babacan et al., 2012), we impose the following decomposability constraint on the posterior:

    r(Θ) = Π_{s=1}^{S} r_A^(s)(Θ_A^(s)) r_B^(s)(Θ_B^(s)).    (15)

Under this constraint, it is easy to show that the VB posterior minimizing the free energy (14) is written as

    r(Θ) = Π_{s=1}^{S} Π_{k=1}^{K^(s)} ( Π_{m′=1}^{M′^(k,s)} N_{H′^(k,s)}( ã_{m′}^(k,s); ã̂_{m′}^(k,s), Σ_A^(k,s) ) · Π_{l′=1}^{L′^(k,s)} N_{H′^(k,s)}( b̃_{l′}^(k,s); b̃̂_{l′}^(k,s), Σ_B^(k,s) ) ),    (16)

where N_d(·; μ, Σ) denotes the d-dimensional Gaussian distribution with mean μ and covariance Σ.

3 Algorithm for SAMF

In this section, we first present a theorem that reduces a partial SAMF problem to the standard MF problem, which can be solved analytically. Then we derive an algorithm for the entire SAMF problem.

3.1 Key Theorem

Let us denote the mean of U^(s), defined in Eq.(8), over the VB posterior by

    Û^(s) = ⟨U^(s)⟩_{r_A^(s)(Θ_A^(s)) r_B^(s)(Θ_B^(s))} = G({B̂^(k,s) Â^(k,s)⊤}_{k=1}^{K^(s)}; X^(s)).    (17)

Then we obtain the following theorem (its proof is given in Appendix A):

Theorem 1 Given {Û^(s′)}_{s′≠s} and the noise variance σ², the VB posterior of (Θ_A^(s), Θ_B^(s)) = {A^(k,s), B^(k,s)}_{k=1}^{K^(s)} coincides with the VB posterior of the following MF model:

    p(Z′^(k,s) | A^(k,s), B^(k,s)) ∝ exp( −(1/(2σ²)) ‖Z′^(k,s) − B^(k,s) A^(k,s)⊤‖²_Fro ),    (18)

    p(A^(k,s)) ∝ exp( −(1/2) tr( A^(k,s) C_A^(k,s)⁻¹ A^(k,s)⊤ ) ),    (19)

    p(B^(k,s)) ∝ exp( −(1/2) tr( B^(k,s) C_B^(k,s)⁻¹ B^(k,s)⊤ ) ),    (20)

for each k = 1, ..., K^(s). Here, Z′^(k,s) ∈ R^{L′^(k,s)×M′^(k,s)} is defined by

    Z′^(k,s)_{l′,m′} = Z^(s)_{X^(s)(k,l′,m′)},  where  Z^(s) = V − Σ_{s′≠s} Û^(s′).    (21)

The left formula in Eq.(21) relates the entries of Z^(s) ∈ R^{L×M} to the entries of {Z′^(k,s) ∈ R^{L′^(k,s)×M′^(k,s)}}_{k=1}^{K^(s)} by using the map X^(s): (k, l′, m′) ↦ (l, m) (see Eq.(6) and Figure 1).

Theorem 1 states that a partial problem of SAMF—finding the posterior of (A^(k,s), B^(k,s)) for each k = 1, ..., K^(s), given {Û^(s′)}_{s′≠s} and σ²—can be solved in the same way as standard VBMF, for which the global analytic solution is available (Nakajima et al., 2013). Based on this theorem, we will propose a useful algorithm in the following subsections.

The noise variance σ² is also unknown in many applications. To estimate σ², we can use the following lemma (its proof is also included in Appendix A):

Lemma 1 Given the VB posterior for {Θ_A^(s), Θ_B^(s)}_{s=1}^{S}, the noise variance σ² minimizing the free energy (14) is given by

    σ² = (1/(LM)) { ‖V‖²_Fro − 2 Σ_{s=1}^{S} tr( Û^(s)⊤ ( V − Σ_{s′=s+1}^{S} Û^(s′) ) )
         + Σ_{s=1}^{S} Σ_{k=1}^{K^(s)} tr( ( Â^(k,s)⊤ Â^(k,s) + M′^(k,s) Σ_A^(k,s) ) ( B̂^(k,s)⊤ B̂^(k,s) + L′^(k,s) Σ_B^(k,s) ) ) }.    (22)
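The following is a minimal sketch (ours) of the noise-variance update of Lemma 1 for the simplest case of a single low-rank term (S = 1, K^(1) = 1), where Eq.(22) reduces to the familiar VB matrix-factorization expression; A_hat, B_hat, Sigma_A, and Sigma_B stand for the current posterior means and covariances.

```python
import numpy as np

def update_sigma2(V, A_hat, B_hat, Sigma_A, Sigma_B):
    """Eq. (22) specialized to a single low-rank term (S = K = 1).

    A_hat: (M, H) posterior mean of A,  Sigma_A: (H, H) posterior covariance
    B_hat: (L, H) posterior mean of B,  Sigma_B: (H, H) posterior covariance
    """
    L, M = V.shape
    U_hat = B_hat @ A_hat.T
    # Posterior second moment <||U||^2> of the term.
    second_moment = np.trace(
        (A_hat.T @ A_hat + M * Sigma_A) @ (B_hat.T @ B_hat + L * Sigma_B)
    )
    return (np.sum(V ** 2) - 2.0 * np.trace(U_hat.T @ V) + second_moment) / (L * M)
```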

3.2 Partial Analytic Solution

Theorem 1 allows us to use the results given in Nakajima et al. (2013), which provide the global analytic solution for VBMF. Although the free energy of VBMF is also non-convex, Nakajima et al. (2013) showed that the minimizers can be written as a reweighted singular value decomposition. This allows one to solve the minimization problem separately for each singular component, which facilitated the analysis. By finding all stationary points and calculating the free energy on them, they successfully obtained an analytic form of the global VBMF solution.

Combining Theorem 1 above and Theorems 3–5 in Nakajima et al. (2013), we obtain the following corollaries:

Corollary 1 Assume that L′^(k,s) ≤ M′^(k,s) for all (k, s), and that {Û^(s′)}_{s′≠s} and the noise variance σ² are given. Let γ_h^(k,s) (≥ 0) be the h-th largest singular value of Z′^(k,s), and let ω_{a_h}^(k,s) and ω_{b_h}^(k,s) be the associated right and left singular vectors:

    Z′^(k,s) = Σ_{h=1}^{L′^(k,s)} γ_h^(k,s) ω_{b_h}^(k,s) ω_{a_h}^(k,s)⊤.    (23)

Let γ̂_h^(k,s)second be the second largest real solution of the following quartic equation with respect to t:

    f_h^(k,s)(t) := t⁴ + ξ₃^(k,s) t³ + ξ₂^(k,s) t² + ξ₁^(k,s) t + ξ₀^(k,s) = 0,    (24)

where the coefficients are defined by

    ξ₃^(k,s) = (L′^(k,s) − M′^(k,s))² γ_h^(k,s) / (L′^(k,s) M′^(k,s)),
    ξ₂^(k,s) = −( ξ₃^(k,s) γ_h^(k,s) + (L′^(k,s)² + M′^(k,s)²) η_h^(k,s)² / (L′^(k,s) M′^(k,s)) + 2σ⁴ / (c_{a_h}^(k,s)² c_{b_h}^(k,s)²) ),
    ξ₁^(k,s) = ξ₃^(k,s) √(ξ₀^(k,s)),
    ξ₀^(k,s) = ( η_h^(k,s)² − σ⁴ / (c_{a_h}^(k,s)² c_{b_h}^(k,s)²) )²,

with

    η_h^(k,s)² = ( 1 − σ² M′^(k,s) / γ_h^(k,s)² )( 1 − σ² L′^(k,s) / γ_h^(k,s)² ) γ_h^(k,s)².

Let

    γ̃_h^(k,s) = √( τ + √( τ² − L′^(k,s) M′^(k,s) σ⁴ ) ),  where  τ = (L′^(k,s) + M′^(k,s)) σ² / 2 + σ⁴ / ( 2 c_{a_h}^(k,s)² c_{b_h}^(k,s)² ).    (25)

Then, the global VB solution can be expressed as

    Û′^(k,s)VB = ( B̂^(k,s) Â^(k,s)⊤ )^VB = Σ_{h=1}^{H′^(k,s)} γ̂_h^(k,s)VB ω_{b_h}^(k,s) ω_{a_h}^(k,s)⊤,
    where  γ̂_h^(k,s)VB = γ̂_h^(k,s)second  if  γ_h^(k,s) > γ̃_h^(k,s),  and  γ̂_h^(k,s)VB = 0  otherwise.    (26)
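The sketch below (ours) mirrors Eqs.(24)–(26) as reconstructed above for a single singular component: it computes the truncation threshold (25) and, when the observed singular value exceeds it, takes the second largest real root of the quartic (24) as the shrunken amplitude. The coefficient expressions follow the reconstruction above, so this should be treated as an illustration rather than a reference implementation.

```python
import numpy as np

def vb_amplitude(gamma, Lp, Mp, sigma2, ca2cb2):
    """Global VB solution for one singular component (Eqs. 24-26, as written above).

    gamma  : observed singular value of Z'
    Lp, Mp : dimensions of the PR matrix (Lp <= Mp assumed)
    sigma2 : noise variance
    ca2cb2 : product of prior variances c_a^2 * c_b^2 for this component
    """
    # Truncation threshold (Eq. 25).
    tau = (Lp + Mp) * sigma2 / 2.0 + sigma2 ** 2 / (2.0 * ca2cb2)
    gamma_tilde = np.sqrt(tau + np.sqrt(tau ** 2 - Lp * Mp * sigma2 ** 2))
    if gamma <= gamma_tilde:
        return 0.0
    # Quartic coefficients (Eq. 24) and its second largest real root.
    eta2 = (1.0 - sigma2 * Mp / gamma ** 2) * (1.0 - sigma2 * Lp / gamma ** 2) * gamma ** 2
    xi3 = (Lp - Mp) ** 2 * gamma / (Lp * Mp)
    xi2 = -(xi3 * gamma + (Lp ** 2 + Mp ** 2) * eta2 / (Lp * Mp) + 2.0 * sigma2 ** 2 / ca2cb2)
    xi0 = (eta2 - sigma2 ** 2 / ca2cb2) ** 2
    xi1 = xi3 * np.sqrt(xi0)
    roots = np.roots([1.0, xi3, xi2, xi1, xi0])
    real_roots = np.sort(roots[np.abs(roots.imag) < 1e-9].real)
    return real_roots[-2]   # second largest real solution
```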

Corollary 2 Assume that L′^(k,s) ≤ M′^(k,s) for all (k, s). Given {Û^(s′)}_{s′≠s} and the noise variance σ², the global empirical VB solution (where the hyperparameters {C_A^(k,s), C_B^(k,s)} are also estimated from observation) is given by

    Û′^(k,s)EVB = Σ_{h=1}^{H′^(k,s)} γ̂_h^(k,s)EVB ω_{b_h}^(k,s) ω_{a_h}^(k,s)⊤,
    where  γ̂_h^(k,s)EVB = γ̆_h^(k,s)VB  if  γ_h^(k,s) > γ̲_h^(k,s)  and  Δ_h^(k,s) ≤ 0,  and  γ̂_h^(k,s)EVB = 0  otherwise.    (27)

Here,

    γ̲_h^(k,s) = ( √(L′^(k,s)) + √(M′^(k,s)) ) σ,    (28)

    c̆_h^(k,s)² = ( 1 / (2 L′^(k,s) M′^(k,s)) ) { γ_h^(k,s)² − (L′^(k,s) + M′^(k,s)) σ²
                 + √( ( γ_h^(k,s)² − (L′^(k,s) + M′^(k,s)) σ² )² − 4 L′^(k,s) M′^(k,s) σ⁴ ) },    (29)

    Δ_h^(k,s) = M′^(k,s) log( γ_h^(k,s) γ̆_h^(k,s)VB / (M′^(k,s) σ²) + 1 )
               + L′^(k,s) log( γ_h^(k,s) γ̆_h^(k,s)VB / (L′^(k,s) σ²) + 1 )
               + ( 1/σ² )( −2 γ_h^(k,s) γ̆_h^(k,s)VB + L′^(k,s) M′^(k,s) c̆_h^(k,s)² ),    (30)

and γ̆_h^(k,s)VB is the VB solution given by Corollary 1 for c_{a_h}^(k,s) c_{b_h}^(k,s) = c̆_h^(k,s).

Corollary 3 Assume that L′^(k,s) ≤ M′^(k,s) for all (k, s). Given {Û^(s′)}_{s′≠s} and the noise variance σ², the VB posteriors are given by

    r_A^(k,s)VB(A^(k,s)) = Π_{h=1}^{H′^(k,s)} N_{M′^(k,s)}( a_h^(k,s); â_h^(k,s), σ_{a_h}^(k,s)² I_{M′^(k,s)} ),
    r_B^(k,s)VB(B^(k,s)) = Π_{h=1}^{H′^(k,s)} N_{L′^(k,s)}( b_h^(k,s); b̂_h^(k,s), σ_{b_h}^(k,s)² I_{L′^(k,s)} ),

where, for γ̂_h^(k,s)VB being the solution given by Corollary 1, the posterior means are

    â_h^(k,s) = ± √( γ̂_h^(k,s)VB δ̂_h^(k,s) ) ω_{a_h}^(k,s),
    b̂_h^(k,s) = ± √( γ̂_h^(k,s)VB / δ̂_h^(k,s) ) ω_{b_h}^(k,s),

and the posterior variances σ_{a_h}^(k,s)² and σ_{b_h}^(k,s)², together with the auxiliary quantities δ̂_h^(k,s) and η̂_h^(k,s)², are obtained in closed form from the results of Nakajima et al. (2013) in terms of γ_h^(k,s), γ̂_h^(k,s)VB, L′^(k,s), M′^(k,s), σ², c_{a_h}^(k,s), and c_{b_h}^(k,s), with the two cases distinguished by whether γ_h^(k,s) exceeds the threshold γ̃_h^(k,s) of Eq.(25).
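A companion sketch (ours) of the empirical VB decision of Corollary 2 for a single component, reusing vb_amplitude from the previous sketch for γ̆^VB; again the formulas follow Eqs.(28)–(30) as written above, and the function is meant only as an illustration.

```python
import numpy as np
# Assumes vb_amplitude() from the previous sketch is in scope.

def evb_amplitude(gamma, Lp, Mp, sigma2):
    """Global empirical VB solution for one singular component (Eqs. 27-30)."""
    # Eq. (28): below this bound the component is always discarded.
    gamma_lower = (np.sqrt(Lp) + np.sqrt(Mp)) * np.sqrt(sigma2)
    if gamma <= gamma_lower:
        return 0.0
    # Eq. (29): estimated product of prior variances c_a^2 c_b^2.
    c2 = (gamma ** 2 - (Lp + Mp) * sigma2
          + np.sqrt((gamma ** 2 - (Lp + Mp) * sigma2) ** 2 - 4.0 * Lp * Mp * sigma2 ** 2)
          ) / (2.0 * Lp * Mp)
    # VB amplitude with c_a^2 c_b^2 = c2 (Corollary 1).
    gamma_breve = vb_amplitude(gamma, Lp, Mp, sigma2, c2)
    # Eq. (30): free-energy difference; keep the component only if Delta <= 0.
    delta = (Mp * np.log(gamma * gamma_breve / (Mp * sigma2) + 1.0)
             + Lp * np.log(gamma * gamma_breve / (Lp * sigma2) + 1.0)
             + (-2.0 * gamma * gamma_breve + Lp * Mp * c2) / sigma2)
    return gamma_breve if delta <= 0.0 else 0.0
```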

Algorithm 1 Mean update (MU) algorithm for (empirical) VB SAMF.
1: Initialization: Û^(s) = 0_{(L,M)} for s = 1, ..., S; σ² = ‖V‖²_Fro / (LM).
2: for s = 1 to S do
3:   The (empirical) VB solution of U′^(k,s) = B^(k,s) A^(k,s)⊤ for each k = 1, ..., K^(s), given {Û^(s′)}_{s′≠s}, is computed by Corollary 1 (Corollary 2).
4:   Û^(s) = G({B̂^(k,s) Â^(k,s)⊤}_{k=1}^{K^(s)}; X^(s)).
5: end for
6: σ² is estimated by Lemma 1, given the VB posterior on {Θ_A^(s), Θ_B^(s)}_{s=1}^{S} (computed by Corollary 3).
7: Repeat Steps 2 to 6 until convergence.

Note that the corollaries above assume that L′^(k,s) ≤ M′^(k,s) for all (k, s). However, we can easily obtain the result for the case when L′^(k,s) > M′^(k,s) by considering the transpose Û′^(k,s)⊤ of the solution. Also, we can always take the mapping X^(s) so that L′^(k,s) ≤ M′^(k,s) holds for all (k, s) without any practical restriction. This eases the implementation of the algorithm.

When σ² is known, Corollary 1 and Corollary 2 provide the global analytic solution of the partial problem, where the variables on which {Û^(s′)}_{s′≠s} depend are fixed. Note that they give the global analytic solution for single-term (S = 1) SAMF.

3.3 Mean Update Algorithm

Using Corollaries 1–3 and Lemma 1, we propose an algorithm for SAMF, which we call the mean update (MU) algorithm. We describe its pseudo-code in Algorithm 1, where 0_{(d₁,d₂)} denotes the d₁ × d₂ matrix with all entries equal to zero. Note that, under the empirical Bayesian framework, all unknown parameters are estimated from observation, which allows inference without manual parameter tuning.

The MU algorithm is similar in spirit to the backfitting algorithm (Hastie and Tibshirani, 1986; D'Souza et al., 2004), where each additive term is updated to fit a dummy target. In the MU algorithm, Z^(s) defined in Eq.(21) corresponds to the dummy target in the backfitting algorithm.
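The following is a high-level Python sketch (ours) of Algorithm 1. The per-partition solver, the partitioning map, and its inverse are passed in as placeholders ('solve', 'split', and 'assemble' are hypothetical names, e.g., built from the sketches above); the σ² update is simplified relative to Lemma 1 in that it ignores the posterior-covariance terms.

```python
import numpy as np

def mean_update(V, terms, n_iter=50):
    """Sketch of Algorithm 1 (mean update) for VB SAMF.

    V     : (L, M) observed matrix
    terms : list of dicts, one per SMF term, each providing
            'split'    -- maps an (L, M) matrix to its list of PR matrices
            'assemble' -- inverse map from PR estimates back to an (L, M) matrix
            'solve'    -- per-partition (empirical) VB solver, e.g. Corollary 1/2
    """
    L, M = V.shape
    U_hat = [np.zeros((L, M)) for _ in terms]          # step 1: initialization
    sigma2 = np.sum(V ** 2) / (L * M)
    for _ in range(n_iter):
        for s, term in enumerate(terms):               # steps 2-5
            Z = V - sum(U_hat[:s] + U_hat[s + 1:])     # dummy target, Eq. (21)
            pr_estimates = [term['solve'](Zk, sigma2) for Zk in term['split'](Z)]
            U_hat[s] = term['assemble'](pr_estimates)
        residual = V - sum(U_hat)                      # step 6 (simplified: the
        sigma2 = np.sum(residual ** 2) / (L * M)       # covariance terms of Eq. 22 are dropped)
    return U_hat, sigma2
```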

Although each of the corollaries and the lemma above guarantees the global optimality for each step, the MU algorithm does not generally guarantee the simultaneous global optimality over the entire parameter space. Nevertheless, experimental results in Section 5 show that the MU algorithm performs very well in practice.

When Corollary 1 or Corollary 2 is applied in Step 3 of Algorithm 1, a singular value decomposition (23) of Z′^(k,s), defined in Eq.(21), is required. However, for many practical SMF terms, including the row-wise, the column-wise, and the element-wise terms as well as the segment-wise term (which will be defined in Section 5.4), Z′^(k,s) ∈ R^{L′^(k,s)×M′^(k,s)} is a vector or a scalar, i.e., L′^(k,s) = 1 or M′^(k,s) = 1. In such cases, the singular value and the singular vectors are given simply by

    γ_1^(k,s) = ‖Z′^(k,s)‖,  ω_{a_1}^(k,s) = Z′^(k,s)⊤ / ‖Z′^(k,s)‖,  ω_{b_1}^(k,s) = 1,   if L′^(k,s) = 1,
    γ_1^(k,s) = ‖Z′^(k,s)‖,  ω_{a_1}^(k,s) = 1,  ω_{b_1}^(k,s) = Z′^(k,s) / ‖Z′^(k,s)‖,   if M′^(k,s) = 1.

4 Discussion

In this section, we first relate MIR to ARD. Then, we introduce the standard VB iteration for SAMF, which is used as a baseline in the experiments. After that, we discuss related previous work, and the limitations of the current work.

4.1 Relation between MIR and ARD

The MIR effect (Nakajima and Sugiyama, 2011) induced by factorization actually has a close connection to the automatic relevance determination (ARD) effect (Neal, 1996). Assume C_A = I_H, where I_d denotes the d-dimensional identity matrix, in the plain MF model (18)–(20) (here we omit the suffixes k and s for brevity), and consider the following transformation: BA⊤ ↦ U ∈ R^{L×M}. Then the likelihood (18) and the prior (19) on A are rewritten as

    p(Z′ | U) ∝ exp( −(1/(2σ²)) ‖Z′ − U‖²_Fro ),    (31)

    p(U | B) ∝ exp( −(1/2) tr( U⊤ (BB⊤)† U ) ),    (32)

where † denotes the Moore–Penrose generalized inverse of a matrix. The prior (20) on B is kept unchanged. p(U | B) in Eq.(32) is the so-called ARD prior with the covariance hyperparameter BB⊤ ∈ R^{L×L}. It is known that this induces the ARD effect, i.e., the empirical Bayesian procedure, where the prior covariance hyperparameter BB⊤ is also estimated from observation, induces strong regularization and sparsity (Neal, 1996); see also Efron and Morris (1973) for a simple Gaussian case.

In the current context, Eq.(32) induces low-rank sparsity on U if no restriction on BB⊤ is imposed. Similarly, we can show that (γ_l^e)² in Eq.(2), (γ_m^d)² in Eq.(3), and E_{l,m}² in Eq.(4) act as prior variances shared by the entries in ũ_l ∈ R^M, u_m ∈ R^L, and U_{l,m} ∈ R, respectively. This explains the mechanism by which the factorization forms in Eqs.(2)–(4) induce row-wise, column-wise, and element-wise sparsity, respectively.

When we employ the SMF-term expression (5), MIR occurs in each partition. Therefore, partition-wise sparsity and low-rank sparsity within each partition are observed. Corollaries 1 and 2 theoretically support this fact: small singular values are discarded by thresholding in Eqs.(26) and (27).

4.2 Standard VB Iteration

Following the standard procedure for the VB approximation (Bishop, 1999; Lim and Teh, 2007; Babacan et al., 2012), we can derive the following algorithm, which we call the standard VB iteration:

    Â^(k,s) = σ⁻² Z′^(k,s)⊤ B̂^(k,s) Σ_A^(k,s),    (33)
    Σ_A^(k,s) = σ² ( B̂^(k,s)⊤ B̂^(k,s) + L′^(k,s) Σ_B^(k,s) + σ² C_A^(k,s)⁻¹ )⁻¹,    (34)
    B̂^(k,s) = σ⁻² Z′^(k,s) Â^(k,s) Σ_B^(k,s),    (35)
    Σ_B^(k,s) = σ² ( Â^(k,s)⊤ Â^(k,s) + M′^(k,s) Σ_A^(k,s) + σ² C_B^(k,s)⁻¹ )⁻¹.    (36)

Iterating Eqs.(33)–(36) for each (k, s) in turn until convergence gives a local minimum of the free energy (14).

In the empirical Bayesian scenario, the hyperparameters {C_A^(k,s), C_B^(k,s)}_{k=1, s=1}^{K^(s), S} are also estimated from observations. The following update rules give a local minimum of the free energy:

    c_{a_h}^(k,s)² = ‖â_h^(k,s)‖² / M′^(k,s) + (Σ_A^(k,s))_{hh},    (37)
    c_{b_h}^(k,s)² = ‖b̂_h^(k,s)‖² / L′^(k,s) + (Σ_B^(k,s))_{hh}.    (38)

When the noise variance σ² is unknown, it is estimated by Eq.(22) in each iteration.

The standard VB iteration is computationally efficient since only a single parameter in {Â^(k,s), Σ_A^(k,s), B̂^(k,s), Σ_B^(k,s), c_{a_h}^(k,s)², c_{b_h}^(k,s)²}_{k=1, s=1}^{K^(s), S} is updated in each step. However, it is known that the standard VB iteration is prone to suffer from the local minima problem (Nakajima et al., 2013). On the other hand, although the MU algorithm also does not guarantee the global optimality as a whole, it simultaneously gives the global optimal solution for the set {Â^(k,s), Σ_A^(k,s), B̂^(k,s), Σ_B^(k,s), c_{a_h}^(k,s)², c_{b_h}^(k,s)²}_{k=1}^{K^(s)} for each s in each step. In Section 5, we will experimentally show that the MU algorithm tends to give a better solution (i.e., one with a smaller free energy) than the standard VB iteration.
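For comparison with the MU algorithm, here is a sketch (ours) of one sweep of the standard VB iteration (33)–(38) for a single PR matrix; the covariances are updated before the corresponding means, which is one valid ordering of the fixed-point updates.

```python
import numpy as np

def standard_vb_sweep(Z, A_hat, B_hat, Sigma_A, Sigma_B, c2_a, c2_b, sigma2):
    """One pass of Eqs. (33)-(38) for a single PR matrix Z of size (Lp, Mp)."""
    Lp, Mp = Z.shape
    # Eqs. (33)-(34): update the posterior of A.
    Sigma_A = sigma2 * np.linalg.inv(
        B_hat.T @ B_hat + Lp * Sigma_B + sigma2 * np.diag(1.0 / c2_a))
    A_hat = (Z.T @ B_hat @ Sigma_A) / sigma2
    # Eqs. (35)-(36): update the posterior of B.
    Sigma_B = sigma2 * np.linalg.inv(
        A_hat.T @ A_hat + Mp * Sigma_A + sigma2 * np.diag(1.0 / c2_b))
    B_hat = (Z @ A_hat @ Sigma_B) / sigma2
    # Eqs. (37)-(38): empirical Bayes update of the prior variances.
    c2_a = np.sum(A_hat ** 2, axis=0) / Mp + np.diag(Sigma_A)
    c2_b = np.sum(B_hat ** 2, axis=0) / Lp + np.diag(Sigma_B)
    return A_hat, B_hat, Sigma_A, Sigma_B, c2_a, c2_b
```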

4.3 Related Work

As is widely known, traditional PCA is sensitive to outliers in data and generally fails in their presence. Robust PCA (Candès et al., 2011) was developed to cope with large outliers that are not modeled within traditional PCA. Unlike methods based on robust statistics (Huber and Ronchetti, 2009; Fischler and Bolles, 1981; Torre and Black, 2003; Ke and Kanade, 2005; Gao, 2008; Luttinen et al., 2009; Lakshminarayanan et al., 2011), Candès et al. (2011) explicitly modeled the spiky noise with an additional element-wise sparse term (see Eq.(9)). This model can also be applied to applications where the task is to estimate the element-wise sparse term itself (as opposed to discarding it as noise). A typical such application is foreground/background video separation (Figure 3).

The original formulation of robust PCA is non-Bayesian, and the sparsity is induced by ℓ1-norm regularization. Although its solution can be efficiently obtained via the augmented Lagrange multiplier (ALM) method (Lin et al., 2009), there are unknown algorithmic parameters that should be carefully tuned to obtain its best performance. Employing a Bayesian formulation addresses this issue: a sampling-bas
