Fast Epigraphical Projection-based Incremental Algorithms for Wasserstein Distributionally Robust Support Vector Machine


Jiajin Li
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong
jjli@se.cuhk.edu.hk

Caihua Chen
School of Management and Engineering, Nanjing University
chchen@nju.edu.cn

Anthony Man-Cho So
Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong
manchoso@se.cuhk.edu.hk

*Corresponding author

Abstract

Wasserstein Distributionally Robust Optimization (DRO) is concerned with finding decisions that perform well on data that are drawn from the worst-case probability distribution within a Wasserstein ball centered at a certain nominal distribution. In recent years, it has been shown that various DRO formulations of learning models admit tractable convex reformulations. However, most existing works propose to solve these convex reformulations by general-purpose solvers, which are not well-suited for tackling large-scale problems. In this paper, we focus on a family of Wasserstein distributionally robust support vector machine (DRSVM) problems and propose two novel epigraphical projection-based incremental algorithms to solve them. The updates in each iteration of these algorithms can be computed in a highly efficient manner. Moreover, we show that the DRSVM problems considered in this paper satisfy a Hölderian growth condition with explicitly determined growth exponents. Consequently, we are able to establish the convergence rates of the proposed incremental algorithms. Our numerical results indicate that the proposed methods are orders of magnitude faster than the state-of-the-art, and the performance gap grows considerably as the problem size increases.

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

1 Introduction

Wasserstein distance-based distributionally robust optimization (DRO) has recently received significant attention in the machine learning community. This can be attributed to its ability to improve generalization performance by robustifying the learning model against unseen data [13, 22]. The DRO approach offers a principled way to regularize empirical risk minimization problems and provides a transparent probabilistic interpretation of a wide range of existing regularization techniques; see, e.g., [4, 10, 22] and the references therein. Moreover, many representative distributionally robust learning models admit equivalent reformulations as tractable convex programs via strong duality [22, 12, 18, 27, 11].

Currently, a standard approach to solving these reformulations is to use off-the-shelf solvers such as YALMIP or CPLEX. However, these general-purpose solvers do not scale well with the problem size. Such a state of affairs greatly limits the use of the DRO methodology in machine learning applications and naturally motivates the study of the algorithmic aspects of DRO. In this paper, we aim to design fast iterative methods for solving a family of Wasserstein distributionally robust support vector machine (DRSVM) problems. The SVM is one of the most frequently used classification methods and has enjoyed notable empirical successes in machine learning and data analysis [25, 23]. However, even for this seemingly simple learning model, there are very few works addressing the development of fast algorithms for its Wasserstein DRO formulation, which takes the form

$$\inf_{w} \left\{ \frac{c}{2}\|w\|_2^2 + \sup_{Q \in \mathbb{B}_{\varepsilon}^{p}(\hat{\mathbb{P}}_n)} \mathbb{E}_{(x,y)\sim Q}\left[\ell_w(x,y)\right] \right\}$$

and can be reformulated as

$$\min_{w,\lambda}\ \lambda\varepsilon + \frac{1}{n}\sum_{i=1}^{n} \max\left\{1 - w^T z_i,\ 1 + w^T z_i - \lambda\kappa,\ 0\right\} + \frac{c}{2}\|w\|_2^2, \quad \text{s.t.}\ \|w\|_q \le \lambda, \tag{1}$$

where $\frac{1}{p} + \frac{1}{q} = 1$; see [12, Theorem 2] and [22, Theorem 3.11]. Problem (1) arises from the vanilla soft-margin SVM model. Here, $\frac{c}{2}\|w\|_2^2$ is the regularization term with $c \ge 0$; $x \in \mathbb{R}^d$ denotes a feature vector and $y \in \{-1, +1\}$ is the associated binary label; $\ell_w(x,y) = \max\{1 - yw^T x, 0\}$ is the hinge loss w.r.t. the feature-label pair $(x,y)$ and learning parameter $w \in \mathbb{R}^d$; $\{(\hat{x}_i, \hat{y}_i)\}_{i=1}^{n}$ are $n$ training samples independently and identically drawn from an unknown distribution $\mathbb{P}^*$ on the feature-label space $\mathcal{Z} = \mathbb{R}^d \times \{-1, +1\}$ and $z_i = \hat{x}_i \hat{y}_i$; $\hat{\mathbb{P}}_n = \frac{1}{n}\sum_{i=1}^{n} \delta_{(\hat{x}_i, \hat{y}_i)}$ is the empirical distribution associated with the training samples; $\mathbb{B}_{\varepsilon}^{p}(\hat{\mathbb{P}}_n) = \{Q \in \mathcal{P}(\mathcal{Z}) : W_p(Q, \hat{\mathbb{P}}_n) \le \varepsilon\}$ is the ambiguity set defined on the space of probability distributions $\mathcal{P}(\mathcal{Z})$, centered at the empirical distribution $\hat{\mathbb{P}}_n$ with radius $\varepsilon > 0$ w.r.t. the $\ell_p$ norm-induced Wasserstein distance

$$W_p(Q, \hat{\mathbb{P}}_n) = \inf_{\Pi \in \mathcal{P}(\mathcal{Z}\times\mathcal{Z})} \left\{ \int_{\mathcal{Z}\times\mathcal{Z}} d_p(\xi, \xi')\, \Pi(d\xi, d\xi') : \Pi(d\xi, \mathcal{Z}) = Q(d\xi),\ \Pi(\mathcal{Z}, d\xi') = \hat{\mathbb{P}}_n(d\xi') \right\},$$

where $\xi = (x, y) \in \mathcal{Z}$, $p \ge 1$, and $d_p(\xi, \xi') = \|x - x'\|_p + \frac{\kappa}{2}|y - y'|$ is the transport cost between two data points $\xi, \xi' \in \mathcal{Z}$, with $\kappa > 0$ representing the relative emphasis between feature mismatch and label uncertainty. In particular, the larger the $\kappa$, the more reliable are the labels; see [22, 15] for further details. Intuitively, if the ambiguity set $\mathbb{B}_{\varepsilon}^{p}(\hat{\mathbb{P}}_n)$ contains the ground-truth distribution $\mathbb{P}^*$, then the estimator $w^*$ obtained from an optimal solution to (1) will be less sensitive to unseen feature-label pairs.
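As a concrete reading of the reformulation, the following NumPy sketch (our own illustration, not code from the paper; the function and variable names are ours) evaluates the objective of (1) at a candidate pair $(w, \lambda)$:

```python
import numpy as np

def drsvm_objective(w, lam, Z, eps, kappa, c=0.0, q=2):
    """Evaluate the objective of the reformulated DRSVM problem (1).

    Z     : (n, d) array whose i-th row is z_i = x_i * y_i
    eps   : Wasserstein ball radius (epsilon > 0)
    kappa : label-flip cost parameter (kappa > 0)
    c     : ridge regularization weight (c >= 0)
    q     : dual-norm index, one of 1, 2, np.inf (conjugate to p)
    """
    # Feasibility of the epigraph constraint ||w||_q <= lambda.
    if np.linalg.norm(w, ord=q) > lam + 1e-12:
        return np.inf  # infeasible point
    scores = Z @ w
    # Pointwise maximum of the three affine pieces of the loss.
    loss = np.maximum.reduce([1.0 - scores,
                              1.0 + scores - lam * kappa,
                              np.zeros_like(scores)])
    return lam * eps + loss.mean() + 0.5 * c * np.dot(w, w)
```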
In the works [12, 18], the authors proposed cutting surface-based methods to solve the $\ell_p$-DRSVM problem (1). However, in their implementation, they still need to invoke off-the-shelf solvers for certain tasks. Recently, researchers have proposed to use stochastic (sub)gradient descent to tackle a class of Wasserstein DRO problems [5, 24]. Nevertheless, the results in [5, 24] do not apply to the $\ell_p$-DRSVM problem (1), as they require $\kappa = \infty$; i.e., the labels are error-free. Moreover, the transport cost $d_p$ does not satisfy the strong convexity-type condition in [5, Assumption 1] or [24, Assumption A]. On another front, the authors of [15] introduced an ADMM-based first-order algorithmic framework to deal with the Wasserstein distributionally robust logistic regression problem. Though the framework in [15] can be extended to handle the $\ell_p$-DRSVM problem (1), it has two main drawbacks. First, under the framework, the optimal $\lambda^*$ of problem (1) is found by a one-dimensional search, where each update involves fixing $\lambda$ to a given value and solving for the corresponding optimal $w^*(\lambda)$ (which we refer to as the $w$-subproblem). Since the number of $w$-subproblems that arise during the search can be large, the framework is computationally rather demanding. Second, the $w$-subproblem is solved by an ADMM-type algorithm, which involves both primal and dual updates. In order to establish a fast (e.g., linear) convergence rate guarantee for such an algorithm, one typically requires a regularity condition on the set of primal-dual optimal pairs of the problem at hand. Unfortunately, it is not clear whether the $\ell_p$-DRSVM problem (1) satisfies such a primal-dual regularity condition.

To overcome these drawbacks, we propose two new epigraphical projection-based incremental algorithms for solving the $\ell_p$-DRSVM problem (1), which tackle the variables $(w, \lambda)$ jointly. We focus on the commonly used $\ell_1$, $\ell_2$, and $\ell_\infty$ norm-induced transport costs, which correspond to $q \in \{1, 2, \infty\}$. Our first algorithm is the incremental projected subgradient descent (ISG) method, whose efficiency inherits from that of the projection onto the epigraph $\{(w, \lambda) : \|w\|_q \le \lambda\}$ of the $\ell_q$ norm (with $q \in \{1, 2, \infty\}$).

The second is the incremental proximal point algorithm (IPPA). Although in general IPPA is less sensitive to the choice of initial step size and can achieve better accuracy than ISG [16], in the context of the $\ell_p$-DRSVM problem (1), each iteration of IPPA requires solving the following subproblem, which we refer to as the single-sample proximal point update:

$$\min_{w,\lambda}\ \max\left\{1 - w^T z_i,\ 1 + w^T z_i - \lambda\kappa,\ 0\right\} + \frac{1}{2\alpha}\left(\|w - \bar{w}\|_2^2 + (\lambda - \bar{\lambda})^2\right), \quad \text{s.t.}\ \|w\|_q \le \lambda. \tag{2}$$

Here, $\alpha > 0$ is the step size, $q \in \{1, 2, \infty\}$, and $\bar{w}, \bar{\lambda}$ are given. By carefully exploiting the problem structure, we develop exceptionally efficient solutions to (2). Specifically, we show in Section 3 that the optimal solution to (2) admits an analytic form when $q = 2$ and can be computed by a fast algorithm based on a parametric approach and a modified secant method (cf. [9]) when $q = 1$ or $\infty$.

Next, we investigate the convergence behavior of the proposed ISG and IPPA when applied to problem (1). Our main tool is the following regularity notion:

Definition 1 (Hölderian growth condition [6]) A function $f : \mathbb{R}^m \to \mathbb{R}$ is said to satisfy a Hölderian growth condition on the domain $\Omega \subseteq \mathbb{R}^m$ if there exist constants $\theta \in [0, 1]$ and $\sigma > 0$ such that

$$\mathrm{dist}(x, \mathcal{X}^*) \le \sigma^{-1}\left(f(x) - f^*\right)^{\theta}, \quad \forall x \in \Omega, \tag{3}$$

where $\mathcal{X}^*$ denotes the optimal set of $\min_{x\in\Omega} f(x)$ and $f^*$ is the optimal value. The condition (3) is known as sharpness when $\theta = 1$ and quadratic growth (QG) when $\theta = \frac{1}{2}$; see, e.g., [7].

We show that for different choices of $q \in \{1, 2, \infty\}$ and $c \ge 0$, the DRSVM problem (1) satisfies either the sharpness condition or the QG condition; see Table 1. With the exception of the case $q \in \{1, \infty\}$, where the sharpness (resp. QG) of (1) when $c = 0$ (resp. $c > 0$) essentially follows from [8, Theorem 3.5] (resp. [28, Proposition 6]), the results on the Hölderian growth of problem (1) are new. Consequently, by choosing step sizes that decay at a suitable rate, we establish, for the first time, the fast sublinear (i.e., $O(\frac{1}{k})$) or linear (i.e., $O(\rho^k)$) convergence rate of the proposed incremental algorithms when applied to the DRSVM problem (1); see Table 1.

Table 1: Convergence rates of incremental algorithms for $\ell_p$-DRSVM

| $q$ | $c$ | Hölderian growth | Step size scheme | Convergence rate |
|---|---|---|---|---|
| $q \in \{1, \infty\}$ | $c = 0$ | Sharp [8, Theorem 3.5] | $\alpha_{k+1} = \rho\alpha_k$, $\rho \in (0,1)$ | $O(\rho^k)$ |
| $q \in \{1, \infty\}$ | $c > 0$ | QG [28, Proposition 6] | $\alpha_k = \frac{\gamma}{nk}$, $\gamma > 0$ | $O(\frac{1}{k})$ |
| $q = 2$ | $c = 0$ | Sharp (BLR) | $\alpha_{k+1} = \rho\alpha_k$, $\rho \in (0,1)$ | $O(\rho^k)$ |
| $q = 2$ | $c = 0$ | Not known | $\alpha_k = \frac{\gamma}{n\sqrt{k}}$, $\gamma > 0$ | $O(\frac{1}{\sqrt{k}})$ |
| $q = 2$ | $c > 0$ | QG (BLR) | $\alpha_k = \frac{\gamma}{nk}$, $\gamma > 0$ | $O(\frac{1}{k})$ |
| $q = 2$ | $c > 0$ | Not known | $\alpha_k = \frac{\gamma}{n\sqrt{k}}$, $\gamma > 0$ | $O(\frac{1}{\sqrt{k}})$ |

BLR: The result holds under the assumption of bounded linear regularity (BLR) (see Definition 2).
Not known: Without BLR, it is not known whether the Hölderian growth condition holds.

Lastly, we demonstrate the efficiency of our proposed methods through extensive numerical experiments on both synthetic and real data sets. It is worth mentioning that our proposed algorithms can be easily extended to an asynchronous decentralized parallel setting and thus can further meet the requirements of large-scale applications.

2 Epigraphical Projection-based Incremental Algorithms

In this section, we present our incremental algorithms for solving the $\ell_p$-DRSVM problem. For simplicity, we focus on the case $c = 0$ in what follows. Our technical development can be extended to handle the general case $c > 0$ by noting that the subproblems corresponding to the cases $c = 0$ and $c > 0$ share the same structure. To begin, observe that the $\ell_p$-DRSVM problem (1) with $c = 0$ can be written compactly as

$$\min_{\|w\|_q \le \lambda}\ \frac{1}{n}\sum_{i=1}^{n} f_i(w, \lambda), \tag{4}$$

where $f_i(w, \lambda) = \lambda\varepsilon + \max\{1 - w^T z_i,\ 1 + w^T z_i - \lambda\kappa,\ 0\}$ is a piecewise affine function.
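Since each $f_i$ is a pointwise maximum of three affine functions, a subgradient is obtained by picking any active piece. The sketch below (our own illustration, not code from the paper) returns one such subgradient, which is exactly what the ISG update introduced next consumes:

```python
import numpy as np

def subgrad_fi(w, lam, z_i, eps, kappa):
    """One subgradient of f_i(w, lam) = lam*eps
    + max{1 - w'z_i, 1 + w'z_i - lam*kappa, 0} at (w, lam)."""
    s = float(w @ z_i)
    pieces = (1.0 - s, 1.0 + s - lam * kappa, 0.0)
    j = int(np.argmax(pieces))  # index of an active affine piece
    if j == 0:    # piece 1 - w'z_i is active
        return -z_i, eps
    elif j == 1:  # piece 1 + w'z_i - lam*kappa is active
        return z_i, eps - kappa
    else:         # zero piece is active
        return np.zeros_like(z_i), eps
```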

Since problem (4) possesses the vanilla finite-sum structure with a single epigraphical projection constraint, a natural and widely adopted approach to tackling it is to use incremental algorithms. Roughly speaking, such algorithms select one mini-batch of component functions from the objective in (4) at a time based on a certain cyclic order and use the selected functions to update the current iterate. We shall focus on the following two incremental algorithms for solving the DRSVM problem (1). Here, $k$ is the epoch index (i.e., the $k$-th time going through the cyclic order) and $\alpha_k > 0$ is the step size in the $k$-th epoch.

Incremental Mini-batch Projected Subgradient Algorithm (ISG)

$$(w_{i+1}^k, \lambda_{i+1}^k) = \mathrm{proj}_{\{\|w\|_q \le \lambda\}}\left((w_i^k, \lambda_i^k) - \alpha_k g_i^k\right), \tag{5}$$

where $g_i^k$ is a subgradient of $\frac{1}{|B_i|}\sum_{j \in B_i} f_j$ at $(w_i^k, \lambda_i^k)$ and $B_i \subseteq \{1, \ldots, n\}$ is the $i$-th mini-batch.

Incremental Proximal Point Algorithm (IPPA)

$$(w_{i+1}^k, \lambda_{i+1}^k) = \arg\min_{\|w\|_q \le \lambda}\ \left\{ f_i(w, \lambda) + \frac{1}{2\alpha_k}\left(\|w - w_i^k\|_2^2 + (\lambda - \lambda_i^k)^2\right) \right\}, \tag{6}$$

where $(w_n^k, \lambda_n^k) = (w_0^{k+1}, \lambda_0^{k+1})$.

Now, a natural question is how to solve the subproblems (5) and (6) efficiently. As it turns out, the key lies in an efficient implementation of the $\ell_q$ norm epigraphical projection (with $q \in \{1, 2, \infty\}$). Indeed, such a projection appears explicitly in the ISG update (5) and, as we shall see later, plays a vital role in the design of fast iterative algorithms for the single-sample proximal point update (6). To begin, we note that the $\ell_2$ norm epigraphical projection $\mathrm{proj}_{\{\|w\|_2 \le \lambda\}}$ has a well-known analytic solution; see [1, Theorem 3.3.6]. Next, the $\ell_1$ norm epigraphical projection $\mathrm{proj}_{\{\|w\|_1 \le \lambda\}}$ can be found in linear time using the quick-select algorithm; see [26]. Lastly, the $\ell_\infty$ norm epigraphical projection $\mathrm{proj}_{\{\|w\|_\infty \le \lambda\}}$ can be computed in linear time via the Moreau decomposition

$$\mathrm{proj}_{\{\|w\|_\infty \le \lambda\}}(x, s) = (x, s) + \mathrm{proj}_{\{\|w\|_1 \le \lambda\}}(-x, -s).$$

From the above discussion, we see that the ISG update (5) can be computed efficiently. In the next section, we discuss how these epigraphical projections can be used to perform the single-sample proximal point update (6) in an efficient manner.
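As an illustration of these building blocks, here is a NumPy sketch of the three projections (our own code, not the paper's): the $\ell_2$ case is the standard second-order-cone projection formula, the $\ell_1$ case uses a simple sort-based thresholding (which is $O(d \log d)$; the quick-select routine of [26] achieves linear time), and the $\ell_\infty$ case applies the Moreau decomposition above.

```python
import numpy as np

def proj_epi_l2(x, s):
    """Projection onto {(w, lam): ||w||_2 <= lam} (second-order cone)."""
    nx = np.linalg.norm(x)
    if nx <= s:
        return x.copy(), s            # already feasible
    if nx <= -s:
        return np.zeros_like(x), 0.0  # (x, s) is in the polar cone
    t = 0.5 * (nx + s)
    return (t / nx) * x, t

def proj_epi_l1(x, s):
    """Projection onto {(w, lam): ||w||_1 <= lam}, sort-based.

    The optimality conditions give w = soft_threshold(x, theta) and
    lam = s + theta for some theta >= 0 with ||w||_1 = s + theta.
    """
    a = np.abs(x)
    if a.sum() <= s:
        return x.copy(), s
    u = np.sort(a)[::-1]              # |x| in decreasing order
    cums = np.cumsum(u)
    ks = np.arange(1, len(u) + 1)
    thetas = (cums - s) / (ks + 1.0)  # candidate multipliers
    idx = np.nonzero(thetas < u)[0]   # valid breakpoints
    if idx.size == 0:
        return np.zeros_like(x), 0.0  # (x, s) is in the polar cone
    theta = thetas[idx[-1]]           # largest valid k
    w = np.sign(x) * np.maximum(a - theta, 0.0)
    return w, s + theta

def proj_epi_linf(x, s):
    """Projection onto {(w, lam): ||w||_inf <= lam} via the Moreau
    decomposition: proj_K(v) = v + proj_{K*}(-v)."""
    w1, l1 = proj_epi_l1(-x, -s)
    return x + w1, s + l1
```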

3 Fast Algorithms for Single-Sample Proximal Point Update (6)

Analytic solution for q = 2. We begin with the case $q = 2$. By combining the terms $\lambda\varepsilon$ and $\frac{1}{2\alpha_k}(\lambda - \lambda_i^k)^2$ in (6), we see that the single-sample proximal point update takes the form (cf. (2))

$$\min_{w,\lambda}\ \underbrace{\max\left\{1 - w^T z_i,\ 1 + w^T z_i - \lambda\kappa,\ 0\right\}}_{h_i(w,\lambda)} + \frac{1}{2\alpha}\left(\|w - \bar{w}\|_2^2 + (\lambda - \bar{\lambda})^2\right), \quad \text{s.t.}\ \|w\|_2 \le \lambda. \tag{7}$$

The main difficulty of (7) lies in the piecewise affine term $h_i$. To handle this term, let $h_{i,1}(w, \lambda) = 1 - w^T z_i$, $h_{i,2}(w, \lambda) = 1 + w^T z_i - \lambda\kappa$, and $h_{i,3}(w, \lambda) = 0$, so that $h_i = \max_{j\in\{1,2,3\}} h_{i,j}$. Observe that if $(w^*, \lambda^*)$ is an optimal solution to (7), then there could only be one, two, or three affine pieces in $h_i$ that are active at $(w^*, \lambda^*)$; i.e., $\Gamma^* = \{j : h_i(w^*, \lambda^*) = h_{i,j}(w^*, \lambda^*)\} \subseteq \{1, 2, 3\}$. This suggests that we can find $(w^*, \lambda^*)$ by exhausting these possibilities. Due to space limitation, we only give an outline of our strategy here. The details can be found in the Appendix.

We start with the case $|\Gamma^*| = 1$. For $j = 1, 2, 3$, consider the following problem, which corresponds to the subcase where $h_{i,j}$ is the only active affine piece:

$$\min_{w,\lambda}\ h_{i,j}(w, \lambda) + \frac{1}{2\alpha}\left(\|w - \bar{w}\|_2^2 + (\lambda - \bar{\lambda})^2\right), \quad \text{s.t.}\ \|w\|_2 \le \lambda. \tag{8}$$

Since $h_{i,j}$ is affine in $(w, \lambda)$, it is easy to verify that problem (8) reduces to an $\ell_2$ norm epigraphical projection, which admits an analytic solution, say $(\hat{w}_j, \hat{\lambda}_j)$. If there exists a $j' \in \{1, 2, 3\}$ such that $h_{i,j'}(\hat{w}_{j'}, \hat{\lambda}_{j'}) \ge h_{i,j}(\hat{w}_{j'}, \hat{\lambda}_{j'})$ for $j \ne j'$, then we know that $(\hat{w}_{j'}, \hat{\lambda}_{j'})$ is optimal for (7) and hence we can terminate the process. Otherwise, we proceed to the case $|\Gamma^*| = 2$ and consider, for $1 \le j < j' \le 3$, the following problem, which corresponds to the subcase where $h_{i,j}$ and $h_{i,j'}$ are the only two active affine pieces:

$$\min_{w,\lambda}\ h_{i,j}(w, \lambda) + \frac{1}{2\alpha}\left(\|w - \bar{w}\|_2^2 + (\lambda - \bar{\lambda})^2\right), \quad \text{s.t.}\ h_{i,j}(w, \lambda) = h_{i,j'}(w, \lambda),\ \|w\|_2 \le \lambda. \tag{9}$$

As shown in the Appendix (Proposition 6.2), the optimal solution to (9) can be found by solving a univariate quartic equation, which can be done efficiently. Now, let $(\hat{w}_{(j,j')}, \hat{\lambda}_{(j,j')})$ be the optimal solution to (9). If there exist $j, j'$ with $1 \le j < j' \le 3$ such that $h_{i,j}(\hat{w}_{(j,j')}, \hat{\lambda}_{(j,j')}) = h_{i,j'}(\hat{w}_{(j,j')}, \hat{\lambda}_{(j,j')}) \ge h_{i,j''}(\hat{w}_{(j,j')}, \hat{\lambda}_{(j,j')})$ with $j'' \in \{1, 2, 3\} \setminus \{j, j'\}$, then $(\hat{w}_{(j,j')}, \hat{\lambda}_{(j,j')})$ is optimal for (7) and we can terminate the process. Otherwise, we proceed to the case $|\Gamma^*| = 3$. In this case, we consider the problem

$$\min_{w,\lambda}\ \frac{1}{2\alpha}\left(\|w - \bar{w}\|_2^2 + (\lambda - \bar{\lambda})^2\right), \quad \text{s.t.}\ h_{i,1}(w, \lambda) = h_{i,2}(w, \lambda) = h_{i,3}(w, \lambda),\ \|w\|_2 \le \lambda,$$

which reduces to

$$\min_{w}\ \frac{1}{2\alpha}\|w - \bar{w}\|_2^2, \quad \text{s.t.}\ w^T z_i = 1,\ \|w\|_2 \le \frac{2}{\kappa}. \tag{10}$$

It can be shown that problem (10) admits an analytic solution $\hat{w}$; see the Appendix (Proposition 6.4). Then, the pair $(\hat{w}, \frac{2}{\kappa})$ is an optimal solution to (7).
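To make the case analysis concrete, here is a sketch of the $|\Gamma^*| = 1$ screening step (our own illustration, reusing proj_epi_l2 from the projection sketch in Section 2): each subproblem (8) is a shifted $\ell_2$ epigraph projection, since adding an affine function inside a proximal term only translates the projected point. The $|\Gamma^*| = 2$ quartic step (9) and the $|\Gamma^*| = 3$ step (10) are omitted here; they follow the appendix propositions.

```python
import numpy as np

def prox_q2_single_piece(w_bar, lam_bar, z_i, kappa, alpha, tol=1e-10):
    """Screening step for the q = 2 update (7): try each subcase
    |Gamma*| = 1 via a shifted l2 epigraph projection (problem (8)) and
    return the solution if its optimality check passes. Returns None if
    a two- or three-piece subcase must be examined instead."""
    # Gradients (c_w, c_lam) of the three affine pieces h_{i,j}.
    pieces = [(-z_i, 0.0),                 # h_{i,1} = 1 - w'z_i
              (z_i, -kappa),               # h_{i,2} = 1 + w'z_i - lam*kappa
              (np.zeros_like(z_i), 0.0)]   # h_{i,3} = 0

    def h(w, lam):
        s = float(w @ z_i)
        return np.array([1.0 - s, 1.0 + s - lam * kappa, 0.0])

    for j, (cw, cl) in enumerate(pieces):
        w_hat, lam_hat = proj_epi_l2(w_bar - alpha * cw,
                                     lam_bar - alpha * cl)
        vals = h(w_hat, lam_hat)
        if vals[j] >= vals.max() - tol:  # piece j dominates: optimal for (7)
            return w_hat, lam_hat
    return None  # fall through to the |Gamma*| = 2 / 3 subcases (Appendix)
```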
Fast iterative algorithm for q = 1. The high-level idea is similar to that for the case $q = 2$; i.e., we systematically go through all valid subcollections of the affine pieces in $h_i$ and test whether they can be active at the optimal solution to the single-sample proximal point update. The main difference here is that the subproblems arising from the subcollections do not necessarily admit analytic solutions. To overcome this difficulty, we propose a modified secant algorithm (cf. [9]) to search for the optimal dual multiplier of the subproblem and use it to recover the optimal solution to the original subproblem via $\ell_1$ norm epigraphical projection. Again, we give an outline of our strategy here and relegate the details to the Appendix.

To begin, we rewrite the single-sample proximal point update (6) for the case $q = 1$ as

$$\begin{aligned} \min_{w,\lambda,\mu}\ & \mu + \frac{1}{2\alpha}\left(\|w - \bar{w}\|_2^2 + (\lambda - \bar{\lambda})^2\right) \\ \text{s.t.}\ & h_{i,j}(w, \lambda) \le \mu,\ j = 1, 2, 3;\quad \|w\|_1 \le \lambda. \end{aligned} \tag{11}$$

For a reason that will become clear in a moment, we shall not go through the cases $|\Gamma^*| = 1, 2, 3$ as before. Instead, consider first the case where $h_{i,3}$ is inactive. If $h_{i,1}$ is also inactive, then we consider the problem $\min_{w,\lambda}\ h_{i,2}(w, \lambda) + \frac{1}{2\alpha}(\|w - \bar{w}\|_2^2 + (\lambda - \bar{\lambda})^2)$, which, by the affine nature of $h_{i,2}$, is equivalent to an $\ell_1$ norm epigraphical projection. If $h_{i,1}$ is active, then we consider the problem

$$\min_{w,\lambda}\ h_{i,1}(w, \lambda) + \frac{1}{2\alpha}\left(\|w - \bar{w}\|_2^2 + (\lambda - \bar{\lambda})^2\right), \quad \text{s.t.}\ h_{i,2}(w, \lambda) \le h_{i,1}(w, \lambda),\ \|w\|_1 \le \lambda. \tag{12}$$

Note that $h_{i,2}$ can be active or inactive, and the constraint $h_{i,2}(w, \lambda) \le h_{i,1}(w, \lambda)$ allows us to treat both possibilities simultaneously. Hence, we do not need to tackle them separately as we did in the case $q = 2$. Observe that problem (12) can be cast into the form

$$\min_{w,\lambda}\ \frac{1}{2\alpha}\left(\|w - \bar{w}\|_2^2 + (\lambda - \bar{\lambda})^2\right), \quad \text{s.t.}\ w^T z + a\lambda \le b\ (\leftarrow \sigma \ge 0),\ \|w\|_1 \le \lambda, \tag{13}$$

where, with an abuse of notation, we use $\bar{w} \in \mathbb{R}^d$, $\bar{\lambda} \in \mathbb{R}$ here again and caution the reader that they are different from those in (12), and $z = z_i$, $a = -\frac{\kappa}{2}$, $b = 0$. Before we discuss how to solve the subproblem (13), let us note that it arises in the case where $h_{i,3}$ is active as well. Indeed, if $h_{i,3}$ is active and $h_{i,1}$ is inactive, then we have $z = z_i$, $a = -\kappa$, $b = -1$, which corresponds to the constraint $h_{i,2}(w, \lambda) \le h_{i,3}(w, \lambda)$ and covers the possibilities that $h_{i,2}$ is active and inactive. On the other hand, if $h_{i,3}$ is active and $h_{i,2}$ is inactive, then we have $z = -z_i$, $a = 0$, $b = -1$, which corresponds to the constraint $h_{i,1}(w, \lambda) \le h_{i,3}(w, \lambda)$ and covers the possibilities that $h_{i,1}$ is active and inactive. The only remaining case is when $h_{i,1}, h_{i,2}, h_{i,3}$ are all active. In this case, we consider problem (10) with $\|w\|_2 \le \frac{2}{\kappa}$ replaced by $\|w\|_1 \le \frac{2}{\kappa}$.

As shown in the Appendix, such a problem can be tackled using the technique for solving (13). We go through the above cases sequentially and terminate the process if the solution to the subproblem in any one of the cases satisfies the optimality conditions of (11).

Now, let us return to the issue of solving (13). The main idea is to perform a one-dimensional search on the dual variable $\sigma$ to find the optimal dual multiplier $\sigma^*$. Specifically, consider the problem

$$\min_{\|w\|_1 \le \lambda}\ \frac{1}{2\alpha}\left(\|w - \bar{w}\|_2^2 + (\lambda - \bar{\lambda})^2\right) + \sigma\left(w^T z + a\lambda - b\right). \tag{14}$$

Let $(\hat{w}(\sigma), \hat{\lambda}(\sigma))$ be the optimal solution to (14) and define the function $p : \mathbb{R}_+ \to \mathbb{R}$ by $p(\sigma) = \hat{w}(\sigma)^T z + a\hat{\lambda}(\sigma) - b$. Inspired by [17], we establish the following monotonicity property of $p$, which will be crucial to our development of an extremely efficient algorithm for solving (13) later.

Proposition 3.1 If $\sigma$ satisfies (i) $\sigma = 0$ and $p(\sigma) \le 0$, or (ii) $\sigma > 0$ and $p(\sigma) = 0$, then $(\hat{w}(\sigma), \hat{\lambda}(\sigma))$ is the optimal solution to (13). Moreover, $p$ is continuous and monotonically non-increasing on $\mathbb{R}_+$.

In view of Proposition 3.1, we first check whether $p(0) \le 0$ via an $\ell_1$ norm epigraphical projection. If not, then we search for the $\sigma^* > 0$ that satisfies $p(\sigma^*) = 0$ by the secant method, with some special modifications designed to speed up its convergence [9]. Let us now give a high-level description of our modified secant method. We refer the reader to the Appendix (Algorithm 1) for details.

At the beginning of a generic iteration of the method, we have an interval $[\sigma_l, \sigma_u]$ that contains $\sigma^*$, with $r_l = p(\sigma_l) > 0$ and $r_u = p(\sigma_u) < 0$. The initial interval can be found by considering the optimality conditions of (11) (i.e., $\sigma^* \in [0, 1]$). We then take a secant step to get a new point $\sigma$ with $r = p(\sigma)$ and perform the update on $\sigma_l, \sigma_u$ as follows. Suppose that $r > 0$. Then, since $p$ is monotonically non-increasing, $\sigma^*$ lies in $[\sigma, \sigma_u]$, and we update $\sigma_l \leftarrow \sigma$, $r_l \leftarrow r$; the case $r < 0$ is handled symmetrically.
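To illustrate the parametric search, the sketch below (our own code; the paper's Algorithm 1 adds acceleration tricks that we omit) solves (13) with a plain secant iteration safeguarded by bisection. It uses the fact that (14) is itself a shifted $\ell_1$ epigraph projection: completing the square gives $(\hat{w}(\sigma), \hat{\lambda}(\sigma)) = \mathrm{proj}_{\{\|w\|_1 \le \lambda\}}(\bar{w} - \alpha\sigma z,\ \bar{\lambda} - \alpha\sigma a)$. The routine reuses proj_epi_l1 from the projection sketch in Section 2 and assumes, per the optimality conditions of (11), that $\sigma^* \in [0, 1]$; the tolerance and iteration cap are our own choices.

```python
import numpy as np

def solve_13(w_bar, lam_bar, z, a, b, alpha, tol=1e-10, max_iter=100):
    """Solve (13) by a one-dimensional search on the dual multiplier sigma.

    p(sigma) = w_hat(sigma)'z + a*lam_hat(sigma) - b, where (w_hat,
    lam_hat) solves (14), i.e. an l1 epigraph projection of the shifted
    point (w_bar - alpha*sigma*z, lam_bar - alpha*sigma*a).
    """
    def solution(sigma):
        return proj_epi_l1(w_bar - alpha * sigma * z,
                           lam_bar - alpha * sigma * a)

    def p(sigma):
        w_hat, lam_hat = solution(sigma)
        return float(w_hat @ z) + a * lam_hat - b

    r0 = p(0.0)
    if r0 <= 0.0:               # Proposition 3.1(i): sigma* = 0
        return solution(0.0)
    # Bracket sigma* in [0, 1] (optimality conditions of (11)).
    sig_l, r_l = 0.0, r0
    sig_u, r_u = 1.0, p(1.0)
    sig = 0.5 * (sig_l + sig_u)
    for _ in range(max_iter):
        # Secant step, safeguarded by bisection if it leaves the bracket.
        if r_u != r_l:
            sig = sig_l - r_l * (sig_u - sig_l) / (r_u - r_l)
        else:
            sig = 0.5 * (sig_l + sig_u)
        if not (sig_l < sig < sig_u):
            sig = 0.5 * (sig_l + sig_u)
        r = p(sig)
        if abs(r) <= tol:       # Proposition 3.1(ii): sigma* found
            break
        if r > 0:
            sig_l, r_l = sig, r  # root lies to the right
        else:
            sig_u, r_u = sig, r  # root lies to the left
    return solution(sig)
```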
