{"title": "A New Alternating Direction Method for Linear Programming", "book": "Advances in Neural Information Processing Systems", "page_first": 1480, "page_last": 1488, "abstract": "It is well known that, for a linear program (LP) with constraint matrix $\\mathbf{A}\\in\\mathbb{R}^{m\\times n}$, the Alternating Direction Method of Multipliers converges globally and linearly at a rate $O((\\|\\mathbf{A}\\|_F^2+mn)\\log(1/\\epsilon))$. However, such a rate depends on the problem dimension, and the algorithm exhibits a slow and fluctuating ``tail convergence'' in practice. In this paper, we propose a new variable splitting method for LP and prove that our method has a convergence rate of $O(\\|\\mathbf{A}\\|^2\\log(1/\\epsilon))$. The proof is based on simultaneously estimating the distance from a pair of primal-dual iterates to the optimal primal and dual solution sets by certain residuals. In practice, our method yields a new first-order LP solver that can exploit both the sparsity and the specific structure of the matrix $\\mathbf{A}$, and achieves a significant speedup on important problems such as basis pursuit, inverse covariance matrix estimation, L1 SVM and nonnegative matrix factorization compared with the current fastest LP solvers.", "full_text": "A New Alternating Direction Method for Linear Programming\n\nSinong Wang\nDepartment of ECE\nThe Ohio State University\nwang.7691@osu.edu\n\nNess Shroff\nDepartment of ECE and CSE\nThe Ohio State University\nshroff.11@osu.edu\n\nAbstract\n\nIt is well known that, for a linear program (LP) with constraint matrix A ∈ R^{m×n}, the Alternating Direction Method of Multipliers converges globally and linearly at a rate O((‖A‖_F² + mn) log(1/ε)). However, such a rate depends on the problem dimension, and the algorithm exhibits a slow and fluctuating “tail convergence” in practice. 
In this paper, we propose a new variable splitting method for LP and prove that our method has a convergence rate of O(‖A‖² log(1/ε)). The proof is based on simultaneously estimating the distance from a pair of primal-dual iterates to the optimal primal and dual solution sets by certain residuals. In practice, our method yields a new first-order LP solver that can exploit both the sparsity and the specific structure of the matrix A, and achieves a significant speedup on important problems such as basis pursuit, inverse covariance matrix estimation, L1 SVM and nonnegative matrix factorization compared with the current fastest LP solvers.\n\n1 Introduction\n\nWe are interested in applying the Alternating Direction Method of Multipliers (ADMM) to solve a linear program (LP) of the form\n\nmin_{x∈R^n} c^T x s.t. Ax = b, x_i ≥ 0, i ∈ [n_b],    (1)\n\nwhere c ∈ R^n, A ∈ R^{m×n} is the constraint matrix, b ∈ R^m and [n_b] = {1, . . . , n_b}. This problem plays a major role in numerical optimization and has been used in a large variety of application areas. For example, several important machine learning problems, including nonnegative matrix factorization (NMF) [1], l1-regularized SVM [2], sparse inverse covariance matrix estimation (SICE) [3], basis pursuit (BP) [4], and MAP inference [5], can be cast into an LP setting.\nThe complexity of traditional LP solvers, such as the Interior Point Method (IPM) with a weighted path-finding strategy, is still at least quadratic in the problem dimension. However, many recent problems in machine learning involve extremely large-scale data but exhibit a sparse structure, i.e., nnz(A) ≪ mn, where nnz(A) is the number of non-zero elements in the constraint matrix A. This characteristic severely limits the ability of the IPM or Simplex technique to solve these problems. 
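As a concrete instance of the reduction to the canonical form (1), basis pursuit min ‖x‖_1 s.t. Ax = b becomes an LP by splitting x = x⁺ − x⁻ with x⁺, x⁻ ≥ 0. The sketch below is illustrative only: the random problem data is made up, and scipy.optimize.linprog serves merely as a generic off-the-shelf baseline, not as any of the solvers discussed in this paper.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n = 5, 12
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n)  # ensures the system Ax = b is feasible

# Basis pursuit min ||x||_1 s.t. Ax = b, written in form (1):
# variables u = [x+; x-] >= 0, cost 1^T u, equality constraint [A, -A] u = b.
c = np.ones(2 * n)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=(0, None), method="highs")

x = res.x[:n] - res.x[n:]       # recover x from the split variables
assert res.success and np.allclose(A @ x, b, atol=1e-6)
```

Note that this doubling of variables is exactly the overhead that the splitting in Section 3 avoids for l1 objectives.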
On the other hand, first-order methods have received extensive attention recently due to their ability to deal with large data sets. These methods require a matrix-vector multiplication Ax in each iteration, with complexity linear in nnz(A). However, the key challenge in designing a first-order algorithm is that LPs are usually non-smooth and non-strongly convex optimization problems (they may not have a unique solution). Utilizing the standard primal and dual stochastic sub-gradient descent methods results in an extremely slow convergence rate, i.e., O(1/ε²) [6].\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nThe ADMM was first developed in 1975 [7], and since then there have been several LP solvers based on this technique. Compared with the traditional Augmented Lagrangian Method (ALM), this method splits the variable into several blocks and optimizes the augmented Lagrangian (AL) function in a Gauss-Seidel fashion, which often results in relatively easier subproblems. However, this method suffers from slow convergence when the number of blocks increases. Moreover, the challenge of applying the ADMM to the LP is that the LP problem does not exhibit an explicit separable structure among variables, which are therefore difficult to split in the traditional sense. The notable work [8] first applies the ADMM to solve the LP by augmenting the original n-dimensional variables into nm dimensions, so that the resultant augmented Lagrangian function is separable among n blocks of variables. They prove that this method converges globally and linearly. However, the rate of this method depends on the problem dimensions m, n, and the method converges quite slowly when m, n are large. 
Thus, they leave an open question of whether other efficient splitting methods exist whose convergence analysis lives in a space of lower dimension m or n.\nIn this paper, we propose a new splitting method for LP, which splits the equality and inequality constraints into two blocks. The resultant subproblems in each iteration are a linear system with a positive definite matrix and n one-dimensional truncation operations. We prove that our new method converges globally and linearly at a faster rate than the method in [8]. Specifically, the main contributions of this paper can be summarized as follows: (i) We show that the existing ADMM in [8] exhibits a slow and fluctuating “tail convergence”, and provide a theoretical understanding of why this phenomenon occurs. (ii) We propose a new ADMM method for LP and provide a new analysis of its linear convergence rate, which only involves O(m + n)-dimensional iterates. This result answers the open question posed in [8]. (iii) We show that when the matrix A possesses certain specific structures, the resultant subproblem can be solved in closed form. For a general constraint matrix A, we design an efficiently implemented Accelerated Coordinate Descent Method (ACDM) to solve the subproblem in O(log(1/ε)nnz(A)) time. (iv) Practically, we show that our proposed algorithm significantly speeds up solving the basis pursuit, l1-regularized SVM, sparse inverse covariance matrix estimation, and nonnegative matrix factorization problems compared with the existing splitting method [8] and the current fastest first-order LP solver in [9].\n\n2 Preliminaries\n\nIn this section, we first review several definitions that will be used in the sequel. Then we illustrate some observations from the existing method. 
We also include in the Appendix several LP-based machine learning problems that can be cast into the LP setting.\n\n2.1 Notation\n\nA twice differentiable function f : R^n → R has strong convexity parameter σ if and only if its Hessian satisfies ∇²f(x) ⪰ σI, ∀x. We use ‖·‖ to denote the standard l2 norm for vectors and the spectral norm for matrices, ‖·‖_1 to denote the l1 norm, and ‖·‖_F to denote the Frobenius norm. A twice differentiable function f : R^n → R has a component-wise Lipschitz continuous gradient with constants L_i if and only if ‖∇_i f(x) − ∇_i f(y)‖ ≤ L_i‖x − y‖, ∀x, y. For example, for the quadratic function F(x) = (1/2)‖Ax − b‖², the gradient is ∇F(x) = A^T(Ax − b) and the Hessian is ∇²F(x) = A^T A. Hence the parameters σ and L_i satisfy (choosing y = x + te_i, where t ∈ R and e_i ∈ R^n is the i-th unit vector) x^T A^T Ax ≥ σ‖x‖² and tA_i^T Ae_i ≤ L_i|t|, ∀x, t. Thus σ is the smallest eigenvalue of A^T A and L_i = ‖A_i‖², where A_i is the i-th column of the matrix A. The projection of a point x onto a convex set S is defined as [x]_S = arg min_{u∈S} ‖x − u‖. If S is the non-negative cone, we write [x]_+ ≜ [x]_S. Let V_i = [0, ∞) for i ∈ [n_b] and V_i = R for the remaining (free) coordinates.\n\n2.2 Tail Convergence of the Existing ADMM Method\n\nThe existing ADMM in [8] solves the LP (1) by the following procedure: in each iteration k, go through the following two steps:\n\n1. Primal update: x_i^{k+1} = [ x_i^k + (1/‖A_i‖²)( A_i^T(b − Ax^k)/q − (c_i − A_i^T z^k)/λ ) ]_{V_i}, i = 1, . . . , n.\n2. Dual update: z^{k+1} = z^k − (λ/q)(Ax^{k+1} − b).\n\nWe plot the solving accuracy versus the number of iterations for three kinds of problems (see Fig. 1 in the Appendix). We can observe that the method converges fast in the initial phase, but exhibits a slow and fluctuating convergence when the iterates approach the optimal set. This method originates from a specific splitting method in the standard 2-block ADMM [10]. To provide some understanding of this phenomenon, we show that this method can actually be recovered by an inexact Uzawa method [11]. The augmented Lagrangian function of problem (1) is denoted by L(x, z) = c^T x + (ρ/2)‖Ax − b − z/ρ‖². In each iteration k, the inexact Uzawa method first minimizes a local second-order approximation of the quadratic term in L(x, z^k) with respect to the primal variables x, specifically,\n\nx^{k+1} = arg min_{x_i∈V_i} c^T x + ⟨ρA^T(Ax^k − b − z^k/ρ), x − x^k⟩ + (1/2)‖x − x^k‖²_D,    (2)\n\nand then updates the dual variables by z^{k+1} = z^k − ρ(Ax^{k+1} − b). Setting the proximity parameter ρ = λ/q and the matrix D to the diagonal matrix diag{. . . , λ‖A_i‖², . . .}, we recover the above algorithm from the first-order optimality condition of (2). This equivalence allows us to illustrate that the slow and fluctuating “tail convergence” comes mainly from the inefficiency of such a local approximation of the augmented Lagrangian function when the iterates approach the optimal set.\nOne straightforward idea to resolve this issue is to minimize the augmented Lagrangian function exactly instead of its local approximation, which leads to the classic ALM. There is a line of work focusing on analyzing the convergence of applying the ALM to LP [9, 12, 13]. This method produces a sequence of constrained quadratic programs (QPs) that are difficult to solve. 
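The two update steps of the method in [8] are cheap to transcribe. The following is a sketch, not the authors' implementation: λ and q are treated as free tunable parameters, the matrix is dense NumPy, and every coordinate is taken to be sign-constrained (V_i = [0, ∞) for all i).

```python
import numpy as np

def existing_admm_step(x, z, A, b, c, lam=1.0, q=2.0):
    """One pass of the n-block ADMM of [8]; assumes all V_i = [0, inf)."""
    col_sq = np.sum(A * A, axis=0)                 # ||A_i||^2 for each column i
    inner = A.T @ (b - A @ x) / q - (c - A.T @ z) / lam
    x_new = np.maximum(x + inner / col_sq, 0.0)    # truncation onto V_i = [0, inf)
    z_new = z - (lam / q) * (A @ x_new - b)        # dual update
    return x_new, z_new
```

Each pass costs O(nnz(A)) when A is stored sparsely, which is why the per-iteration cost of this method is low even though its overall rate degrades with m, n.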
The work [9] proves that the proximal Coordinate Descent method can solve each QP at a linear rate even when the matrix A is not of full column rank. However, there are several drawbacks to this approach: (i) the practical solving time of each subproblem is quite long when A is rank-deficient; (ii) the theoretical performance and complexity of combining recent accelerated proximal optimization techniques [14] with the ALM are unknown; (iii) it cannot exploit the specific structure of the matrix A when solving each constrained QP. This motivates us to investigate a new and efficient variable splitting method for this problem.\n\n3 New Splitting Method in ADMM\n\nWe first separate the equality and inequality constraints of the above LP (1) by adding another group of variables y ∈ R^n:\n\nmin c^T x s.t. Ax = b, x = y, y_i ≥ 0, i ∈ [n_b].    (3)\n\nThe dual of problem (3) takes the following form:\n\nmin b^T z_x s.t. −A^T z_x − z_y = c, z_{y,i} ≤ 0, i ∈ [n_b], z_{y,i} = 0, i ∈ [n]\\[n_b].    (4)\n\nLet z_x, z_y be the Lagrange multipliers for the constraints Ax = b and x = y, respectively. Define the indicator function g(y) of the non-negative cone: g(y) = 0 if y_i ≥ 0, ∀i ∈ [n_b]; otherwise g(y) = +∞. Then the augmented Lagrangian function of the primal problem (3) is defined as\n\nL(x, y, z) = c^T x + g(y) + z^T(A_1 x + A_2 y − b̄) + (ρ/2)‖A_1 x + A_2 y − b̄‖²,    (5)\n\nwhere z = [z_x; z_y]. The matrices A_1, A_2 and the vector b̄ are given by\n\nA_1 = [A; I], A_2 = [0; −I], and b̄ = [b; 0].    (6)\n\nIn each iteration k, the standard ADMM goes through the following three steps:\n\n1. Primal update: x^{k+1} = arg min_{x∈R^n} L(x, y^k, z^k).\n2. Primal update: y^{k+1} = arg min_{y∈R^n} L(x^{k+1}, y, z^k).\n3. Dual update: z^{k+1} = z^k + ρ(A_1 x^{k+1} + A_2 y^{k+1} − b̄).\n\nAlgorithm 1 Alternating Direction Method of Multipliers with Inexact Subproblem Solver\n\nInitialize z^0 ∈ R^{m+n}, choose parameter ρ > 0.\nrepeat\n  1. Primal update: find x^{k+1} such that F_k(x^{k+1}) − min_{x∈R^n} F_k(x) ≤ ε_k.\n  2. Primal update: for each i, let y_i^{k+1} = [x_i^{k+1} + z_{y,i}^k/ρ]_{V_i}.\n  3. Dual update: z_x^{k+1} = z_x^k + ρ(Ax^{k+1} − b), z_y^{k+1} = z_y^k + ρ(x^{k+1} − y^{k+1}).\nuntil ‖Ax^{k+1} − b‖_∞ ≤ ε and ‖x^{k+1} − y^{k+1}‖_∞ ≤ ε\n\nThe first step is an unconstrained quadratic program, which can be simplified as\n\nx^{k+1} = arg min_x F_k(x) ≜ c^T x + (z^k)^T A_1 x + (ρ/2)‖A_1 x + A_2 y^k − b̄‖².    (7)\n\nThe gradient of the function F_k(x) can be expressed as\n\n∇F_k(x) = ρ(A^T A + I)x + A_1^T[z^k + ρ(A_2 y^k − b̄)] + c,    (8)\n\nand the Hessian of F_k(x) is\n\n∇²F_k(x) = ρ(A^T A + I).    (9)\n\nFurther, by the first-order optimality condition, the first step is equivalent to solving a linear system, which requires inverting the Hessian matrix (9). In practice, the cost of solving this step exactly is quite high unless the Hessian exhibits some specific structure. Thus, we relax the first step into the inexact minimization: find x^{k+1} such that\n\nF_k(x^{k+1}) − min_{x∈R^n} F_k(x) ≤ ε_k,    (10)\n\nwhere ε_k is the given accuracy. 
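Putting the three steps together, Algorithm 1 can be sketched in a few lines of NumPy. This is a sketch under simplifying assumptions, not the paper's solver: the x-step is solved exactly (ε_k = 0) by factoring the Hessian (9) once and reusing it, all n coordinates are taken sign-constrained (n_b = n), and a dense inverse stands in for a proper Cholesky factorization.

```python
import numpy as np

def new_admm_lp(A, b, c, rho=1.0, tol=1e-6, max_iter=20000):
    """Sketch of Algorithm 1: split Ax = b and x = y, with all y_i >= 0."""
    m, n = A.shape
    H = rho * (A.T @ A + np.eye(n))   # Hessian (9); factor once, reuse every iteration
    Hinv = np.linalg.inv(H)           # dense inverse is fine for a small sketch
    x = np.zeros(n); y = np.zeros(n)
    zx = np.zeros(m); zy = np.zeros(n)
    for _ in range(max_iter):
        # Step 1 (exact): set the gradient (8) to zero, i.e. H x = -rhs
        rhs = A.T @ (zx - rho * b) + (zy - rho * y) + c
        x = -Hinv @ rhs
        # Step 2: n one-dimensional truncations onto V_i = [0, inf)
        y = np.maximum(x + zy / rho, 0.0)
        # Step 3: dual updates for the two constraint blocks
        zx = zx + rho * (A @ x - b)
        zy = zy + rho * (x - y)
        if max(np.abs(A @ x - b).max(), np.abs(x - y).max()) <= tol:
            break
    return x, y, zx, zy
```

On a toy LP such as min x_1 + 2x_2 s.t. x_1 + x_2 = 1, x ≥ 0, the iterates approach the optimum (1, 0), illustrating the absence of the fluctuating tail seen with the splitting of [8].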
Transforming the indicator function g(y) back into constraints, the second step can be separated into n one-dimensional optimization problems: for each i,\n\ny_i^{k+1} = arg min_{y_i∈V_i} −z_{y,i}^k y_i + (ρ/2)(y_i − x_i^{k+1})² = [x_i^{k+1} + z_{y,i}^k/ρ]_{V_i}.\n\nThe resultant algorithm is sketched in Algorithm 1. In some applications, such as l1-regularized SVM and the basis pursuit problem, the objective function contains the l1 norm of the variables. Transforming to the canonical form (1) would introduce an additional n variables and 2n constraints. One important feature of our method is that we can instead split the objective function by adding the variable y. The corresponding subproblems are similar to those of Algorithm 1; the only difference is that the second step becomes n one-dimensional shrinkage operations. (Details can be found in the Appendix.)\n\n4 Convergence Analysis of New ADMM\n\nIn this section, we prove that Algorithm 1 converges at a global and linear rate, and provide a roadmap of the main technical development. We first write the primal problem (3) in the following standard 2-block form:\n\nmin_{x,y} f(x) + g(y) s.t. A_1 x + A_2 y = b̄,    (11)\n\nwhere f(x) = c^T x and g(y) is the indicator function defined before. Most works in the literature prove that the 2-block ADMM converges globally and linearly by assuming that one of the functions f and g is strongly convex [15, 16, 17]. Unfortunately, neither the linear function f nor the indicator function g in the LP satisfies this property, which poses a significant challenge to the current analytical framework. Several recent works try to address this problem in some sense. The work [18] demonstrates that when the dual step size ρ is sufficiently small (impractical), the ADMM converges globally and linearly, but no explicit rate is given. 
The work [13] shows that the ADMM converges locally linearly when applied to LP. They utilize a unique combination of iterates and conduct a spectral analysis. However, they leave open the question of whether the ADMM converges globally and linearly when applied to the LP in the above form.\nIn the sequel, we answer this question positively and provide an accurate analysis of this splitting method. The main technical development is based on a geometric argument: we first prove that the set formed by the optimal primal and dual solutions of LP (3) is a (3n + m)-dimensional polyhedron S*; then we utilize a certain global error bound to simultaneously estimate the distance from the iterates x^{k+1}, y^k, z^k to S*. All detailed proofs are given in the Appendix.\nLemma 1. (Convergence of 2-block ADMM [10]) Let p^k = z^k − ρA_2 y^k. Then\n\n‖p^{k+1} − [p^{k+1}]_{G*}‖² ≤ ‖p^k − [p^k]_{G*}‖² − ‖p^{k+1} − p^k‖²,\n\nwhere G* ≜ {p* ∈ R^{m+n} | T(p*) = p*}, and the definition of the operator T is given in (54) in the Appendix. Moreover, if the LP (3) has a pair of optimal primal and dual solutions, the iterates x^k, y^k and z^k converge to an optimal solution; otherwise, at least one of the iterates is unbounded.\nLemma 1 is tailored from applying the classic Douglas-Rachford splitting method to the LP. This result guarantees that the sequence p^k produced by the ADMM converges globally under a mild assumption. However, to establish a linear convergence rate, the key lies in estimating the reverse-side inequality\n\n‖p^k − [p^k]_{G*}‖ ≤ γ‖p^{k+1} − p^k‖, γ > 0.    (12)\n\nOne can then combine these two results to prove that the sequence p^k converges globally and linearly with ‖p^{k+1} − [p^{k+1}]_{G*}‖² ≤ (1 − 1/γ²)·‖p^k − [p^k]_{G*}‖², which can further be used to show the R-linear convergence of the iterates x^k, y^k and z^k. To estimate the constant γ, we first describe the geometry formed by the optimal primal solutions x*, y* and dual solutions z* of the LP (3).\nLemma 2. (Geometry of the optimal solution set of LP) The variables (x*, y*) are optimal primal solutions and z* is an optimal dual solution of LP (3) if and only if (i) Ax* = b, x* = y*; (ii) −A^T z_x* − z_y* = c; (iii) y_i* ≥ 0, z_{y,i}* ≤ 0, i ∈ [n_b]; z_{y,i}* = 0, i ∈ [n]\\[n_b]; (iv) c^T x* + b^T z_x* = 0.\nIn Lemma 2, one interesting element is the use of the strong duality condition (iv) to eliminate the complementary slackness in the standard KKT conditions. The set of optimal primal and dual solutions is then described only by affine constraints, which further implies that the optimal solution set is an (m + 3n)-dimensional polyhedron. We use S* to denote this polyhedron.\nLemma 3. (Hoffman bound [19, 20]) Consider a polyhedral set S = {x ∈ R^d | Ex = t, Cx ≤ d}. For any point x ∈ R^d, we have\n\n‖x − [x]_S‖ ≤ θ_S ‖[Ex − t; [Cx − d]_+]‖,    (13)\n\nwhere θ_S is the Hoffman constant, which depends on the structure of the polyhedron S.\nAccording to Lemma 2, it seems that we can use the Hoffman bound to estimate the distance between the current iterates (x^k, y^k, z^k) and the solution set S* via their primal and dual residuals. However, to obtain the form of inequality (12), we need to bound these residuals in terms of ‖p^k − p^{k+1}‖. Indeed, we have the following result.\nLemma 4. (Estimation of residuals) The sequence (x^{k+1}, y^k, z^k) produced by Algorithm 1 satisfies\n\nA_1 x^{k+1} + A_2 y^k − b̄ = (p^{k+1} − p^k)/ρ,\nc + A_1^T z^k = A_1^T(p^k − p^{k+1}),\nc^T x^{k+1} + b^T z_x^k = (A_1 x^{k+1} − z^k/ρ)^T(p^k − p^{k+1}),\ny_i^k ≥ 0, z_{y,i}^k ≤ 0, i ∈ [n_b]; z_{y,i}^k = 0, i ∈ [n]\\[n_b].\n\nOne observation from Lemma 4 is that Algorithm 1 automatically preserves the boundedness and the complementary slackness of both primal and dual iterates. In contrast, the previous algorithm in [8] does not preserve complementary slackness during the iterations. Combining the results of Lemma 2, Lemma 3 and Lemma 4, we are ready to estimate the constant γ.\nLemma 5. 
(Estimation of the linear rate) The sequence p^k = z^k − ρA_2 y^k produced by Algorithm 1 satisfies ‖p^k − [p^k]_{G*}‖ ≤ γ‖p^{k+1} − p^k‖, where the rate γ is given by\n\nγ = (1 + ρ)[(R_z + 1)/ρ + R_x‖A_1‖ + ‖A_1^T‖]θ_{S*},    (14)\n\nand R_x = sup_k ‖x^k‖ < +∞, R_z = sup_k ‖z^k‖ < +∞ are the maximum radii of the iterates x^k and z^k.\nWe can then establish the global and linear convergence of Algorithm 1.\nTheorem 1. (Linear convergence of Algorithm 1) Denote by z^k the dual iterates produced by Algorithm 1. To guarantee that there exists an optimal dual solution z* such that ‖z^k − z*‖ ≤ ε, it suffices to run Algorithm 1 for K = 2γ² log(2D_0/ε) iterations with the solving accuracy ε_k satisfying ε_k ≤ ε²/8K², where D_0 = ‖p^0 − [p^0]_{G*}‖.\nThe proof of Theorem 1 consists of two steps: first, we establish the global and linear convergence rate of Algorithm 1 when ε_k = 0, ∀k (exact subproblem solver); then we relax this condition and prove that when ε_k is below a specified threshold, the algorithm still enjoys a convergence rate of the same order. 
The results for the primal iterates x^k and y^k are similar.\n\n5 Efficient Subproblem Solver\n\nIn this section, we show that, due to our specific splitting method, each subproblem in line 1 of Algorithm 1 can either be solved in closed form or be solved efficiently by the Accelerated Coordinate Descent Method.\n\n5.1 Well-structured Constraint Matrix\n\nSetting the gradient (8) to zero, the primal iterate x^{k+1} is exactly determined by\n\nx^{k+1} = ρ^{-1}(I + A^T A)^{-1} d^k, with d^k = −A_1^T[z^k + ρ(A_2 y^k − b̄)] − c,    (15)\n\nwhich requires inverting an n × n positive definite matrix I + A^T A, or equivalently, inverting an m × m positive definite matrix I + AA^T via the following Sherman-Morrison-Woodbury identity:\n\n(I + A^T A)^{-1} = I − A^T(I + AA^T)^{-1}A.    (16)\n\nOne basic fact is that we only need to invert such a matrix once and can then reuse the cached factorization in subsequent iterations. There are therefore several cases in which the above factorization can be calculated efficiently: (i) The factorization has a closed-form expression. For example, in LP-based MAP inference [5], the matrix I + A^T A is block diagonal, and each block has been shown to possess a closed-form factorization. Another important application is the basis pursuit problem, where encoding matrices such as the DFT (discrete Fourier transform) and DWHT (discrete Walsh-Hadamard transform) matrices have orthonormal rows and satisfy AA^T = I. Based on (16), each x^{k+1} = ρ^{-1}(I − (1/2)A^T A)d^k, which can be calculated in O(n log(n)) time by certain fast transforms. (ii) The factorization has low complexity: the dimension m (or n) is small, e.g., m = 10^4. Such a factorization can be calculated in O(m³) time, and the complexity of each iteration is only O(nnz(A) + m²). Detailed applications can be found in the Appendix.\nRemark 1. 
In the traditional Augmented Lagrangian method, the resultant subproblem is a constrained and non-strongly convex QP (the Hessian is not invertible), which does not admit the above closed-form expression. Besides, in the ALCD [9], the coordinate descent (CD) step only picks one column in each iteration and cannot exploit the nice structure of the matrix A. One idea is to modify the CD step in [9] into a proximal gradient descent step. However, this would greatly increase the computation time due to the large number of inner gradient descent steps.\n\n5.2 General Constraint Matrix\n\nIn other applications, the constraint matrix A only exhibits sparsity and is difficult to invert. To resolve this issue, we resort to the current fastest accelerated coordinate descent method [21]. This method has an iteration-complexity improvement of up to O(√n) compared with previous accelerated coordinate descent methods [22]. However, a naive evaluation of a partial derivative of the function F_k(x) in the ACDM takes O(nnz(A)) time, and the full vector operation in each iteration of the ACDM costs O(n) time. We show that these difficulties can be tackled by a carefully designed implementation technique¹, and the main procedure is listed in Algorithm 2. Here
Here\nthe iterates st and matrix M in Algorithm 2 is de\ufb01ned as\n\n\u221a\n\n(cid:20)1 \u2212 \u03b1v\n\n\u03b2u\n\nM =\n\n(cid:21)\n\n\u03b1v\n1 \u2212 \u03b2u\n\n(cid:20)\u03b1v\n\n(cid:21)\n\n\u03b2u\n\nwith\n\n=\n\nand si\n\nt =\n\n1This technique is motivated by [22].\n\n(cid:34)(cid:16)\n\n(cid:17)\u2207iFk(ut)eT\n\ni\n\n(cid:35)\n\n\u03b7\u03c4\n\npi(1+\u03b7\u03c1) + 1\u2212\u03c4\n\nLi\n\npi(1+\u03b7\u03c1)\u2207iFk(ut)eT\n\n\u03b7\n\ni\n\n,\n\n(17)\n\n(cid:20) \u03c4\n\n(cid:21)\n\n1+\u03b7\u03c1\n\n\u03b7\u03c1\n\n1+\u03b7\u03c1\n\n6\n\n\fInitialize u0, v0, u0 = Au0, v0 = Av0, matrix M, parameter \u03c4, \u03b7, S by (17) and distribution\n\nrepeat\n\nAlgorithm 2 Ef\ufb01ciently Subproblem Solver\n\np = [. . . ,(cid:112)1 + (cid:107)Ai(cid:107)2/S, . . . ] and let dk = AT\n(cid:21)\n\n[ut, vt]T = Mt\u22121 \u00b7 [u, v]T and [ut, vt]T = Mt\u22121 \u00b7 [u, v]T .\nSample i from [n] based on probability distribution p.\n\u2207iFk(ut) = \u03c1(Ai)T ut + \u03c1ut,i + dk\ni , and calculate si\n\u2212 M\u22121\nMt = M \u00b7 Mt\u22121. Update\nt,\nt si\n=\n\n(cid:20)uT\n\n(cid:20)uT\n\n1 [zk + \u03c1(A2yk \u2212 b)] + c.\n(cid:21)\n\n(cid:20)uT\n\n(cid:20)uT\n\nt by (17).\n=\n\n(cid:21)\n\n(cid:21)\n\nvT\n\nvT\n\nvT\n\nvT\n\nuntil Converge\nOutput xk+1 = (uT \u2212 \u03c4 vT )/(1 \u2212 \u03c4 ).\n\n\u2212 M\u22121\nt si\n\ntAT ,\n\nwhere \u03b7 = 1\n\n\u03c4 S2 , \u03c4 =\n\n\u221a\n\n1+\n\n2\n\n4S2/\u03c1+1\n\nLemma 6. (Inner complexity) In each iteration of Algorithm 2, if the current picked coordinate\nis i, the update can be \ufb01nished in O(nnz(Ai)) time, moreover, to guarantee that Fk(xk+1) \u2212\nminx Fk(x) \u2264 \u0001k with probability 1 \u2212 p, it suf\ufb01ces to run Algorithm 2 for number of iterations\n\n, S =(cid:80)n\n\ni=1\n\n(cid:112)(cid:107)Ai(cid:107)2 + 1. 
See more details in Appendix.\n(cid:19)\n\n0 = (cid:107)F k(u0) \u2212 min\n\nF k(x)(cid:107).\n\n, Dk\n\nx\n\n(cid:18) Dk\n\n0\n\u0001kp\n\n(18)\n\nTk \u2265 O(1) \u00b7 n(cid:88)\n\ni=1\n\n(cid:107)Ai(cid:107) log\n\nThe above iteration complexity is obtained by choosing parameter \u03b2 = 0 in [21] and utilizing the\nTheorem 1 in [23] to transform the convergence in expectation to the form of probability.\nTheorem 2. (Overall complexity) Denote zk as the dual iterates produced by Algorithm 1. To\nguarantee that there exists an optimal solution z\u2217 such that (cid:107)zk \u2212 z\u2217(cid:107) \u2264 \u0001 with probability 1 \u2212 p, it\nsuf\ufb01ces to run Algorithm 1 for k \u2265 2\u03b32 log(2D0/\u0001) outer iterations and solve each sub-problem (7)\nfor the number of inner iterations\n\nT \u2265 O(1) \u00b7 n(cid:88)\n\ni=1\n\n(cid:32)\n\n(cid:18) 2D0\n\n(cid:19)(cid:33)\n\n\u0001\n\n(cid:107)Ai(cid:107) log\n\n3 \u03b32\n\n0 ) 1\n\u03c1(Dk\n\u0001 2\n3 p 1\n\n3\n\nlog\n\n.\n\n(19)\n\nThe results for the primal iterates xk and yk are similar. In the existing ADMM [8], each primal and\ndual update only requires O(nnz(A)) time to solve. The complexity of this method is\n\n\u221a\nO(am\u00b52(amRx + dmRz)2(\n\nmn + (cid:107)A(cid:107)F )2nnz(A) log(1/\u0001)),\n\nwhere am = maxi (cid:107)Ai(cid:107), dm is the largest number of non-zero elements of each row of matrix A,\nand \u00b5 is the Hoffman constant depends on the optimal solution set of LP. Based on Theorem 2, an\nestimation of the worst-case complexity of Algorithm 1 is\n\nO(am\u03b82\n\nS\u2217 (Rx(cid:107)A(cid:107) + Rz)2nnz(A) log2(1/\u0001)).\n\nRemark that our method has a weak dependence on the problem dimension compared with the\nexisting ADMM. 
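As an aside, the Sherman-Morrison-Woodbury identity (16) behind the cached factorization in Section 5.1 is easy to sanity-check numerically; the dimensions and random matrix below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 9                      # inverting m x m instead of n x n pays off when m < n
A = rng.standard_normal((m, n))

# (I + A^T A)^{-1} = I - A^T (I + A A^T)^{-1} A, identity (16)
lhs = np.linalg.inv(np.eye(n) + A.T @ A)
rhs = np.eye(n) - A.T @ np.linalg.inv(np.eye(m) + A @ A.T) @ A
assert np.allclose(lhs, rhs)
```

This is what makes the per-iteration cost O(nnz(A) + m²) once the m × m factorization is cached.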
Since the spectral and Frobenius norms satisfy ‖A‖ ≤ ‖A‖_F, our method is faster than the one in [8].\n\n6 Numerical Results\n\nIn this section, we examine the performance of our algorithm and compare it with state-of-the-art algorithms developed for solving the LP. The first is the existing ADMM in [8]. The second is the ALCD method in [9], which is reported to be the current fastest first-order LP solver; it has been shown to significantly speed up solving several important machine learning problems compared with the Simplex method and the IPM. We name our Algorithm 1 LPADMM. In the experiments, we require the accuracy of the subproblem solver to be ε_k = 10^{-3}, and the stopping criterion is that both the primal residual ‖A_1 x^k + A_2 y^k − b̄‖_∞ and the dual residual ‖A_1^T z^k + c‖_∞ are less than 10^{-3}. All the LP instances are generated from the basis pursuit, L1 SVM, SICE and NMF problems. The data sources and statistics are included in the supplementary material.\n\nFigure 1: The duality gap versus the number of iterations. From left to right, the figures show the BP, NMF, L1 SVM and SICE problems.\nTable 1: Timing Results for the BP, SICE, NMF and L1 SVM Problems (in sec.; 
long means > 60 hours)\n\nData     | m        | n        | nnz(A)    | LPADMM Time | LPADMM Iter. | ALCD Time | ALCD Iter. | ADMM Time | ADMM Iter.\nbp1      | 17408    | 16384    | 8421376   | 22    | 3155  | 864   | 14534  | long  | long\nbp2      | 34816    | 32768    | 33619968  | 79    | 4657  | 2846  | 19036  | long  | long\nbp3      | 69632    | 65536    | 134348800 | 217   | 6287  | 12862 | 24760  | long  | long\narcene   | 50095    | 30097    | 1151775   | 801   | 15198 | 1978  | 176060 | 21329 | 2035415\nreal-sim | 176986   | 135072   | 7609186   | 955   | 4274  | 1906  | 18262  | 19697 | 249363\nsonar    | 80912    | 68224    | 2756832   | 258   | 5446  | 659   | 13789  | 3828  | 151972\ncolon    | 217580   | 161040   | 8439626   | 395   | 216   | 455   | 1288   | 7423  | 83680\nw2a      | 12048256 | 12146960 | 167299110 | 19630 | 2525  | 45388 | 8492   | long  | long\nnews20   | 2785205  | 2498375  | 53625267  | 7765  | 2205  | 9173  | 6174   | long  | long\n\nWe first compare the convergence rates of the different algorithms on the above problems. We use bp1 for the BP problem, the data set colon cancer for the NMF problem, news20 for the L1 SVM problem and real-sim for the SICE problem. We set the proximity parameter ρ = 1. We adopt the relative duality gap as the comparison metric, defined as ‖c^T x^k + b^T z_x^k‖/‖c^T x*‖, where x* is obtained approximately by running our method with a strict stopping condition. In our simulations, one iteration represents n coordinate descent steps for ALCD and LPADMM, and one dual updating step for ADMM. As can be seen in Fig. 1, our new method exhibits a global and linear convergence rate, matching our theoretical performance bound. Moreover, it converges faster than both the ALCD and the existing ADMM method, especially on the BP and NMF problems. A sensitivity analysis of ρ is given in the Appendix.\nWe next examine the performance of our algorithm from the perspective of time efficiency (both clock time and number of iterations). We adopt the dynamic step size rule for ALCD to optimize its performance. 
Note that, by exchanging the roles of the primal and dual problems in (3), we can obtain the dual versions of both ADMM and ALCD, which can be used to tackle primal- or dual-sparse problems. We run both versions and report the minimum time. The stopping criterion requires that the primal and dual residuals and the relative duality gap are less than $10^{-3}$. The data sets bp1, bp2 and bp3 are used for the basis pursuit problem; news20 is used for the L1 SVM problem; arcene and real-sim are used for the SICE problem; sonar, colon and w2a are used for the NMF problem. Across all experiments, we observe that our proposed algorithm requires approximately 10%-40% of the iterations and 10%-85% of the time of the ALCD method, and becomes particularly advantageous for the basis pursuit problem (50x speedup) and for ill-posed problems such as the SICE and NMF problems. In particular, for the basis pursuit problem, the primal iterate $x^k$ is updated by the closed-form expression (15), which can be calculated in $O(n \log n)$ time by the fast Walsh-Hadamard transform.

7 Conclusions

In this paper, we proposed a new variable splitting method to solve the linear programming problem. The main theoretical contribution of this work is a proof that 2-block ADMM converges globally and linearly when applied to the linear program. The obtained convergence rate has only a weak dependence on the problem dimension and improves on the best known result. Compared with existing LP solvers, our algorithm not only provides the flexibility to exploit the specific structure of the constraint matrix $A$, but can also be naturally combined with existing acceleration techniques to significantly speed up the solution of large-scale machine learning problems.
Future work will focus on generalizing our theoretical framework to establish a global linear convergence rate when applying ADMM to solve convex quadratic programs.

Acknowledgments: This work is supported by ONR N00014-17-1-2417, N00014-15-1-2166, NSF CNS-1719371 and ARO W911NF-1-0277.

References

[1] Ben Recht, Christopher Re, Joel Tropp, and Victor Bittorf. Factoring nonnegative matrices with linear programs. In Advances in Neural Information Processing Systems, pages 1214-1222, 2012.

[2] Ji Zhu, Saharon Rosset, Trevor Hastie, and Robert Tibshirani. 1-norm support vector machines. In NIPS, volume 15, pages 49-56, 2003.

[3] Ming Yuan. High dimensional inverse covariance matrix estimation via linear programming. Journal of Machine Learning Research, 11(Aug):2261-2286, 2010.

[4] Junfeng Yang and Yin Zhang. Alternating direction algorithms for l1-problems in compressive sensing. SIAM Journal on Scientific Computing, 33(1):250-278, 2011.

[5] Ofer Meshi and Amir Globerson. An alternating direction method for dual MAP LP relaxation. Machine Learning and Knowledge Discovery in Databases, pages 470-483, 2011.

[6] Vânia Lúcia Dos Santos Eleutério. Finding approximate solutions for large scale linear programs. PhD thesis, ETH Zurich, 2009.

[7] Roland Glowinski and A. Marroco. Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité d'une classe de problèmes de Dirichlet non linéaires.
Revue française d'automatique, informatique, recherche opérationnelle. Analyse numérique, 9(2):41-76, 1975.

[8] Jonathan Eckstein, Dimitri P. Bertsekas, et al. An alternating direction method for linear programming. 1990.

[9] Ian En-Hsu Yen, Kai Zhong, Cho-Jui Hsieh, Pradeep K. Ravikumar, and Inderjit S. Dhillon. Sparse linear programming via primal and dual augmented coordinate descent. In Advances in Neural Information Processing Systems, pages 2368-2376, 2015.

[10] Jonathan Eckstein and Dimitri P. Bertsekas. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1):293-318, 1992.

[11] Wotao Yin. Analysis and generalizations of the linearized Bregman method. SIAM Journal on Imaging Sciences, 3(4):856-877, 2010.

[12] O. Güler. Augmented Lagrangian algorithms for linear programming. Journal of Optimization Theory and Applications, 75(3):445-470, 1992.

[13] Daniel Boley. Local linear convergence of the alternating direction method of multipliers on quadratic or linear programs. SIAM Journal on Optimization, 23(4):2183-2207, 2013.

[14] Qihang Lin, Zhaosong Lu, and Lin Xiao. An accelerated proximal coordinate gradient method. In Advances in Neural Information Processing Systems, pages 3059-3067, 2014.

[15] Robert Nishihara, Laurent Lessard, Benjamin Recht, Andrew Packard, and Michael I. Jordan. A general analysis of the convergence of ADMM. In ICML, pages 343-352, 2015.

[16] Tianyi Lin, Shiqian Ma, and Shuzhong Zhang. On the global linear convergence of the ADMM with multiblock variables. SIAM Journal on Optimization, 25(3):1478-1497, 2015.

[17] Wei Deng and Wotao Yin. On the global and linear convergence of the generalized alternating direction method of multipliers. Journal of Scientific Computing, 66(3):889-916, 2016.

[18] Mingyi Hong and Zhi-Quan Luo.
On the linear convergence of the alternating direction method of multipliers. Mathematical Programming, pages 1-35, 2012.

[19] Alan J. Hoffman. On approximate solutions of systems of linear inequalities. Journal of Research of the National Bureau of Standards, 49(4), 1952.

[20] Wu Li. Sharp Lipschitz constants for basic optimal solutions and basic feasible solutions of linear programs. SIAM Journal on Control and Optimization, 32(1):140-153, 1994.

[21] Zeyuan Allen-Zhu, Zheng Qu, Peter Richtárik, and Yang Yuan. Even faster accelerated coordinate descent using non-uniform sampling. In Proceedings of The 33rd International Conference on Machine Learning, pages 1110-1119, 2016.

[22] Yin Tat Lee and Aaron Sidford. Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In Foundations of Computer Science (FOCS), 2013 IEEE 54th Annual Symposium on, pages 147-156. IEEE, 2013.

[23] Peter Richtárik and Martin Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2):1-38, 2014.