ADMM: the OSQP Solver Explained!

Optimization
Author

Wangkun Xu

Published

September 4, 2024

Modified

September 28, 2024

This post introduces the OSQP solver, an operator splitting solver for quadratic programs. The original paper is Stellato, Bartolomeo, et al. “OSQP: An operator splitting solver for quadratic programs.” Mathematical Programming Computation 12.4 (2020): 637-672. We also borrow the basic ideas from the ADMM review Boyd, Stephen, et al. “Distributed optimization and statistical learning via the alternating direction method of multipliers.” Foundations and Trends® in Machine Learning 3.1 (2011): 1-122.

Compared to other ADMM-based algorithms, OSQP has several notable features. For example,

  1. The algorithm places no requirements on the problem data, such as positive definiteness of the objective function or the LICQ condition on the constraints, meaning that it is applicable to the entire class of convex quadratic programs.
  2. The algorithm requires the solution of a quasi-definite linear system with the same coefficient matrix at almost every iteration, which is very efficient (usually ten times faster than interior-point methods).
  3. The method also supports factorization caching and warm starting, making it suitable for parametric programming.

In contrast, interior-point methods are not easily warm started and do not scale well to very large problems. ADMM, instead, has been shown to reliably provide modestly accurate solutions to QPs in a relatively small number of computationally inexpensive iterations.

The drawback of ADMM (or any other first-order method) is that its convergence is highly dependent on the choice of step-size parameter and the problem data.

Introduction

The OSQP considers the following QP problem, \[ \begin{array}{ll} \operatorname{minimize} & (1 / 2) x^T P x+q^T x \\ \text { subject to } & l \leq A x \leq u, \end{array} \] where \(P \in \mathbb{S}^{n}_{+}\), \(q \in \mathbb{R}^{n}\), \(A \in \mathbb{R}^{m \times n}\), \(l \in (\mathbb{R} \cup \{-\infty\})^{m}\), and \(u \in (\mathbb{R} \cup \{+\infty\})^{m}\). Note that linear programming can be considered as a special case with \(P=0\).
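As a quick usage example, a problem in exactly this form can be passed to the osqp Python package. Below is a minimal sketch; the toy data is arbitrary, chosen only so that \(P\) is positive definite.

import numpy as np
import osqp
from scipy import sparse

# a toy QP: minimize (1/2) x^T P x + q^T x  subject to  l <= A x <= u
P = sparse.csc_matrix([[4.0, 1.0], [1.0, 2.0]])
q = np.array([1.0, 1.0])
A = sparse.csc_matrix([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
l = np.array([1.0, 0.0, 0.0])
u = np.array([1.0, 0.7, 0.7])

prob = osqp.OSQP()
prob.setup(P, q, A, l, u)   # P and A must be in sparse CSC format
res = prob.solve()
print(res.x)                # primal solution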

Equivalently, the problem can be rewritten by introducing the slack variable \(z \in \mathbb{R}^m\), \[ \begin{array}{ll} \operatorname{minimize} & (1 / 2) x^T P x+q^T x \\ \text { subject to } & A x=z \\ & l \leq z \leq u \end{array} \]

Optimality Conditions

The KKT conditions of the original problem are \[\begin{aligned} Px + q + A^T y^u - A^T y^l & = 0 \\ Ax - u & \leq 0 \\ -Ax + l & \leq 0 \\ y^u \circ (Ax - u) & = 0 \\ y^l \circ (-Ax + l) & = 0 \\ y^u & \geq 0 \\ y^l & \geq 0 \end{aligned}\]

where \(y^u\) and \(y^l\) are the dual variables associated with the upper and lower constraints.

Now let \(y=y^u - y^l\) and let \(Ax = z\). The first KKT condition becomes \(Px + q +A^Ty = 0\). The second and third conditions become \(l \leq z \leq u\). Moreover, when \(y^u_i > 0\), we have \(A_ix=u_i\). If at the same time \(y^l_i > 0\), we also have \(A_ix=l_i\), which means that \(l_i=u_i\), i.e., the \(i\)-th constraint is an equality constraint. Otherwise, we must have \(y^l_i=0\). Consequently, writing \(y_+ = \max(y, 0)\) and \(y_- = \min(y, 0)\) (elementwise), the complementary slackness conditions become \(y_+\circ(z-u)=0\) and \(y_-\circ(z-l)=0\). To sum up, the KKT conditions can be rewritten as \[ \begin{aligned} Px + q + A^Ty &= 0 \\ Ax &= z \\ l \leq z &\leq u \\ y_+ \circ (z-u) &= 0 \\ y_- \circ (z-l) &= 0 \end{aligned} \] where \(y\) is the dual variable associated with the equality constraint \(Ax = z\).

Moreover, we can define the primal and dual residuals as \[ \begin{aligned} r_{\text {prim }} & =A x-z \\ r_{\text {dual }} & =P x+q+A^T y \end{aligned} \]
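These residuals are cheap to evaluate numerically. Below is a minimal numpy sketch; the function name residuals is purely illustrative.

import numpy as np

def residuals(P, q, A, x, z, y):
    # primal residual: violation of the constraint Ax = z
    r_prim = A @ x - z
    # dual residual: violation of the stationarity condition
    r_dual = P @ x + q + A.T @ y
    return r_prim, r_dual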

Solution with ADMM

By introducing auxiliary variables \(\tilde{x}\in\mathbb{R}^n\) and \(\tilde{z}\in\mathbb{R}^m\), we can form an equivalent problem, \[ \begin{array}{ll} \operatorname{minimize} & f(\tilde{x},\tilde{z}) + \mathcal{I}_\mathcal{C}(z) \\ \text{ subject to } & (\tilde{x},\tilde{z}) = (x,z) \end{array} \] where \(f(\tilde{x},\tilde{z}) = (1/2)\tilde{x}^T P \tilde{x} + q^T \tilde{x}\) with domain \(\operatorname{dom} f = \{(\tilde{x},\tilde{z}):A\tilde{x} = \tilde{z}\}\), \(\mathcal{C} = [l, u]\) is the box defined by the bounds, and the indicator function is \[ \mathcal{I}_{\mathcal{C}}(z)= \begin{cases}0 & z \in \mathcal{C} \\ +\infty & \text { otherwise }\end{cases} \]

Similar to the standard ADMM, we can view \(\tilde{x}\) and \(\tilde{z}\) as the primal variables, and \(x\) and \(z\) as the slack variables. The first step is to update \(\tilde{x}\) and \(\tilde{z}\), and the second step is to update \(x\) and \(z\), both by minimising the augmented Lagrangian function. The third step is to update the dual variables \(w\) and \(y\).

In detail, the unscaled ADMM algorithm is as follows, \[ \begin{aligned} &\left(\tilde{x}^{k+1}, \tilde{z}^{k+1}\right) \leftarrow \underset{(\tilde{x}, \tilde{z}): A \tilde{x}=\tilde{z}}{\operatorname{argmin}}(1 / 2) \tilde{x}^T P \tilde{x}+q^T \tilde{x} \\ &+(\sigma / 2)\left\|\tilde{x}-x^k+\sigma^{-1} w^k\right\|_2^2+(\rho / 2)\left\|\tilde{z}-z^k+\rho^{-1} y^k\right\|_2^2 \end{aligned} \] \[ x^{k+1} \leftarrow \alpha \tilde{x}^{k+1}+(1-\alpha) x^k+\sigma^{-1} w^k \] \[ z^{k+1} \leftarrow \Pi\left(\alpha \tilde{z}^{k+1}+(1-\alpha) z^k+\rho^{-1} y^k\right) \] \[ w^{k+1} \leftarrow w^k+\sigma\left(\alpha \tilde{x}^{k+1}+(1-\alpha) x^k-x^{k+1}\right) \] \[ y^{k+1} \leftarrow y^k+\rho\left(\alpha \tilde{z}^{k+1}+(1-\alpha) z^k-z^{k+1}\right) \] where

  • \(\sigma>0\) and \(\rho>0\) are the penalty/step parameters.
  • \(\alpha\in(0,2)\) is the relaxation parameter.
  • \(\Pi\) is the projection operator onto the feasible set \(\mathcal{C}\).
  • \((w^k,y^k)\) are the dual variables associated with the \(\tilde{x} = x\) and \(\tilde{z} = z\) constraints.
  • It can be shown that \(w^{k+1}=0\) for all \(k\geq0\). Therefore, the \(w\) update can be omitted, as in the sketch after this list.
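To make the updates concrete, here is a minimal PyTorch sketch of one iteration with \(w^k=0\) eliminated. The function admm_step, its signature, and the dense torch.linalg.solve call are illustrative assumptions, not the actual OSQP implementation (which caches a sparse factorization, as discussed below).

import torch

def admm_step(P, q, A, l, u, x, z, y, rho=0.1, sigma=1e-6, alpha=1.6):
    # illustrative one-iteration sketch with dense tensors; not OSQP's code
    n, m = P.shape[0], A.shape[0]
    # KKT matrix of the equality-constrained subproblem
    K = torch.cat([torch.cat([P + sigma * torch.eye(n), A.T], dim=1),
                   torch.cat([A, -(1.0 / rho) * torch.eye(m)], dim=1)], dim=0)
    b = torch.cat([sigma * x - q, z - y / rho])
    sol = torch.linalg.solve(K, b)          # in practice: factorize K once
    x_tilde, v = sol[:n], sol[n:]
    z_tilde = z + (v - y) / rho
    # relaxed updates; the projection onto C = [l, u] is a simple clamp
    x_next = alpha * x_tilde + (1 - alpha) * x
    z_relax = alpha * z_tilde + (1 - alpha) * z
    z_next = torch.clamp(z_relax + y / rho, min=l, max=u)
    y_next = y + rho * (z_relax - z_next)
    return x_next, z_next, y_next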

Solving the linear system

The introduction of \(\tilde{x}\) and \(\tilde{z}\) ensures that the first update is always solvable, even when \(P=0\) and \(A\) is not full row rank (i.e., the LICQ condition does not hold).

The \(\tilde{x},\tilde{z}\) update solves an equality-constrained QP, whose KKT conditions are given as (note that \(w^k=0\)): \[ \begin{aligned} P \tilde{x}^{k+1}+q+\sigma\left(\tilde{x}^{k+1}-x^k\right)+A^T v^{k+1} & =0 \\ \rho\left(\tilde{z}^{k+1}-z^k\right)+y^k-v^{k+1} & =0 \\ A \tilde{x}^{k+1}-\tilde{z}^{k+1} & =0 \end{aligned} \] where \(v^{k+1}\) is the dual variable associated with the equality constraint. Eliminating \(\tilde{z}^{k+1}\) results in \[ \underbrace{\left[\begin{array}{cc} P+\sigma I & A^T \\ A & -\rho^{-1} I \end{array}\right]}_{K}\left[\begin{array}{c} \tilde{x}^{k+1} \\ v^{k+1} \end{array}\right]=\underbrace{\left[\begin{array}{c} \sigma x^k-q \\ z^k-\rho^{-1} y^k \end{array}\right]}_{b} \] and \[ \tilde{z}^{k+1}=z^k+\rho^{-1}\left(v^{k+1}-y^k\right) \]

Note that \(K\) is always invertible (it is quasi-definite). For a given problem, if \(\rho\) and \(\sigma\) are unchanged during the iterations, we can factorize \(K\) once and reuse the factorization at each iteration. For example, in PyTorch,

import torch
# factor K = [[P + sigma*I, A^T], [A, -I/rho]] once (dense square tensor)
LU, pivots = torch.linalg.lu_factor(K)
# back-substitute for each new right-hand side b (shape (n+m, 1))
x = torch.linalg.lu_solve(LU, pivots, b)

which significantly reduces the computational cost: the dense factorization costs \(O((n+m)^3)\), while each subsequent solve is only \(O((n+m)^2)\). Interior-point methods repeat this factorization cost at each iteration, but OSQP factorizes only once (or whenever \(\rho\) and \(\sigma\) change).

Stopping criterion

The algorithm terminates when the primal and dual residuals are sufficiently small, \[ \left\|r_{\text {prim }}^k\right\|_{\infty} \leq \varepsilon_{\text {prim }}, \quad\left\|r_{\text {dual }}^k\right\|_{\infty} \leq \varepsilon_{\text {dual }} \] with \[ \begin{aligned} \varepsilon_{\text {prim }} & =\varepsilon_{\text {abs }}+\varepsilon_{\text {rel }} \max \left\{\left\|A x^k\right\|_{\infty},\left\|z^k\right\|_{\infty}\right\} \\ \varepsilon_{\text {dual }} & =\varepsilon_{\text {abs }}+\varepsilon_{\text {rel }} \max \left\{\left\|P x^k\right\|_{\infty},\left\|A^T y^k\right\|_{\infty},\|q\|_{\infty}\right\} \end{aligned} \]
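In code, the check might look like the following numpy sketch. The default tolerances \(\varepsilon_{\text{abs}}=\varepsilon_{\text{rel}}=10^{-3}\) mirror OSQP's documented defaults, but the function name and signature are illustrative.

import numpy as np

def converged(P, q, A, x, z, y, eps_abs=1e-3, eps_rel=1e-3):
    inf = np.inf
    r_prim = np.linalg.norm(A @ x - z, inf)
    r_dual = np.linalg.norm(P @ x + q + A.T @ y, inf)
    # tolerances scale with the magnitudes of the current iterates
    eps_prim = eps_abs + eps_rel * max(np.linalg.norm(A @ x, inf),
                                       np.linalg.norm(z, inf))
    eps_dual = eps_abs + eps_rel * max(np.linalg.norm(P @ x, inf),
                                       np.linalg.norm(A.T @ y, inf),
                                       np.linalg.norm(q, inf))
    return r_prim <= eps_prim and r_dual <= eps_dual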

Solution Polishing

Assuming that the ‘less accurate’ solution \((x,y,z)\) correctly identifies the active constraints, we can refine it by solving a new QP in which the known active constraints are imposed as equality constraints.

Given the dual solution, we can define the set of lower- and upper-active constraints as \[ \begin{aligned} \mathcal{L} & =\left\{i \in\{1, \ldots, m\} \mid y_i<0\right\} \\ \mathcal{U} & =\left\{i \in\{1, \ldots, m\} \mid y_i>0\right\} \end{aligned} \]
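In numpy, extracting these index sets from the dual vector is a one-liner each (a trivial sketch, with y the dual solution):

import numpy as np

# indices of lower- and upper-active constraints from the dual vector y
L = np.where(y < 0)[0]
U = np.where(y > 0)[0]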

Once the set of active constraints is known, the solution can be polished by solving \[ \left[\begin{array}{ccc} P & A_{\mathcal{L}}^T & A_{\mathcal{U}}^T \\ A_{\mathcal{L}} & \\ A_{\mathcal{U}} & & \end{array}\right]\left[\begin{array}{c} x \\ y_{\mathcal{L}} \\ y_{\mathcal{U}} \end{array}\right]=\left[\begin{array}{c} -q \\ l_{\mathcal{L}} \\ u_{\mathcal{U}} \end{array}\right] \] \[ y_i=0, \quad i \notin(\mathcal{L} \cup \mathcal{U}) \] \[ z=A x \]

The above system is much smaller than the original KKT system (e.g., its size is reduced from \(n+2m\) to at most \(2n\) when the problem is not degenerate).

However, the above system is not always solvable, e.g., when \(P\) is not positive definite or when the problem is degenerate. In this case, a small perturbation \(\delta I\) can be added, \[ \underbrace{\left[\begin{array}{ccc} P+\delta I & A_{\mathcal{L}}^T & A_{\mathcal{U}}^T \\ A_{\mathcal{L}} & -\delta I & \\ A_{\mathcal{U}} & & -\delta I \end{array}\right]}_{K+\Delta K} \left[\begin{array}{c} \hat{x} \\ \hat{y}_{\mathcal{L}} \\ \hat{y}_{\mathcal{U}} \end{array}\right] = \underbrace{\left[\begin{array}{c} -q \\ l_{\mathcal{L}} \\ u_{\mathcal{U}} \end{array}\right]}_{g} \] with a small positive \(\delta\), where \(K\) denotes the unperturbed matrix from the previous system and \(\Delta K = \delta \operatorname{diag}(I, -I, -I)\).

To recover the solution of the unperturbed system, we can apply the following iterative refinement: \[ \hat{t}^{k+1} = (I - (K+\Delta K)^{-1}K)\hat{t}^k + (K+\Delta K)^{-1}g \] At a fixed point \(\hat{t}\), this gives \(K\hat{t} = g\), so the iteration converges to the solution of the original (unperturbed) polishing system while only ever factorizing the perturbed matrix.
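A minimal numpy/scipy sketch of this refinement follows, assuming the unperturbed matrix K and the perturbed matrix K_pert = K + ΔK have already been assembled as dense arrays; the function name and the fixed iteration count are illustrative.

import numpy as np
from scipy.linalg import lu_factor, lu_solve

def polish_refine(K, K_pert, g, num_iters=5):
    # factorize the perturbed (invertible) matrix once
    lu_piv = lu_factor(K_pert)
    t = lu_solve(lu_piv, g)                  # initial solve with K + Delta K
    for _ in range(num_iters):
        # t <- t + (K + Delta K)^{-1} (g - K t), i.e. the iteration above
        t = t + lu_solve(lu_piv, g - K @ t)
    return t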