Deep Implicit Layers: Differentiable Convex Layer

Auto Differentiation
Implicit Function
Paper Read
Optimization
Author

Wangkun Xu

Published

June 11, 2024

Modified

September 28, 2024

Introduction

In this post, I summarize my learning notes on another type of implicit model, the differentiable convex layer. This layer has a formulation, forward pass, and backward pass similar to those of the fixed point layer, because its implicit function can be formulated from the KKT conditions, no matter which off-the-shelf optimization solver is used. This topic is related to the tangent and adjoint sensitivity analysis of nonlinear systems, as the KKT conditions form a nonlinear system in general. The main references are the paper Differentiable Convex Optimization Layers by Amos et al. (2019) and the blog: differentiable optimization.

Formulation of the Convex Optimization Layer

Consider the following optimization problem: \[ \begin{aligned} \underset{z}{\operatorname{minimize}} & f(z) \\ \text { subject to } & g(z) \leq 0 \\ & h(z)=0, \end{aligned} \]

To make the optimization problem convex, we assume that \(f(z)\) is convex, \(g(z)\) is convex, and \(h(z)\) is affine. The KKT condition of this optimization problem is \[ \begin{aligned} g\left(z^{\star}\right) & \leq 0 \\ h\left(z^{\star}\right) & =0 \\ \lambda^{\star} & \geq 0 \\ \lambda^{\star} \circ g\left(z^{\star}\right) & =0 \\ \nabla f\left(z^{\star}\right)+\sum_{i=1}^m \lambda_i^{\star} \nabla g_i\left(z^{\star}\right)+\sum_{i=1}^p \nu_i^{\star} \nabla h_i\left(z^{\star}\right) & =0 \end{aligned} \]

Similarly to the previous parametric setting, we can define the convex layer as \[ \begin{aligned} z^{\star}(x)=\underset{z}{\operatorname{argmin}} & f(z, x) \\ \text { subject to } & g(z, x) \leq 0 \\ & h(z, x)=0 \end{aligned} \] where \(z\) is the decision variable and \(x\) is the input of the layer or the parameter of the optimization problem. Apart from \(z^\star(x)\) (as an implicit function of \(x\)), we can also include the dual variables, \[ (z^\star, \lambda^\star, \nu^\star)(x) \]

The differentiable convex layer can be viewed as 1) a nonlinear implicit equation \(G(z,\lambda,\nu,x) = 0\) and 2) a fixed point iteration, since iterative optimization algorithms (e.g., interior point, SQP, ADMM, etc.) can be used to solve the optimization problem.

The equality conditions of the KKT system can be written as: \[ G(z, \lambda, \nu,x)=\left[\begin{array}{c} \frac{\partial}{\partial z} f(z, x)+\frac{\partial}{\partial z} g(z, x)^T \lambda+\frac{\partial}{\partial z} h(z, x)^T \nu \\ \lambda \circ g(z,x) \\ h(z,x) \end{array}\right] \] where \(\circ\) denotes the element-wise product.

Assume that the optimal primal and dual pair has been found by the forward pass. At the optimal point, we have \[ G(z^\star(x), \lambda^\star(x), \nu^\star(x), x) = 0 \] which is an implicit function of \(x\).
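
As a concrete illustration, the residual \(G\) can be evaluated with automatic differentiation. Below is a minimal PyTorch sketch, assuming user-supplied differentiable callables `f(z, x)` (scalar objective), `g(z, x)` (vector of inequality constraints), and `h(z, x)` (vector of equality constraints); the function names are placeholders, not an API from the paper.

```python
import torch

def kkt_residual(f, g, h, z, lam, nu, x):
    """Stack the stationarity, complementarity, and equality residuals G(z, lam, nu, x)."""
    z = z.detach().requires_grad_(True)
    # The gradient of the Lagrangian w.r.t. z gives the stationarity block
    # df/dz + (dg/dz)^T lam + (dh/dz)^T nu without forming the Jacobians explicitly.
    lagrangian = f(z, x) + lam @ g(z, x) + nu @ h(z, x)
    stationarity = torch.autograd.grad(lagrangian, z, create_graph=True)[0]
    return torch.cat([stationarity, lam * g(z, x), h(z, x)])
```

At an optimal primal-dual pair returned by a solver, this residual should be (numerically) zero.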

General Formulation

The KKT condition is exactly the implicit function defined in the deep equilibrium model (DEQ). (We will also show later that the interior-point method, which is essentially Newton’s method applied to the KKT system, can be used to solve the KKT conditions.)

Denote \(y(x) = (z^\star(x), \lambda^\star(x), \nu^\star(x))\); then we can directly apply the method from the nonlinear equation sensitivity analysis to compute the gradient of \(y(x)\) with respect to \(x\). Taking the total derivative with respect to \(x\): \[ \frac{d G}{d x} = \frac{\partial G}{\partial y}\frac{d y}{d x} + \frac{\partial G}{\partial x} = 0 \]

Therefore, the Jacobian of the decision \(y\) with respect to the input \(x\) is found as \[ \frac{d y}{d x} = -\left(\frac{\partial G}{\partial y}\right)^{-1} \frac{\partial G}{\partial x} \]
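
For small problems, this Jacobian can be formed directly with PyTorch's functional Jacobian utility and a dense linear solve. A sketch, assuming a hypothetical `G_fn(y, x)` that returns the stacked KKT residual for the stacked primal-dual vector `y`:

```python
import torch

def solution_jacobian(G_fn, y_star, x):
    """dy/dx = -(dG/dy)^{-1} dG/dx, evaluated at the solution y_star."""
    dG_dy = torch.autograd.functional.jacobian(lambda y: G_fn(y, x), y_star)
    dG_dx = torch.autograd.functional.jacobian(lambda x_: G_fn(y_star, x_), x)
    return -torch.linalg.solve(dG_dy, dG_dx)
```

Forming \(\frac{dy}{dx}\) explicitly is only sensible for low-dimensional \(x\); the adjoint approach below avoids it.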

In the backward pass, the gradient at the optimal point, \(\frac{\partial \ell}{\partial y}\), is backpropagated from the scalar loss \(\ell(\cdot)\). Note that when \(\ell(\cdot)\) does not depend on some of the primal or dual variables, the associated entries of this gradient are zero.

Consequently, \[ \frac{d \ell}{d x} = - \frac{\partial \ell}{\partial y} \left(\frac{\partial G}{\partial y}\right)^{-1} \frac{\partial G}{\partial x} \]

The same treatment as in the fixed point iteration can be used to compute the above gradient. In detail, the first two factors define the adjoint variable \(\dot{y}\) through the linear adjoint equation \[ \left(\frac{\partial G}{\partial y}\right)^T \dot{y} = -\left(\frac{\partial \ell}{\partial y}\right)^T \tag{1}\] so that \(\frac{d \ell}{d x} = \dot{y}^T \frac{\partial G}{\partial x}\).
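
In code, the backward pass then amounts to one linear solve with the transposed Jacobian followed by a vector-Jacobian product with \(\frac{\partial G}{\partial x}\). A dense sketch under the same assumptions as above, where `dl_dy` holds \((\frac{\partial \ell}{\partial y})^T\):

```python
import torch

def loss_gradient(G_fn, y_star, x, dl_dy):
    """Solve the adjoint equation (1) and return (dl/dx)^T."""
    dG_dy = torch.autograd.functional.jacobian(lambda y: G_fn(y, x), y_star)
    dG_dx = torch.autograd.functional.jacobian(lambda x_: G_fn(y_star, x_), x)
    y_dot = torch.linalg.solve(dG_dy.T, -dl_dy)   # (dG/dy)^T y_dot = -(dl/dy)^T
    return dG_dx.T @ y_dot                        # dl/dx = y_dot^T dG/dx
```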

Detailed Formulation

The partial Jacobian \(\frac{\partial G}{\partial y}\) can be analytically computed as \[ \frac{\partial}{\partial z, \lambda, \nu} G\left(z^{\star}, \lambda^{\star}, \nu^{\star}, x\right)= \left[\begin{array}{ccc} \frac{\partial^2}{\partial z^2} f\left(z^{\star}, x\right)+\sum_{i=1}^m \lambda_i^{\star} \frac{\partial^2}{\partial z^2} g_i\left(z^{\star}, x\right) & \frac{\partial}{\partial z} g\left(z^{\star}, x\right)^T & \frac{\partial}{\partial z} h\left(z^{\star}, x\right)^T \\ \operatorname{diag}\left(\lambda^{\star}\right) \frac{\partial}{\partial z} g\left(z^{\star}, x\right) & \operatorname{diag}\left(g\left(z^{\star}, x\right)\right) & 0 \\ \frac{\partial}{\partial z} h\left(z^{\star}, x\right) & 0 & 0 \end{array}\right] \]

Quadratic Programming Formulation

In the OptNet paper, the authors implement a differentiable QP layer. In this section, the explicit formulations are derived, following the same notation as in that paper.

Consider the following QP problem: \[ \begin{aligned} \min_z & \quad \frac{1}{2} z^T Q z + q^T z \\ \text{s.t.} & \quad A z = b \\ & \quad G z \leq h \end{aligned} \] where the parameters \(Q, q, A, b, G, h\) are functions of the input \(z_i\) (the notation \(z_i\) follows the paper, where it refers to the output of the previous layer). Our goal is to compute the gradient \(\frac{d \ell}{d z_i}\).

The KKT condition (only the equality part) of the above QP can be written as \[ \begin{aligned} Qz^\star + q + A^T \nu^\star + G^T \lambda^\star &= 0 \\ Az^\star - b &= 0 \\ D(\lambda^\star) (Gz^\star - h) &= 0 \end{aligned} \] where \(D(\lambda^\star)\) is a diagonal matrix with the diagonal elements being \(\lambda^\star\).

Differentiating the KKT equations (this is the TOTAL derivative!) with respect to the input \(z_i\) gives (it is assumed that all parameters depend on \(z_i\)): \[ \begin{aligned} \frac{dQ}{dz_i}z^\star + Q\frac{dz^\star}{dz_i} + \frac{dq}{dz_i} + \frac{dA^T}{dz_i}\nu^\star + A^T \frac{d\nu^\star}{dz_i} + \frac{dG^T}{dz_i}\lambda^\star + G^T \frac{d\lambda^\star}{dz_i} &= 0 \\ \frac{dA}{dz_i}z^\star + A\frac{dz^\star}{dz_i} - \frac{db}{dz_i} &= 0 \\ D(Gz^\star - h)\frac{d\lambda^\star}{dz_i} + D(\lambda^\star)\left(\frac{dG}{dz_i}z^\star + G\frac{dz^\star}{dz_i} - \frac{dh}{dz_i} \right) &=0 \end{aligned} \] which can be compactly written in matrix form as \[ \begin{bmatrix} Q & G^T & A^T \\ D(\lambda^\star)G & D(Gz^\star - h) & 0 \\ A & 0 & 0 \end{bmatrix} \begin{bmatrix} \frac{dz^\star}{dz_i} \\ \frac{d\lambda^\star}{dz_i} \\ \frac{d\nu^\star}{dz_i} \end{bmatrix} = -\begin{bmatrix} \frac{dQ}{dz_i}z^\star + \frac{dq}{dz_i} + \frac{dA^T}{dz_i}\nu^\star + \frac{dG^T}{dz_i}\lambda^\star \\ D(\lambda^\star)\frac{dG}{dz_i}z^\star - D(\lambda^\star)\frac{dh}{dz_i} \\ \frac{dA}{dz_i}z^\star - \frac{db}{dz_i} \end{bmatrix} \]

The paper also suggests using the adjoint method to compute the gradient of the loss with respect to the input \(z_i\). For example, in most cases \(\ell\) does not depend on \(\lambda\) and \(\nu\), so the adjoint equation (after transposing) is \[ \dot{y} = \left[\begin{array}{l} d_z \\ d_\lambda \\ d_\nu \end{array}\right]= - \left[\begin{array}{ccc} Q & G^T D\left(\lambda^{\star}\right) & A^T \\ G & D\left(G z^{\star}-h\right) & 0 \\ A & 0 & 0 \end{array}\right]^{-1}\left[\begin{array}{c} \left(\frac{\partial \ell}{\partial z^{\star}}\right)^T \\ 0 \\ 0 \end{array}\right] \]
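
The adjoint solve above can be written as a dense linear system, as in the sketch below (all names are placeholders and `dl_dz` holds \((\frac{\partial \ell}{\partial z^\star})^T\); the OptNet implementation instead reuses the factorization from its forward solver):

```python
import torch

def optnet_adjoint(Q, G, A, h, z_star, lam_star, dl_dz):
    """Solve the (transposed) KKT system for the adjoint variables (d_z, d_lam, d_nu)."""
    n, m, p = Q.shape[0], G.shape[0], A.shape[0]
    K = torch.zeros(n + m + p, n + m + p, dtype=Q.dtype)
    K[:n, :n] = Q
    K[:n, n:n + m] = G.T @ torch.diag(lam_star)
    K[:n, n + m:] = A.T
    K[n:n + m, :n] = G
    K[n:n + m, n:n + m] = torch.diag(G @ z_star - h)
    K[n + m:, :n] = A
    rhs = torch.cat([dl_dz, torch.zeros(m + p, dtype=Q.dtype)])
    d = -torch.linalg.solve(K, rhs)
    return d[:n], d[n:n + m], d[n + m:]
```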

Deriving eq. (8) in the paper requires matrix differentiation. For instance, we can obtain \(\frac{d\ell}{dA} = \frac{d\ell}{dz} \frac{dz}{dA}\) by differentiating the KKT conditions with respect to \(A\).

Efficient Forward and Backward Pass

Today’s state-of-the-art QP solvers like Gurobi and CPLEX do not have the capability of solving multiple optimization problems on the GPU in parallel across the entire minibatch.

The paper develops a GPU-based primal-dual interior point method (PDIPM). It considers the following QP with slack variables \(s\): \[ \begin{aligned} \min_{z, s} & \quad \frac{1}{2} z^T Q z + q^T z \\ \text{s.t.} & \quad A z = b \\ & \quad G z + s = h \\ & \quad s \geq 0 \end{aligned} \]

Then the KKT condition can be written as \[ \begin{aligned} Qz + q + A^T \nu + G^T \lambda &= 0 \\ Az - b &= 0 \\ Gz + s - h &= 0 \\ \lambda \circ s &= 0 \end{aligned} \]

The last condition is exactly complementary slackness.

Newton’s method can be used to solve the above system; with some modifications, it becomes the PDIPM.

Taking the first-order expansion of the KKT conditions, we have \[ K\left[\begin{array}{c} \Delta z^{\text {aff }} \\ \Delta s^{\text {aff }} \\ \Delta \lambda^{\text {aff }} \\ \Delta \nu^{\text {aff }} \end{array}\right]=\left[\begin{array}{c} -\left(A^T \nu+G^T \lambda+Q z+q\right) \\ -S \lambda \\ -(G z+s-h) \\ -(A z-b) \end{array}\right] \] where \(S = D(s)\) and \[ K=\left[\begin{array}{cccc} Q & 0 & G^T & A^T \\ 0 & D(\lambda) & D(s) & 0 \\ G & I & 0 & 0 \\ A & 0 & 0 & 0 \end{array}\right] \]
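
The affine direction can be computed by assembling \(K\) and the residual vector and solving the linear system. A dense, unbatched sketch with placeholder names (the paper's GPU solver instead works with the symmetrized, pre-factorized form described below):

```python
import torch

def affine_direction(Q, q, G, h, A, b, z, s, lam, nu):
    """One Newton (affine) direction for the slack-form KKT system, ordered (dz, ds, dlam, dnu)."""
    n, m, p = Q.shape[0], G.shape[0], A.shape[0]
    N = n + 2 * m + p
    K = torch.zeros(N, N, dtype=Q.dtype)
    K[:n, :n] = Q
    K[:n, n + m:n + 2 * m] = G.T
    K[:n, n + 2 * m:] = A.T
    K[n:n + m, n:n + m] = torch.diag(lam)
    K[n:n + m, n + m:n + 2 * m] = torch.diag(s)
    K[n + m:n + 2 * m, :n] = G
    K[n + m:n + 2 * m, n:n + m] = torch.eye(m, dtype=Q.dtype)
    K[n + 2 * m:, :n] = A
    rhs = -torch.cat([Q @ z + q + G.T @ lam + A.T @ nu,
                      s * lam,
                      G @ z + s - h,
                      A @ z - b])
    step = torch.linalg.solve(K, rhs)
    return step[:n], step[n:n + m], step[n + m:n + 2 * m], step[n + 2 * m:]
```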

The solution (with some modifications) becomes the update direction.

There are several modifications that can be made to improve the efficiency of the PDIPM. First, a symmetric KKT matrix, which is easier to solve, can be obtained by scaling the second block equation by \(D(1/s)\). The second block row then becomes \[ D(\lambda/s) \Delta s^{\text {aff }} + \Delta \lambda^{\text {aff }} = - \lambda \] and \[ K_{sym} = \left[\begin{array}{cccc} Q & 0 & G^T & A^T \\ 0 & D(\lambda/s) & I & 0 \\ G & I & 0 & 0 \\ A & 0 & 0 & 0 \end{array}\right] \] which is symmetric.

Second, since the PDIPM requires solving a linear system at each iteration, the paper uses a block factorization technique and pre-factorizes the portions that do not change across iterations.

Third, because the KKT conditions define a nonlinear implicit function, the backward pass can be obtained from the implicit function theorem. The paper also observes that the matrix that needs to be inverted when solving the adjoint equation has already been computed and factorized during the forward pass by the interior point method. Therefore, the computational cost of backpropagation is ‘almost free’. This finding is similar to the fixed point iteration, where the Jacobian is already computed during the forward pass when Newton’s method is used.

Steps

Similar to the fixed point iteration, the steps for implementing the differentiable convex layer are:

  1. Outside the gradient tape, solve the optimization problem to find the optimal primal and dual variables \((z^\star, \lambda^\star, \nu^\star)\) using any off-the-shelf optimization solver, such as interior point, SQP, ADMM, etc.
  2. Inside the gradient tape, engage the input \(x\) in the computation graph via \((z,\lambda,\nu):= (z^\star,\lambda^\star,\nu^\star) - G(z^\star,\lambda^\star,\nu^\star,x)\). This provides the gradient \(-\frac{\partial G}{\partial x}\) in the computation graph.
  3. Register the gradient of \((z,\lambda,\nu)\) as the solution of the linear adjoint system, Equation 1; a PyTorch sketch of these steps is given below.
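
These steps can be wired together with a detached forward solve and a gradient hook, in the same spirit as the fixed point layer. A minimal PyTorch sketch, assuming hypothetical user-supplied callables `solver(x)` (returns the stacked optimal \((z^\star, \lambda^\star, \nu^\star)\) as one tensor) and `G_fn(y, x)` (the KKT residual):

```python
import torch

class ConvexLayer(torch.nn.Module):
    """Sketch of the three steps above; `solver` and `G_fn` are assumed user-supplied."""

    def __init__(self, solver, G_fn):
        super().__init__()
        self.solver, self.G_fn = solver, G_fn

    def forward(self, x):
        # Step 1: solve the optimization problem outside the gradient tape.
        with torch.no_grad():
            y_star = self.solver(x)
        # Step 2: re-engage x in the graph. The value is unchanged since G(y*, x) = 0
        # at the optimum, but the backward pass now sees -dG/dx.
        y = y_star - self.G_fn(y_star, x)
        # Step 3: correct the incoming gradient by solving the adjoint (KKT) system.
        # The minus sign in Equation 1 is already supplied by the minus in Step 2.
        if y.requires_grad:
            J = torch.autograd.functional.jacobian(
                lambda v: self.G_fn(v, x.detach()), y_star)
            y.register_hook(lambda grad: torch.linalg.solve(J.T, grad))
        return y
```

The linear solve inside the hook is exactly where a factorization already produced by the forward solver could be reused, which is what makes the OptNet backward pass nearly free.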

PyTorch Implementation

An example code snippet, including an implementation of OptNet and a comparison with CvxpyLayers, can be found at my-github.
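
For reference, a minimal CvxpyLayers example with an illustrative parametric problem (the problem data and shapes below are made up for the demo):

```python
import cvxpy as cp
import torch
from cvxpylayers.torch import CvxpyLayer

# A toy parametric QP: minimize 0.5*||Q_sqrt z||^2 + q^T z over the simplex.
n = 3
Q_sqrt = cp.Parameter((n, n))
q = cp.Parameter(n)
z = cp.Variable(n)
objective = cp.Minimize(0.5 * cp.sum_squares(Q_sqrt @ z) + q @ z)
problem = cp.Problem(objective, [z >= 0, cp.sum(z) == 1])
layer = CvxpyLayer(problem, parameters=[Q_sqrt, q], variables=[z])

Q_sqrt_t = torch.randn(n, n, requires_grad=True)
q_t = torch.randn(n, requires_grad=True)
z_star, = layer(Q_sqrt_t, q_t)   # differentiable optimal solution
z_star.sum().backward()          # gradients flow back to Q_sqrt_t and q_t
```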