This post summarizes material from the book Matrix Computations by Gene Golub and Charles Van Loan, together with some other references.
Singular Value Decomposition (SVD)
Theorem 1 (SVD) If \(A\) is a real \(m \times n\) matrix, then there exist orthogonal matrices \[ U = \begin{bmatrix} u_1 & \cdots & u_m \end{bmatrix}, \quad V = \begin{bmatrix} v_1 & \cdots & v_n \end{bmatrix} \] such that \[ U^T A V = \Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_p) \in \mathbb{R}^{m \times n},\quad p = \min(m, n), \] where \(\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_p \geq 0\) are the singular values of \(A\). The columns of \(U\) and \(V\) are the left and right singular vectors of \(A\), respectively.
Corollary 1 If \(U^T A V = \Sigma\) is the SVD of \(A\in\mathbb{R}^{m\times n}\) and \(m \geq n\), then for \(i=1:n\), \(Av_i = \sigma_i u_i\) and \(A^Tu_i = \sigma_i v_i\).
Corollary 2 If \(A\in\mathbb{R}^{m\times n}\), then \[ \|A\|_2 = \sigma_1, \quad \|A\|_F = \sqrt{\sigma_1^2 + \cdots + \sigma_p^2}. \]
Proof. This follows directly from the fact that orthogonal transformations preserve both the 2-norm and the Frobenius norm.
Corollary 3 (Subspaces) If \(A\) has \(r\) positive singular values, then \(\operatorname{rank}(A) = r\) and \[ \begin{aligned} \operatorname{range}(A) &= \operatorname{span}\{u_1, \ldots, u_r\}, \\ \operatorname{null}(A) &= \operatorname{span}\{v_{r+1}, \ldots, v_n\}, \\ \operatorname{range}(A^T) &= \operatorname{span}\{v_1, \ldots, v_r\}, \\ \operatorname{null}(A^T) &= \operatorname{span}\{u_{r+1}, \ldots, u_m\}. \end{aligned} \]
Corollary 4 (Symmetric Matrices) If \(A\) is symmetric with eigendecomposition \(A = Q\Lambda Q^T\), then its singular values are \(|\lambda_i|\); in particular, if \(A\) is also positive semidefinite, the eigendecomposition is itself an SVD, \(A = U \Sigma U^T\) with \(U = Q\). Trivially, \(\operatorname{range}(A) = \operatorname{range}(A^T)\) and \(\operatorname{null}(A) = \operatorname{null}(A^T)\).
The SVD of a symmetric matrix can be obtained from torch.linalg.eigh in PyTorch.
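A minimal sketch of this (using a random symmetric matrix as the example, not one from the book): take absolute values of the eigenvalues returned by torch.linalg.eigh and flip the signs of the corresponding right singular vectors.

```python
import torch

torch.manual_seed(0)
B = torch.randn(5, 5)
A = B + B.T                                   # symmetric, possibly indefinite

lam, Q = torch.linalg.eigh(A)                 # A = Q diag(lam) Q^T, lam ascending
sigma = lam.abs()                             # singular values are |eigenvalues|
order = torch.argsort(sigma, descending=True)
sigma, lam, Q = sigma[order], lam[order], Q[:, order]

U = Q                                         # left singular vectors
V = Q * torch.sign(lam)                       # flip columns with negative eigenvalues

print(torch.allclose(U @ torch.diag(sigma) @ V.T, A, atol=1e-5))  # True
```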
Corollary 5 (Connection to Eigenvalues) \[ A^TA = V \Sigma^T U^T U \Sigma V^T = V \operatorname{diag}(\sigma_1^2,\cdots,\sigma_n^2) V^T, \quad AA^T = U \Sigma V^T V \Sigma U^T = U \operatorname{diag}(\sigma_1^2,\cdots,\sigma_m^2) U^T \]
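A quick numerical check of this correspondence on a random example matrix:

```python
import torch

torch.manual_seed(0)
A = torch.randn(6, 4)

sigma = torch.linalg.svdvals(A)               # singular values, descending
lam = torch.linalg.eigvalsh(A.T @ A)          # eigenvalues of A^T A, ascending

# Eigenvalues of A^T A coincide with the squared singular values of A.
print(torch.allclose(lam, torch.sort(sigma**2).values, atol=1e-5))  # True
```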
Projections
\(P\in\mathbb{R}^{n\times n}\) is the orthogonal projection onto \(S\) if \(\operatorname{range}(P) = S\), \(P^2 = P\), and \(P^T = P\). It can be shown that the orthogonal projection onto a subspace is unique, and if the columns of \(V\) form an orthonormal basis of \(S\), then the unique orthogonal projection operator is \(V V^T\).
Suppose \(A\in\mathbb{R}^{m\times n}\) has SVD \(A = U\Sigma V^T\) and \(\operatorname{rank}(A) = r\). Then \(U\) and \(V\) can be partitioned as \[ U = \begin{bmatrix} U_1 & U_2 \end{bmatrix}, \quad V = \begin{bmatrix} V_1 & V_2 \end{bmatrix}, \] where \(U_1\in\mathbb{R}^{m\times r}\) and \(V_1\in\mathbb{R}^{n\times r}\).
Then
- \(V_1 V_1^T\) is the orthogonal projection onto \(\operatorname{range}(A^T)\).
- \(V_2 V_2^T\) is the orthogonal projection onto \(\operatorname{null}(A)\).
- \(U_1 U_1^T\) is the orthogonal projection onto \(\operatorname{range}(A)\).
- \(U_2 U_2^T\) is the orthogonal projection onto \(\operatorname{null}(A^T)\).
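A short sketch (a random rank-2 example, not from the book) that builds these projectors from the full SVD and sanity-checks two of them:

```python
import torch

torch.manual_seed(0)
A = torch.randn(5, 2) @ torch.randn(2, 4)     # rank 2, shape 5 x 4

U, S, Vh = torch.linalg.svd(A)                # full SVD: U is 5x5, Vh is 4x4
r = int((S > 1e-10).sum())                    # numerical rank
U1, U2 = U[:, :r], U[:, r:]
V1, V2 = Vh[:r].T, Vh[r:].T

P_range = U1 @ U1.T                           # projector onto range(A)
P_null = V2 @ V2.T                            # projector onto null(A)

print(torch.allclose(P_range @ A, A, atol=1e-5))                   # columns of A are fixed
print(torch.allclose(A @ P_null, torch.zeros_like(A), atol=1e-5))  # null(A) is annihilated
```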
Least Squares
Full Rank Case
- Full Column Rank
If \(A\in\mathbb{R}^{m\times n}\) has full column rank, the least squares problem for \(Ax = b\) is to find \(x\) that minimizes \(\|Ax - b\|_2\). This is needed because \(b\) may not lie in the range of \(A\); the minimizer sends \(Ax\) to the projection of \(b\) onto the range of \(A\). This is an unconstrained optimization problem \[ \min_x \frac{1}{2} \|Ax - b\|_2^2 \]
The first-order optimality condition is \[ A^T(Ax - b) = 0 \]
Because \(A\) has full column rank, \(A^TA\) is invertible and the solution is \[ x_{LS} = (A^TA)^{-1}A^Tb \]
Alternatively, the LS solution can be derived directly from the SVD: \[ \begin{aligned} \|Ax - b\|_2^2 & = \|U^T(Ax - b)\|_2^2 = \|\Sigma V^Tx - U^Tb\|_2^2 \\ & = \sum_{i=1}^n (\sigma_i v_i^Tx - u_i^Tb)^2 + \sum_{i=n+1}^m (u_i^Tb)^2 \end{aligned} \]
The minimizer is \[ x_{LS} = \sum_{i=1}^n \frac{u_i^Tb}{\sigma_i}v_i \]
Note that in this case \(A = U_1\Sigma_1 V^T\) and \(x_{LS} = V\Sigma_1^{-1}U_1^Tb\).
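The following sketch (with random example data and variable names of my own choosing) compares the normal equation solution, the thin-SVD formula, and torch.linalg.lstsq:

```python
import torch

torch.manual_seed(0)
A = torch.randn(8, 3)                         # full column rank with probability 1
b = torch.randn(8, 1)

x_normal = torch.linalg.solve(A.T @ A, A.T @ b)       # solve (A^T A) x = A^T b
x_lstsq = torch.linalg.lstsq(A, b).solution           # library least squares routine

U1, S, Vh = torch.linalg.svd(A, full_matrices=False)  # thin SVD: A = U1 Sigma1 V^T
x_svd = Vh.T @ ((U1.T @ b) / S[:, None])              # sum_i (u_i^T b / sigma_i) v_i

print(torch.allclose(x_normal, x_lstsq, atol=1e-5))   # True
print(torch.allclose(x_svd, x_lstsq, atol=1e-5))      # True
```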
- Full Row Rank
If \(A\in\mathbb{R}^{m\times n}\) has full row rank, then \(b\in\operatorname{range}(A)\), so \(Ax = b\) always has a solution (infinitely many when \(m < n\)). The minimum norm solution solves \[ \min_x \|x\|_2 \quad \text{s.t.} \quad Ax = b \]
Introducing the Lagrangian \(\mathcal{L}(x,\lambda) = \frac{1}{2}\|x\|_2^2 + \lambda^T(Ax - b)\), the first-order optimality conditions \(x + A^T\lambda = 0\) and \(Ax = b\) give \(\lambda = -(AA^T)^{-1}b\) and \(x^\star = A^T(AA^T)^{-1}b\).
We can indeed check that this \(x^\star\) has the minimum norm among all solutions \(x\) satisfying \(Ax = b\). To start, we have \(A(x - x^\star) = 0\). Then, \[ \begin{aligned} (x-x^\star)^Tx^\star &= (x-x^\star)^TA^T(AA^T)^{-1}b \\ & = (A(x-x^\star))^T(AA^T)^{-1}b = 0 \end{aligned} \]
Therefore, \((x-x^\star) \perp x^\star\). So, \[ \|x\|^2 = \|x^\star\|^2 + \|x - x^\star\|^2 \geq \|x^\star\|^2 \] which means that \(x^\star\) has the minimum norm of any solution.
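A small sketch (random underdetermined example) checking that \(x^\star = A^T(AA^T)^{-1}b\) agrees with the pseudo-inverse solution:

```python
import torch

torch.manual_seed(0)
A = torch.randn(3, 6)                         # full row rank with probability 1
b = torch.randn(3, 1)

x_star = A.T @ torch.linalg.solve(A @ A.T, b)     # x* = A^T (A A^T)^{-1} b
x_pinv = torch.linalg.pinv(A) @ b                 # minimum norm solution via pseudo-inverse

print(torch.allclose(A @ x_star, b, atol=1e-5))   # x* solves Ax = b
print(torch.allclose(x_star, x_pinv, atol=1e-5))  # and matches pinv(A) b
```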
Rank Deficient Case
When \(A\) has rank \(r < n\), the least squares solution is not unique: if \(x\) is a minimizer and \(z\) is in the null space of \(A\), then \(x + z\) is also a minimizer. In this case, \[ x_{LS} = \sum_{i=1}^r \frac{u_i^Tb}{\sigma_i}v_i = V_1 \Sigma_1^{-1} U_1^T b \] minimizes \(\|Ax - b\|_2\) and has the smallest 2-norm among all minimizers. The proof follows directly from the full rank case: \[ \begin{aligned} \|Ax - b\|_2^2 & = \|U^T(Ax - b)\|_2^2 = \|\Sigma V^Tx - U^Tb\|_2^2 \\ & = \sum_{i=1}^r (\sigma_i v_i^Tx - u_i^Tb)^2 + \sum_{i=r+1}^m (u_i^Tb)^2 \end{aligned} \]
Note that the minimum norm least squares solution must first be a least squares solution: among all minimizers of \(\|Ax - b\|_2\), it is the one with the smallest 2-norm.
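A sketch (random rank-2 example) of the truncated-SVD formula checked against torch.linalg.pinv:

```python
import torch

torch.manual_seed(0)
A = torch.randn(6, 2) @ torch.randn(2, 4)     # rank 2, shape 6 x 4
b = torch.randn(6, 1)

U, S, Vh = torch.linalg.svd(A, full_matrices=False)
r = int((S > 1e-10).sum())                    # numerical rank
x_ls = Vh[:r].T @ ((U[:, :r].T @ b) / S[:r, None])   # V1 Sigma1^{-1} U1^T b

print(torch.allclose(x_ls, torch.linalg.pinv(A) @ b, atol=1e-5))  # True
```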
Pseudo-Inverse
In general, let \(A\in\mathbb{R}^{m\times n}\) and \(A = U\Sigma V^T\) be the SVD of \(A\). The pseudo-inverse of \(A\) is defined as \(A^+ = V\Sigma^+U^T\) with \[ \Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_r, 0, \ldots, 0) \in \mathbb{R}^{m\times n}, \quad \Sigma^+ = \operatorname{diag}(\sigma_1^{-1}, \ldots, \sigma_r^{-1}, 0, \ldots, 0) \in \mathbb{R}^{n\times m}, \] and \(x_{LS} = A^+b\) is the minimum norm least squares solution to \(Ax = b\).
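A sketch (random rank-deficient example) that assembles \(A^+ = V\Sigma^+U^T\) explicitly and compares it with torch.linalg.pinv:

```python
import torch

torch.manual_seed(0)
A = torch.randn(5, 3) @ torch.randn(3, 4)     # rank 3, shape 5 x 4

U, S, Vh = torch.linalg.svd(A, full_matrices=False)
S_plus = torch.where(S > 1e-10, 1.0 / S, torch.zeros_like(S))  # invert nonzero singular values only
A_plus = Vh.T @ torch.diag(S_plus) @ U.T      # A^+ = V Sigma^+ U^T

print(torch.allclose(A_plus, torch.linalg.pinv(A), atol=1e-5))  # True
```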
- If \(\operatorname{rank}(A) = n\) (this implies \(m\geq n\)), then \(A^+ = (A^TA)^{-1}A^T\) is a left inverse of \(A\).
- If \(\operatorname{rank}(A) = m\) (this implies \(m\leq n\)), then \(A^+ = A^T(AA^T)^{-1}\) is a right inverse of \(A\).
- If \(\operatorname{rank}(A) = m = n\) (square matrix), then \(A^+ = A^{-1}\) is the inverse of \(A\).
Moreover, the pseudo-inverse can be used to find the projection operator:
- \(AA^+ = U_1 U_1^T\) is the orthogonal projection onto \(\operatorname{range}(A)\).
- \(A^+A = V_1 V_1^T\) is the orthogonal projection onto \(\operatorname{range}(A^T)\).
- \(I-AA^+ = U_2U_2^T\) is the orthogonal projection onto \(\operatorname{null}(A^T)\).
- \(I-A^+A = V_2V_2^T\) is the orthogonal projection onto \(\operatorname{null}(A)\).
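A final sketch (random rank-deficient example) confirming numerically that \(AA^+\) and \(A^+A\) act as the claimed orthogonal projectors:

```python
import torch

torch.manual_seed(0)
A = torch.randn(5, 2) @ torch.randn(2, 4)     # rank 2, shape 5 x 4
A_plus = torch.linalg.pinv(A)

P = A @ A_plus                                # should equal U1 U1^T
Q = A_plus @ A                                # should equal V1 V1^T

print(torch.allclose(P @ P, P, atol=1e-5), torch.allclose(P, P.T, atol=1e-5))  # idempotent, symmetric
print(torch.allclose(P @ A, A, atol=1e-5))    # P projects onto range(A)
print(torch.allclose(A @ Q, A, atol=1e-5))    # Q projects onto range(A^T)
```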