5. Singular Value Decomposition (SVD)#
5.1. Overview#
The singular value decomposition (SVD) is a work-horse in applications of least squares projection that form foundations for many statistical and machine learning methods.
After defining the SVD, we’ll describe how it connects to
four fundamental spaces of linear algebra
under-determined and over-determined least squares regressions
principal components analysis (PCA)
Like principal components analysis (PCA), DMD can be thought of as a data-reduction procedure that represents salient patterns by projecting data onto a limited set of factors.
In a sequel to this lecture about Dynamic Mode Decompositions, we’ll describe how SVD’s provide ways rapidly to compute reduced-order approximations to first-order Vector Autoregressions (VARs).
5.2. The Setting#
Let
Necessarily,
In much of this lecture, we’ll think of
each column is an individual – a time period or person, depending on the application
each row is a random variable describing an attribute of a time period or a person, depending on the application
We’ll be interested in two situations
A short and fat case in which
, so that there are many more columns (individuals) than rows (attributes).A tall and skinny case in which
, so that there are many more rows (attributes) than columns (individuals).
We’ll apply a singular value decomposition of
In the
In this
In the
We’ll again use a singular value decomposition, but now to construct a dynamic mode decomposition (DMD)
5.3. Singular Value Decomposition#
A singular value decomposition of an
where
and
is an orthogonal matrix of left singular vectors ofColumns of
are eigenvectors of is an orthogonal matrix of right singular vectors ofColumns of
are eigenvectors of is an matrix in which the first places on its main diagonal are positive numbers called singular values; remaining entries of are all zeroThe
singular values are positive square roots of the eigenvalues of the matrix and also of the matrixWe adopt a convention that when
is a complex valued matrix, denotes the conjugate-transpose or Hermitian-transpose of , meaning that is the complex conjugate of .Similarly, when
is a complex valued matrix, denotes the conjugate-transpose or Hermitian-transpose of
The matrices
multiplying vectors by the unitary matrices
and rotates them, but leaves angles between vectors and lengths of vectors unchanged.multiplying vectors by the diagonal matrix
leaves angles between vectors unchanged but rescales vectors.
Thus, representation (5.1) asserts that multiplying an
rotating
by computingrescaling
by multiplying it byrotating
by multiplying it by
This structure of the
Thus,
is an encoder is an operator to be applied to the encoded data is a decoder to be applied to the output from applying operator to the encoded data
We’ll apply this circle of ideas later in this lecture when we study Dynamic Mode Decomposition.
Road Ahead
What we have described above is called a full SVD.
In a full SVD, the shapes of
Later we’ll also describe an economy or reduced SVD.
Before we study a reduced SVD we’ll say a little more about properties of a full SVD.
5.4. Four Fundamental Subspaces#
Let
Let’s start by recalling the four fundamental subspaces of an
The column space of
, denoted , is the span of the columns of , i.e., all vectors that can be written as linear combinations of columns of . Its dimension is .The null space of
, denoted consists of all vectors that satisfy . Its dimension is .The row space of
, denoted is the column space of . It consists of all vectors that can be written as linear combinations of rows of . Its dimension is .The left null space of
, denoted , consist of all vectors such that . Its dimension is .
For a full SVD of a matrix
They form two pairs of orthogonal subspaces that we’ll describe now.
Let
Let’s write the full SVD of X as
where
Representation (5.2) implies that
or
or
Equations (5.4) tell how the transformation
Equations (5.3) assert that
Taking transposes on both sides of representation (5.2) implies
or
or
Notice how equations (5.6) assert that the transformation
Equations (5.5) assert that
Thus, taken together, the systems of equations (5.3) and (5.5)
describe the four fundamental subspaces of
Since
is an orthonormal basis for the column space of is an orthonormal basis for the null space of is an orthonormal basis for the row space of is an orthonormal basis for the null space of
We have verified the four claims in (5.7) simply by performing the multiplications called for by the right side of (5.2) and reading them.
The claims in (5.7) and the fact that
the column space of
is orthogonal to the null space ofthe null space of
is orthogonal to the row space of
Sometimes these properties are described with the following two pairs of orthogonal complement subspaces:
is the orthogonal complement of is the orthogonal complement
Let’s do an example.
import numpy as np
import numpy.linalg as LA
import matplotlib.pyplot as plt
Having imported these modules, let’s do the example.
np.set_printoptions(precision=2)
# Define the matrix
A = np.array([[1, 2, 3, 4, 5],
[2, 3, 4, 5, 6],
[3, 4, 5, 6, 7],
[4, 5, 6, 7, 8],
[5, 6, 7, 8, 9]])
# Compute the SVD of the matrix
U, S, V = np.linalg.svd(A,full_matrices=True)
# Compute the rank of the matrix
rank = np.linalg.matrix_rank(A)
# Print the rank of the matrix
print("Rank of matrix:\n", rank)
print("S: \n", S)
# Compute the four fundamental subspaces
row_space = U[:, :rank]
col_space = V[:, :rank]
null_space = V[:, rank:]
left_null_space = U[:, rank:]
print("U:\n", U)
print("Column space:\n", col_space)
print("Left null space:\n", left_null_space)
print("V.T:\n", V.T)
print("Row space:\n", row_space.T)
print("Right null space:\n", null_space.T)
Rank of matrix:
2
S:
[2.69e+01 1.86e+00 1.20e-15 2.24e-16 5.82e-17]
U:
[[-0.27 -0.73 0.63 -0.06 0.06]
[-0.35 -0.42 -0.69 -0.45 0.12]
[-0.43 -0.11 -0.24 0.85 0.12]
[-0.51 0.19 0.06 -0.1 -0.83]
[-0.59 0.5 0.25 -0.24 0.53]]
Column space:
[[-0.27 -0.35]
[ 0.73 0.42]
[ 0.32 -0.65]
[ 0.54 -0.39]
[-0.06 -0.35]]
Left null space:
[[ 0.63 -0.06 0.06]
[-0.69 -0.45 0.12]
[-0.24 0.85 0.12]
[ 0.06 -0.1 -0.83]
[ 0.25 -0.24 0.53]]
V.T:
[[-0.27 0.73 0.32 0.54 -0.06]
[-0.35 0.42 -0.65 -0.39 -0.35]
[-0.43 0.11 0.02 -0.29 0.85]
[-0.51 -0.19 0.61 -0.41 -0.4 ]
[-0.59 -0.5 -0.31 0.55 -0.04]]
Row space:
[[-0.27 -0.35 -0.43 -0.51 -0.59]
[-0.73 -0.42 -0.11 0.19 0.5 ]]
Right null space:
[[-0.43 0.11 0.02 -0.29 0.85]
[-0.51 -0.19 0.61 -0.41 -0.4 ]
[-0.59 -0.5 -0.31 0.55 -0.04]]
5.5. Eckart-Young Theorem#
Suppose that we want to construct the best rank
By best, we mean a matrix
where
Three popular matrix norms of an
the spectral or
normthe Frobenius norm
the nuclear norm
The Eckart-Young theorem states that for each of these three norms, same rank
This is a very powerful theorem that says that we can take our
Moreover, if some of these
We’ll say more about this later when we present Principal Component Analysis.
You can read about the Eckart-Young theorem and some of its uses here.
We’ll make use of this theorem when we discuss principal components analysis (PCA) and also dynamic mode decomposition (DMD).
5.6. Full and Reduced SVD’s#
Up to now we have described properties of a full SVD in which shapes of
There is an alternative bookkeeping convention called an economy or reduced SVD in which the shapes of
Thus, note that because we assume that
A reduced SVD uses this fact to express
You can read about reduced and full SVD here https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html
For a full SVD,
But not all these properties hold for a reduced SVD.
Which properties hold depend on whether we are in a tall-skinny case or a short-fat case.
In a tall-skinny case in which
, for a reduced SVD
In a short-fat case in which
, for a reduced SVD
When we study Dynamic Mode Decomposition below, we shall want to remember these properties when we use a reduced SVD to compute some DMD representations.
Let’s do an exercise to compare full and reduced SVD’s.
To review,
in a full SVD
is is is
in a reduced SVD
is is is
First, let’s study a case in which
(This is a small example of the tall-skinny case that will concern us when we study Dynamic Mode Decompositions below.)
import numpy as np
X = np.random.rand(5,2)
U, S, V = np.linalg.svd(X,full_matrices=True) # full SVD
Uhat, Shat, Vhat = np.linalg.svd(X,full_matrices=False) # economy SVD
print('U, S, V =')
U, S, V
U, S, V =
(array([[-0.59, 0.32, -0.68, -0.22, 0.17],
[-0.47, -0.35, 0.26, -0.51, -0.58],
[-0.42, 0.57, 0.67, 0.07, 0.24],
[-0.47, -0.33, -0.04, 0.8 , -0.14],
[-0.18, -0.59, 0.14, -0.2 , 0.75]]),
array([1.79, 0.45]),
array([[-0.88, -0.47],
[ 0.47, -0.88]]))
print('Uhat, Shat, Vhat = ')
Uhat, Shat, Vhat
Uhat, Shat, Vhat =
(array([[-0.59, 0.32],
[-0.47, -0.35],
[-0.42, 0.57],
[-0.47, -0.33],
[-0.18, -0.59]]),
array([1.79, 0.45]),
array([[-0.88, -0.47],
[ 0.47, -0.88]]))
rr = np.linalg.matrix_rank(X)
print(f'rank of X = {rr}')
rank of X = 2
Properties:
Where
is constructed via a full SVD, andWhere
is constructed via a reduced SVD, although , it happens that
We illustrate these properties for our example with the following code cells.
UTU = U.T@U
UUT = U@U.T
print('UUT, UTU = ')
UUT, UTU
UUT, UTU =
(array([[ 1.00e+00, 1.88e-16, 2.23e-16, 1.65e-16, -5.60e-18],
[ 1.88e-16, 1.00e+00, 3.96e-17, 1.61e-16, -3.12e-17],
[ 2.23e-16, 3.96e-17, 1.00e+00, 3.36e-17, -6.92e-17],
[ 1.65e-16, 1.61e-16, 3.36e-17, 1.00e+00, 7.54e-17],
[-5.60e-18, -3.12e-17, -6.92e-17, 7.54e-17, 1.00e+00]]),
array([[ 1.00e+00, -9.71e-17, 1.62e-16, 7.46e-17, -6.64e-17],
[-9.71e-17, 1.00e+00, -2.53e-17, -3.18e-17, -1.39e-16],
[ 1.62e-16, -2.53e-17, 1.00e+00, 5.77e-17, 2.66e-17],
[ 7.46e-17, -3.18e-17, 5.77e-17, 1.00e+00, -4.94e-17],
[-6.64e-17, -1.39e-16, 2.66e-17, -4.94e-17, 1.00e+00]]))
UhatUhatT = Uhat@Uhat.T
UhatTUhat = Uhat.T@Uhat
print('UhatUhatT, UhatTUhat= ')
UhatUhatT, UhatTUhat
UhatUhatT, UhatTUhat=
(array([[ 0.45, 0.17, 0.43, 0.17, -0.08],
[ 0.17, 0.34, -0. , 0.34, 0.29],
[ 0.43, -0. , 0.5 , 0.01, -0.26],
[ 0.17, 0.34, 0.01, 0.33, 0.28],
[-0.08, 0.29, -0.26, 0.28, 0.38]]),
array([[ 1.00e+00, -9.71e-17],
[-9.71e-17, 1.00e+00]]))
Remarks:
The cells above illustrate the application of the full_matrices=True
and full_matrices=False
options.
Using full_matrices=False
returns a reduced singular value decomposition.
The full and reduced SVD’s both accurately decompose an
When we study Dynamic Mode Decompositions below, it will be important for us to remember the preceding properties of full and reduced SVD’s in such tall-skinny cases.
Now let’s turn to a short-fat case.
To illustrate this case, we’ll set
import numpy as np
X = np.random.rand(2,5)
U, S, V = np.linalg.svd(X,full_matrices=True) # full SVD
Uhat, Shat, Vhat = np.linalg.svd(X,full_matrices=False) # economy SVD
print('U, S, V = ')
U, S, V
U, S, V =
(array([[-0.61, 0.79],
[-0.79, -0.61]]),
array([2.03, 0.58]),
array([[-0.61, -0.57, -0.37, -0.31, -0.26],
[ 0.16, -0.09, -0.66, 0.02, 0.73],
[ 0.14, -0.7 , 0.58, -0.02, 0.4 ],
[-0.27, -0.16, -0.07, 0.95, -0.05],
[-0.72, 0.39, 0.31, -0.09, 0.48]]))
print('Uhat, Shat, Vhat = ')
Uhat, Shat, Vhat
Uhat, Shat, Vhat =
(array([[-0.61, 0.79],
[-0.79, -0.61]]),
array([2.03, 0.58]),
array([[-0.61, -0.57, -0.37, -0.31, -0.26],
[ 0.16, -0.09, -0.66, 0.02, 0.73]]))
Let’s verify that our reduced SVD accurately represents
SShat=np.diag(Shat)
np.allclose(X, Uhat@SShat@Vhat)
True
5.7. Polar Decomposition#
A reduced singular value decomposition (SVD) of
where
Here
is an symmetric matrix is an orthogonal matrix
and in our reduced SVD
is an orthonormal matrix is a diagonal matrix is an orthonormal
5.8. Application: Principal Components Analysis (PCA)#
Let’s begin with a case in which
The matrix
We regard
where for
In a time series setting, we would think of columns
In a cross-section setting, we would think of columns
As we have seen before, the SVD is a way to decompose a matrix into useful components, just like polar decomposition, eigendecomposition, and many others.
PCA, on the other hand, is a method that builds on the SVD to analyze data. The goal is to apply certain steps, to help better visualize patterns in data, using statistical tools to capture the most important patterns in data.
Step 1: Standardize the data:
Because our data matrix may hold variables of different units and scales, we first need to standardize the data.
First by computing the average of each row of
We then create an average matrix out of these means:
And subtract out of the original matrix to create a mean centered matrix:
Step 2: Compute the covariance matrix:
Then because we want to extract the relationships between variables rather than just their magnitude, in other words, we want to know how they can explain each other, we compute the covariance matrix of
Step 3: Decompose the covariance matrix and arrange the singular values:
Since the matrix
The eigendecomposition of
We can then rearrange the columns in the matrices
Step 4: Select singular values, (optional) truncate the rest:
We can now decide how many singular values to pick, based on how much variance you want to retain. (e.g., retaining 95% of the total variance).
We can obtain the percentage by calculating the variance contained in the leading
Step 5: Create the Score Matrix:
5.9. Relationship of PCA to SVD#
To relate an SVD to a PCA of data set
Let’s assume that sample means of all variables are zero, so we don’t need to standardize our matrix.
where
In equation (5.9), each of the
Thus, we have
Here is how we would interpret the objects in the matrix equation (5.10) in a time series context:
, the object is a time series for the th principal component is a vector of loadings of variables on the th principal component, for each is the strength of th principal component, where strength means contribution to the overall covariance of .
5.10. PCA with Eigenvalues and Eigenvectors#
We now use an eigen decomposition of a sample covariance matrix to do PCA.
Let
Let’s assume that sample means of all variables are zero.
We can assure this by pre-processing the data by subtracting sample means.
Define a sample covariance matrix
Then use an eigen decomposition to represent
Here
is matrix of eigenvectors of is a diagonal matrix of eigenvalues of
We can then represent
where
and
We can verify that
It follows that we can represent the data matrix
To reconcile the preceding representation with the PCA that we had obtained earlier through the SVD, we first note that
Now define
Therefore
which agrees with
provided that we set
(a vector of loadings of variables on principal component ) (the th principal component)
Because there are alternative algorithms for computing
We can resolve such ambiguities about
sorting eigenvalues and singular values in descending order
imposing positive diagonals on
and and adjusting signs in accordingly
5.11. Connections#
To pull things together, it is useful to assemble and compare some formulas presented above.
First, consider an SVD of an
Compute:
Compare representation (5.12) with equation (5.11) above.
Evidently,
Second, let’s compute
Thus, the matrix
Summarizing and fitting things together, we have the eigen decomposition of the sample covariance matrix
where
Further, from the SVD of
where
Thus,
It follows that
Note that the preceding implies that
so that everything fits together.
Below we define a class DecomAnalysis
that wraps PCA and SVD for a given a data matrix X
.
class DecomAnalysis:
"""
A class for conducting PCA and SVD.
X: data matrix
r_component: chosen rank for best approximation
"""
def __init__(self, X, r_component=None):
self.X = X
self.Ω = (X @ X.T)
self.m, self.n = X.shape
self.r = LA.matrix_rank(X)
if r_component:
self.r_component = r_component
else:
self.r_component = self.m
def pca(self):
𝜆, P = LA.eigh(self.Ω) # columns of P are eigenvectors
ind = sorted(range(𝜆.size), key=lambda x: 𝜆[x], reverse=True)
# sort by eigenvalues
self.𝜆 = 𝜆[ind]
P = P[:, ind]
self.P = P @ diag_sign(P)
self.Λ = np.diag(self.𝜆)
self.explained_ratio_pca = np.cumsum(self.𝜆) / self.𝜆.sum()
# compute the N by T matrix of principal components
self.𝜖 = self.P.T @ self.X
P = self.P[:, :self.r_component]
𝜖 = self.𝜖[:self.r_component, :]
# transform data
self.X_pca = P @ 𝜖
def svd(self):
U, 𝜎, VT = LA.svd(self.X)
ind = sorted(range(𝜎.size), key=lambda x: 𝜎[x], reverse=True)
# sort by eigenvalues
d = min(self.m, self.n)
self.𝜎 = 𝜎[ind]
U = U[:, ind]
D = diag_sign(U)
self.U = U @ D
VT[:d, :] = D @ VT[ind, :]
self.VT = VT
self.Σ = np.zeros((self.m, self.n))
self.Σ[:d, :d] = np.diag(self.𝜎)
𝜎_sq = self.𝜎 ** 2
self.explained_ratio_svd = np.cumsum(𝜎_sq) / 𝜎_sq.sum()
# slicing matrices by the number of components to use
U = self.U[:, :self.r_component]
Σ = self.Σ[:self.r_component, :self.r_component]
VT = self.VT[:self.r_component, :]
# transform data
self.X_svd = U @ Σ @ VT
def fit(self, r_component):
# pca
P = self.P[:, :r_component]
𝜖 = self.𝜖[:r_component, :]
# transform data
self.X_pca = P @ 𝜖
# svd
U = self.U[:, :r_component]
Σ = self.Σ[:r_component, :r_component]
VT = self.VT[:r_component, :]
# transform data
self.X_svd = U @ Σ @ VT
def diag_sign(A):
"Compute the signs of the diagonal of matrix A"
D = np.diag(np.sign(np.diag(A)))
return D
We also define a function that prints out information so that we can compare decompositions obtained by different algorithms.
def compare_pca_svd(da):
"""
Compare the outcomes of PCA and SVD.
"""
da.pca()
da.svd()
print('Eigenvalues and Singular values\n')
print(f'λ = {da.λ}\n')
print(f'σ^2 = {da.σ**2}\n')
print('\n')
# loading matrices
fig, axs = plt.subplots(1, 2, figsize=(14, 5))
plt.suptitle('loadings')
axs[0].plot(da.P.T)
axs[0].set_title('P')
axs[0].set_xlabel('m')
axs[1].plot(da.U.T)
axs[1].set_title('U')
axs[1].set_xlabel('m')
plt.show()
# principal components
fig, axs = plt.subplots(1, 2, figsize=(14, 5))
plt.suptitle('principal components')
axs[0].plot(da.ε.T)
axs[0].set_title('ε')
axs[0].set_xlabel('n')
axs[1].plot(da.VT[:da.r, :].T * np.sqrt(da.λ))
axs[1].set_title('$V^\top *\sqrt{\lambda}$')
axs[1].set_xlabel('n')
plt.show()
<>:32: SyntaxWarning: invalid escape sequence '\s'
<>:32: SyntaxWarning: invalid escape sequence '\s'
/tmp/ipykernel_33443/4223624786.py:32: SyntaxWarning: invalid escape sequence '\s'
axs[1].set_title('$V^\top *\sqrt{\lambda}$')
5.12. Exercises#
Exercise 5.1
In Ordinary Least Squares (OLS), we learn to compute
In these cases, the
What we can do instead is to create what is called a pseudoinverse, a full rank approximation of the inverted matrix so we can compute
Thinking in terms of the Eckart-Young theorem, build the pseudoinverse matrix
Solution to Exercise 5.1
We can use SVD to compute the pseudoinverse:
inverting
where:
and finally:
For an example PCA applied to analyzing the structure of intelligence tests see this lecture Multivariable Normal Distribution.
Look at parts of that lecture that describe and illustrate the classic factor analysis model.
As mentioned earlier, in a sequel to this lecture about Dynamic Mode Decompositions, we’ll describe how SVD’s provide ways rapidly to compute reduced-order approximations to first-order Vector Autoregressions (VARs).