# functions for numerical analysis
source("https://drive.google.com/uc?export=download&id=10dNH3VbvxS8Z3OHjP4i9gRbtsf91VVBb")
# load the data
source("https://drive.google.com/uc?export=download&id=14EU7jxS8LcQ29WOvzAevXigl2wloccqF")
require(expm)
require(pracma)
The following project develops several machine learning approaches for recognizing handwritten digits (0-9) using Least Squares and the SVD. Here is a description of the data we will be working with:
train_data
: a \(256 \times 4649\) matrix where each column is a \(16 \times 16\) image that has been flattened into a \(256 \times 1\) vector. The data are sorted, starting with the images corresponding to 0, then 1, then 2, up to 9.

indices
: a \(10 \times 1\) vector giving the column index of train_data at which each new digit starts.

train_data_1000
: a \(256 \times 1000\) matrix storing 100 samples of each digit in its columns, sorted in order from 0 to 9.

test_data
: a \(256 \times 4649\) matrix whose columns are the images for which we would like to predict the digit.

test_label
: a \(4649\)-dimensional vector giving the true label of each column of test_data (this allows us to measure how accurate our predictions are).
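As a quick check that the loaded objects have the shapes described above, here is a small sketch; it uses only the names defined by the sourced data file.
# confirm the dimensions of the loaded objects
dim(train_data)        # expected: 256 4649
length(indices)        # expected: 10
dim(train_data_1000)   # expected: 256 1000
dim(test_data)         # expected: 256 4649
length(test_label)     # expected: 4649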
Note that the reader is assumed to be familiar with common techniques of numerical analysis and computational linear algebra, including the Normal Equations, Least Squares, the QR decomposition and its variants, and the Singular Value Decomposition (SVD). For background on these topics, Timothy Sauer’s “Numerical Analysis” is recommended.
To investigate the data a bit, we start by extracting the columns of train_data corresponding to the first three values in indices. Then we reshape each column into a \(16 \times 16\) matrix and use the imPlot command imported above to view an image of the transpose of each matrix. The result is images of the digits 0, 1 and 2.
# save the three matrices as img0, img1, and img2
img0 <- matrix(train_data[,indices[1]], nrow = 16)
img1 <- matrix(train_data[,indices[2]], nrow = 16)
img2 <- matrix(train_data[,indices[3]], nrow = 16)
# plot the images
imPlot(t(img0))
imPlot(t(img1))
imPlot(t(img2))
First we explore the method of least squares for classification. For each digit between 0 and 9, we will set up a system of equations
\[ A_0x=b, \quad A_1x=b, \quad \ldots, \quad A_9 x = b, \]
where \(A_i\) corresponds to the flattened images of the \(i\)th digit in the training data. Then, for each new flattened image \(b\) in test_data to be classified, we solve each system \(A_ix=b\) in the least-squares sense to find \(x^*\), form \(\hat{b}=A_ix^*\), and compute the error as the norm of the residual \(b - \hat{b}\). The target vector \(b\) is classified as the digit \(i\) whose system gives the smallest error.
There are a few important notes regarding how we should use least squares in this context. First, we cannot build the systems from the full training data, since the point of employing the method is to distinguish or characterize the various digits. Second, note that each matrix \(A_i\) built from the full training data would have more columns than rows, so each system \(A_ix = b\) would be underdetermined, with more unknowns than equations. Provided such an \(A_i\) has full row rank, its columns span all of \(\mathbb{R}^{256}\); any target vector \(b\), no matter which digit it depicts, can then be matched essentially exactly, and the least-squares residual is always close to zero. In other words, the residual would give us no means of distinguishing the digits. This is why we instead build each \(A_i\) from a fixed sample of 100 images per digit, so that every system is overdetermined (256 equations, 100 unknowns) and the size of the residual becomes meaningful. To make the dimensions explicit, consider the following code chunk:
diffs <- c()
k <- length(indices)-1
for(i in 1:k) {
diffs[i] = indices[i+1]-indices[i]
}
print(diffs)
## [1] 767 622 475 406 409 361 420 390 377 422
The above code determines how many columns of the full training data correspond to each digit. Since each matrix \(A_i\) would have exactly \(256\) rows but at least \(361\) columns, every such system has more unknowns than equations, confirming that the full training data cannot be used directly for this approach.
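To see this concretely, here is a minimal sketch (assuming the digit-0 block of train_data has full row rank) showing that an arbitrary test image is reproduced almost exactly by that block, so the residual would be uninformative. It uses pinv from the pracma package loaded above:
# project a test image onto the span of ALL digit-0 training images (a 256 x 767 block)
A0 <- train_data[, indices[1]:(indices[2] - 1)]
b  <- test_data[, 4321]                  # an arbitrary test image
x  <- pinv(A0) %*% b                     # minimum-norm solution of the wide system
sqrt(sum((b - A0 %*% x)^2))              # residual norm: essentially zero
Because the residual is (numerically) zero regardless of which digit \(b\) actually shows, such a system cannot discriminate between digits.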
We will use train_data_1000 to perform least squares. Below are three implementations of least squares: via the Normal Equations, via the QR decomposition, and via the SVD. Given a target vector, each method determines the digit whose system yields the smallest residual. Note that for each approach we must first partition the sampled training data into its ten digit blocks. The Normal Equations approach is straightforward and just uses R’s solve() method. The QR approach leverages the norm-preserving property of orthogonal matrices to compute the magnitude of the residual without forming the residual vector itself: writing the full QR decomposition as \(A = \overline{Q}\,\overline{R}\) with \(A\) of size \(m \times n\), we have \(||Ax - b||_2 = ||\overline{R}x - \overline{Q}^{\top}b||_2\). The last \(m-n\) rows of \(\overline{R}\) are zero, so those components of the residual cannot be reduced by any choice of \(x\), while (when \(A\) has full column rank) the remaining components are zero at the minimizer \(x^*\). Hence
\[ ||Ax^* - b||_2 = ||\hat{Q}^{\top}b||_2 \]
where \(\hat{Q}\) denotes the last \(m-n\) columns of \(\overline{Q}\). For the SVD approach, we again leverage orthogonality to compute the magnitude of the residual without forming the residual vector. Given the full SVD \(A = U \Sigma V^{\top}\), we have \(||Ax - b||_2 = ||\Sigma V^{\top}x - c||_2\) with \(c = U^{\top}b\); the rows of \(\Sigma\) below row \(r\) are zero, so those components of the residual are fixed, while the first \(r\) components vanish at the minimizer \(x^*\). Hence
\[ ||Ax^* -b||_2 = \sqrt{\sum_{i=r+1}^m c_i^2} \]
where \(r\) is the number of columns of \(A\) (equivalently, the number of nonzero singular values when \(A\) has full column rank); in this case \(m = 256\) and \(r = 100\). Here is the code for all of this:
## b is a sample test vector to be classified
## data is the training data to build all of the systems of equations for each digit (i.e. train_data_1000)
## Least Square with the Normal equations
digit_NormLS = function(b, data = train_data_1000){
  bestError = 1000000
  digitrec = 100
  Ind <- seq(1, 1000, by = 100)
  for(i in (1:10)) {
    # get the 100 training columns for digit i-1
    k <- Ind[i]
    range <- c(k:(k+99))
    A <- data[,range]
    # normal equations
    M <- t(A) %*% A
    v <- t(A) %*% b
    # compute residual
    x.star <- solve(M, v)
    r <- vnorm(b - A %*% x.star)
    # update variables
    if(r < bestError) {
      bestError <- r
      digitrec <- i-1
    }
  }
  return(digitrec)
}
## Least Squares with QR Decomposition
digit_QRLS = function(b, data = train_data_1000){
  bestError = 1000000
  digitrec = 100
  Ind <- seq(1, 1000, by = 100)
  for(i in (1:10)) {
    # get the 100 training columns for digit i-1
    k <- Ind[i]
    range <- c(k:(k+99))
    A <- data[,range]
    # full QR with R's qr function
    out <- qr(A)
    Q.bar <- qr.Q(out, complete = TRUE)
    n <- ncol(qr.Q(out))
    m <- ncol(Q.bar)
    Q.hat <- Q.bar[,((n+1):m)]
    # compute residual without forming it: ||Ax* - b|| = ||Q.hat^T b||
    r <- vnorm(t(Q.hat) %*% b)
    # update variables
    if(r < bestError) {
      bestError <- r
      digitrec <- i-1
    }
  }
  return(digitrec)
}
## Least Squares via the SVD
digit_SVDLS = function(b, data = train_data_1000){
  bestError = 1000000
  digitrec = 100
  Ind <- seq(1, 1000, by = 100)
  for(i in (1:10)) {
    # get the 100 training columns for digit i-1
    k <- Ind[i]
    range <- c(k:(k+99))
    A <- data[,range]
    # full SVD with R's built-in function
    out <- svd(A, nu = 256, nv = 100)
    c <- t(out$u) %*% b
    # compute residual from the trailing components of U^T b
    r <- sqrt(sum(c[101:256]^2))
    # update variables
    if(r < bestError) {
      bestError <- r
      digitrec <- i-1
    }
  }
  return(digitrec)
}
# tests
digit_NormLS(test_data[,1],train_data_1000)
## [1] 0
digit_QRLS(test_data[,1],train_data_1000)
## [1] 0
digit_SVDLS(test_data[,1],train_data_1000)
## [1] 0
digit_NormLS(test_data[,4321],train_data_1000)
## [1] 4
digit_QRLS(test_data[,4321],train_data_1000)
## [1] 4
digit_SVDLS(test_data[,4321],train_data_1000)
## [1] 4
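As a quick sanity check before timing the methods on the full test set, the following sketch (which assumes vnorm, from the sourced helper file, computes the Euclidean norm) verifies numerically that the direct residual, the QR formula, and the SVD formula agree on a single digit block:
# sanity check: the three residual computations should agree on the digit-0 block
A <- train_data_1000[, 1:100]                  # 256 x 100 block for digit 0
b <- test_data[, 1]                            # a sample target vector
x.star <- solve(t(A) %*% A, t(A) %*% b)        # normal-equations solution
r.direct <- vnorm(b - A %*% x.star)            # direct residual norm
out <- qr(A)
Q.bar <- qr.Q(out, complete = TRUE)
r.qr <- vnorm(t(Q.bar[, 101:256]) %*% b)       # QR formula
cvec <- t(svd(A, nu = 256)$u) %*% b
r.svd <- sqrt(sum(cvec[101:256]^2))            # SVD formula
c(r.direct, r.qr, r.svd)                       # all three should match up to rounding error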
The code chunk below determines the accuracy and runtime of each of the three least squares methods implemented above on test_data.
Timing = function(method){
  start_time = Sys.time()
  correct = 0
  wrong = 0
  for(i in 1:(ncol(test_data))){
    RecognizedDigit = method(as.matrix(test_data[,i]))
    if (test_label[i] == RecognizedDigit){
      correct = correct + 1
    } else {
      wrong = wrong + 1
    }
  }
  accuracy = correct / (correct + wrong)
  end_time = Sys.time()
  return(list(time = end_time - start_time, accuracy = accuracy))
}
Timing(digit_NormLS)
## $time
## Time difference of 1.712391 mins
##
## $accuracy
## [1] 0.9150355
Timing(digit_QRLS)
## $time
## Time difference of 9.534597 mins
##
## $accuracy
## [1] 0.9150355
Timing(digit_SVDLS)
## $time
## Time difference of 15.29439 mins
##
## $accuracy
## [1] 0.9150355
Now we use the SVD for image recognition by constructing a “best basis” for the images of each digit. For each digit \(i\), we use train_data_1000 to compute the best \(J\)-dimensional basis \(U_J^{(i)}\) for that digit, given by the first \(J\) left singular vectors of its block of training images. These bases are column-bound together into \(U_{sampled}\). (Note: after experimenting with a few different choices, \(J=17\) was chosen as it gives good classification rates.) Thus, \(U_{sampled}\) is of size \(256 \times 170\).
# matrix to construct
U_sampled <- matrix(data=NA,nrow=256,ncol=170)
# indices for the training data and the U_sampled matrix
Ind <- seq(1,1000,by=100)
U.ind <- seq(1,170,by=17)
for(i in (1:10)) {
  # get the 100 training columns for digit i-1
  j <- Ind[i]
  range <- c(j:(j+99))
  A <- train_data_1000[,range]
  # compute J-dimensional basis with the SVD
  UJi <- svd(A)$u[,(1:17)]
  # append to U_sampled
  j <- U.ind[i]
  range <- c(j:(j+16))
  U_sampled[,(range)] <- UJi
}
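As a quick check (a small sketch), each 17-column block of \(U_{sampled}\) should have orthonormal columns, since it consists of left singular vectors of the corresponding digit block:
# the digit-0 block of U_sampled should satisfy t(U0) %*% U0 = I up to rounding error
U0 <- U_sampled[, 1:17]
max(abs(t(U0) %*% U0 - diag(17)))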
Using the bases we constructed above, we now write a function called digit_SVD which takes as input a target vector \(b\) and a matrix of stacked bases such as \(U_{sampled}\). We orthogonally project \(b\) onto each basis \(U_J^{(i)}\) using the projection matrix
\[ \mathbf{P} = \sum_{j = 1}^{J} u_ju_j^{\top}, \]
where \(u_1, \ldots, u_J\) are the columns of \(U_J^{(i)}\). Thus, the projection onto each basis is given by \(\mathbf{P}b\), and by the norm-preserving properties of orthonormal columns, the magnitude of the residual (i.e., the error) is given by
\[ ||(I-\mathbf{P})b||_2 = ||b - \mathbf{P}b||_2. \] The target vector \(b\) is classified as the digit \(i\) with the smallest error. Here is the code to do this:
digit_SVD = function(b, UJ){
  bestError = 1000000
  digitrec = 100
  # indices for the 17-column blocks of UJ
  U.ind <- seq(1,170,by=17)
  for(i in (1:10)) {
    # get the basis for digit i-1
    j <- U.ind[i]
    range <- c(j:(j+16))
    UJi <- UJ[,(range)]
    # projection matrix onto the span of the basis
    P <- matrix(data=0,nrow=256,ncol=256)
    for(k in (1:17)) {
      P <- P + (UJi[,k] %*% t(UJi[,k]))
    }
    # compute residual of the projection
    r <- vnorm(b - P %*% b)
    # update variables
    if(r < bestError) {
      bestError <- r
      digitrec <- i-1
    }
  }
  return(digitrec)
}
# tests
digit_SVD(test_data[,1],U_sampled)
## [1] 0
digit_SVD(test_data[,4321],U_sampled)
## [1] 4
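As an aside, forming the full \(256 \times 256\) projection matrix for every digit is not strictly necessary: since the columns of \(U_J^{(i)}\) are orthonormal, \(\mathbf{P}b = U_J^{(i)}\bigl((U_J^{(i)})^{\top}b\bigr)\). The following sketch (a hypothetical variant, digit_SVD_fast, not used in the timings below) computes the same residuals with far fewer operations:
# equivalent but cheaper: project b without forming the 256 x 256 projector
digit_SVD_fast = function(b, UJ){
  bestError = Inf
  digitrec = NA
  U.ind <- seq(1,170,by=17)
  for(i in (1:10)) {
    UJi <- UJ[, U.ind[i]:(U.ind[i]+16)]
    # residual of the orthogonal projection onto the columns of UJi
    r <- vnorm(b - UJi %*% (t(UJi) %*% b))
    if(r < bestError) {
      bestError <- r
      digitrec <- i-1
    }
  }
  return(digitrec)
}
digit_SVD_fast(test_data[,1], U_sampled)   # should agree with digit_SVD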
Unlike with least squares, in this approach we can use the full training data to build each matrix \(U_{J}^{(i)}\) containing the best \(J\)-dimensional basis for digit \(i\). We column-bind these together into \(U_{all}\), again choosing \(J=17\), so \(U_{all}\) is also of size \(256 \times 170\).
# matrix to construct
U_all <- matrix(data=NA,nrow=256,ncol=170)
for(i in (1:10)) {
  # get the columns of the full training data for digit i-1
  j <- indices[i]
  jj <- j + diffs[i] - 1
  range <- c(j:jj)
  A <- train_data[,range]
  # compute J-dimensional basis with the SVD
  UJi <- svd(A)$u[,(1:17)]
  # append to U_all
  j <- U.ind[i]
  range <- c(j:(j+16))
  U_all[,(range)] <- UJi
}
# tests
digit_SVD(test_data[,1],U_all)
## [1] 0
digit_SVD(test_data[,4321],U_all)
## [1] 4
Now we test the accuracy and runtime for the SVD model constructed using \(U_{sampled}\) and \(U_{all}\).
Timing_SVD = function(U, datapoints = ncol(test_data)){
  start_time = Sys.time()
  correct = 0
  wrong = 0
  for(i in 1:datapoints){
    RecognizedDigit = digit_SVD(as.matrix(test_data[,i]), U)
    if (test_label[i] == RecognizedDigit){
      correct = correct + 1
    } else {
      wrong = wrong + 1
    }
  }
  accuracy = correct / (correct + wrong)
  end_time = Sys.time()
  return(list(time = end_time - start_time, accuracy = accuracy))
}
Timing_SVD(U_sampled)
## $time
## Time difference of 1.833941 mins
##
## $accuracy
## [1] 0.9597763
Timing_SVD(U_all)
## $time
## Time difference of 1.81714 mins
##
## $accuracy
## [1] 0.9597763
Image recognition with the SVD bases is slightly more accurate than least squares, with an accuracy of nearly \(96\)%; both methods, however, appear to be reliable tools for this task. There are a few notes to make about runtime. First, comparing \(U_{all}\) to \(U_{sampled}\), there is no difference in accuracy and a negligible difference in runtime. Second, comparing the SVD-basis approach to the least squares implementations, it runs in about the same time as the Normal Equations version but is far faster than the QR and SVD least squares versions. Overall, the SVD-basis approach seems preferable, as it is slightly more accurate and no slower.