# functions for numerical analysis
source("https://drive.google.com/uc?export=download&id=10dNH3VbvxS8Z3OHjP4i9gRbtsf91VVBb")
# load the data
source("https://drive.google.com/uc?export=download&id=14EU7jxS8LcQ29WOvzAevXigl2wloccqF") 
require(expm)
require(pracma)

Description of Data

This project presents several machine learning approaches to recognizing handwritten digits (0-9) using least squares and the SVD. Here is a description of the data we use to do this:

train_data: a \(256 \times 4649\) matrix where each column is a \(16 \times 16\) image that has been flattened into a \(256 \times 1\) vector. Note that the data has been sorted, starting with images corresponding to 0, then 1, then 2, up to 9.

indices: an \(11 \times 1\) vector whose first ten entries give the column of train_data at which each digit's images begin; the final entry marks one past the last column, so consecutive differences count the images per digit.

train_data_1000: a \(256 \times 1000\) matrix that stores 100 samples of each digit in the columns, sorted in the order from 0 to 9.

test_data: a \(256 \times 4649\) matrix of test images for which we would like to predict the digit.

test_label: a \(4649\)-dimensional vector revealing the true label for each of the columns in test_data (this allows us to compare how accurate our prediction is).
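
For orientation, here is a quick check of the objects loaded above (a minimal sketch; the comments simply restate the dimensions described in this section):

dim(train_data)        # 256 x 4649
dim(train_data_1000)   # 256 x 1000
dim(test_data)         # 256 x 4649
length(test_label)     # 4649
head(indices)          # starting column of each digit's block in train_data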

Note that the reader is assumed to be familiar with common techniques of numerical analysis and computational linear algebra, including the normal equations, least squares, the QR factorization, and the singular value decomposition (SVD). For background on these topics, Timothy Sauer’s “Numerical Analysis” is recommended.

Part I: Investigating the Data

To investigate the data a bit, we start by extracting the columns of train_data corresponding to the first three values in indices. Then we reshape each column into a \(16 \times 16\) matrix and use the imPlot command imported above to view an image of the transpose of each matrix. The result is images of the digits 0, 1, and 2.

# save the three matrices as img0, img1, and img2 
img0 <- matrix(train_data[,indices[1]], nrow = 16)
img1 <- matrix(train_data[,indices[2]], nrow = 16)
img2 <- matrix(train_data[,indices[3]], nrow = 16)

# plot the images
imPlot(t(img0))

imPlot(t(img1))

imPlot(t(img2))

Part II: Least Squares

First we explore the method of least squares for classification. For each digit between 0 and 9, we will set up a system of equations

\[ A_0x=b, \quad A_1x=b, \quad \ldots, \quad A_9 x = b, \]

where \(A_i\) contains the flattened training images of digit \(i\) as its columns. Then, for each new flattened image \(b\) in test_data to be classified, we solve the least squares problem \(A_ix=b\) to find \(x^*\), form \(\hat{b}=A_ix^*\), and take the error to be the norm of the residual \(b-\hat{b}\). The target vector \(b\) is then classified as the digit \(i\) with the smallest error.

There are a few important notes regarding how we should use least squares in this context. First, we cannot build the systems from the full training data, since the whole point of the method is to distinguish the various digits. For example, if \(A_1\) were constructed from every training image of the digit 1, then any new image projected into the span of its columns would leave a residual of very small magnitude, no matter which digit it actually shows. In other words, the residuals would give us no way to discriminate between digits.
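
As a quick illustration of this point (a small sketch using the objects above; vnorm comes from the sourced helper functions), we can build the block of every training image of the digit 0 and project a test image of a completely different digit onto its column space:

# block of all zero-images in the full training data: 256 x 767
A0  <- train_data[, indices[1]:(indices[2] - 1)]
b   <- test_data[, 4321]                  # a test image classified later as a "4"
out <- qr(A0)
Q   <- qr.Q(out)[, 1:out$rank]            # orthonormal basis for col(A0)
vnorm(b - Q %*% (t(Q) %*% b))             # residual norm: tiny, as argued above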

Second, we can see why this happens by counting dimensions. Built from the full training data, each system \(A_ix = b\) would be underdetermined: each matrix \(A_i\) would have more unknowns than equations, or equivalently more columns than rows. A wide matrix of full rank maps onto all of \(\mathbb{R}^{256}\), so an exact solution exists for essentially any target vector and the residual carries no information. The \(256 \times 100\) blocks of train_data_1000, by contrast, have more rows than columns, so they cannot be surjective and least squares produces a meaningful residual for each digit. To make the dimensions explicit, consider the following code chunk:

diffs <- c()
k <- length(indices)-1
for(i in 1:k) { 
  diffs[i] = indices[i+1]-indices[i]
}
print(diffs)
##  [1] 767 622 475 406 409 361 420 390 377 422

The above code determines how many columns of the full training data correspond to each digit. Since each matrix \(A_i\) has exactly \(256\) rows, building the systems from the full data would give between \(361\) and \(767\) columns per digit, i.e. more unknowns than equations in every case.

We will therefore use train_data_1000 to perform least squares. Below we include three implementations: via the normal equations, via the QR decomposition, and via the SVD. Given a target vector, each method finds the digit whose block yields the residual of minimum norm. Note that for each approach we must first partition the sampled training data into its ten \(256 \times 100\) blocks. The normal equations approach is straightforward and just uses R’s solve() function. The QR approach leverages the norm-preserving property of orthogonal matrices to compute the magnitude of the residual vector; in particular,

\[ ||Ax^* - b||_2 = ||\hat{Q}^{\top}b||_2 \]

when we use the full QR factorization \(A = \overline{Q}\,\overline{R}\), where \(\hat{Q}\) denotes the last \(m-n\) columns of \(\overline{Q}\) (the columns orthogonal to the column space of \(A\)). For the SVD approach, we again leverage the norm-preserving property of orthogonal matrices to compute the magnitude of the residual without forming the residual vector itself. In particular, given the full SVD \(A = U \Sigma V^{\top}\), we have

\[ ||Ax^* -b||_2 = \sqrt{\sum_{i=r+1}^m c_i^2} \]

where \(c = U^{\top}b\), \(m = 256\) is the number of rows, and \(r\) is the number of (nonzero) singular values of \(A\), i.e. its rank. Each block of train_data_1000 has 100 columns, so (assuming full column rank) \(r = 100\). Here is the code for all of this:

## b is a sample test vector to be classified
## data is the training data to build all of the systems of equations for each digit (i.e. train_data_1000)

## Least Square with the Normal equations

digit_NormLS= function(b,data=train_data_1000){
  bestError = 1000000 
  digitrec = 100 
  Ind <- seq(1,1000,by=100)
  for(i in (1:10)) { 
    # get the data 
    k <- Ind[i]
    range <- c(k:(k+99))
    A <- data[,range]
    # normal equations 
    M <- t(A) %*% A
    v <- t(A) %*% b 
    # compute residual 
    x.star <- solve(M,v)
    r <- vnorm(b - A%*%x.star)
    # update variables 
    if(r < bestError) { 
      bestError <- r
      digitrec <- i-1
    }
  }
  return(digitrec)
}

## Least Squares with QR Decomposition

digit_QRLS= function(b,data=train_data_1000){
  bestError = 1000000 
  digitrec = 100 
  Ind <- seq(1,1000,by=100)
  for(i in (1:10)) { 
    # get the data
    k <- Ind[i]
    range <- c(k:(k+99))
    A <- data[,range]
    # full QR with R's qr function 
    out <- qr(A)
    Q.bar <- qr.Q(out,complete=TRUE)
    n <- ncol(qr.Q(out))
    m <- ncol(Q.bar)
    Q.hat <- Q.bar[,((n+1):m)]
    # compute residual 
    r <- vnorm(t(Q.hat)%*%b)
    # update variables 
    if(r < bestError) { 
      bestError <- r
      digitrec <- i-1
    }
  }
  return(digitrec)
}

## Least Squares via the SVD 

digit_SVDLS= function(b,data=train_data_1000){
  bestError = 1000000 
  digitrec = 100 
  Ind <- seq(1,1000,by=100)
  for(i in (1:10)) { 
    # get the data
    k <- Ind[i]
    range <- c(k:(k+99))
    A <- data[,range]
    # full SVD with R's built-in function
    out <- svd(A,nu=256,nv=100)
    c <- t(out$u) %*% b
    # compute residual 
    r <- sqrt(sum(c[101:256]^2))
    # update variables 
    if(r < bestError) { 
      bestError <- r
      digitrec <- i-1
    }
  }
  return(digitrec)
}

# tests
digit_NormLS(test_data[,1],train_data_1000)
## [1] 0
digit_QRLS(test_data[,1],train_data_1000)
## [1] 0
digit_SVDLS(test_data[,1],train_data_1000)
## [1] 0
digit_NormLS(test_data[,4321],train_data_1000)
## [1] 4
digit_QRLS(test_data[,4321],train_data_1000)
## [1] 4
digit_SVDLS(test_data[,4321],train_data_1000)
## [1] 4
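
As a quick consistency check (a sketch reusing the same objects; in exact arithmetic the three residual formulas coincide), we can compute the residual norm for the digit-0 block and a single test image in all three ways:

A      <- train_data_1000[, 1:100]           # the 100 sampled images of "0"
b      <- test_data[, 1]
x.star <- solve(t(A) %*% A, t(A) %*% b)      # normal equations
r.norm <- vnorm(b - A %*% x.star)
out    <- qr(A)
Q.bar  <- qr.Q(out, complete = TRUE)
r.qr   <- vnorm(t(Q.bar[, 101:256]) %*% b)   # full QR formula
cvec   <- t(svd(A, nu = 256)$u) %*% b
r.svd  <- sqrt(sum(cvec[101:256]^2))         # full SVD formula
c(r.norm, r.qr, r.svd)                       # three (nearly) identical values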

Part III: Accuracy and Runtime

The code chunk below determines the accuracy and runtime for each of the three least squares methods implemented above using test_data.

Timing = function(method){
  start_time = Sys.time()
  correct = 0
  wrong = 0
  for(i in 1:(ncol(test_data))){
    RecognizedDigit = method(as.matrix(test_data[,i]))
    if (test_label[i]==RecognizedDigit){
      correct = correct + 1
    }else{
      wrong = wrong + 1
    }
  }
  accuracy = correct / (correct + wrong)
  end_time = Sys.time()
  return(list(time = end_time - start_time, accuracy = accuracy))
}
 
Timing(digit_NormLS)
## $time
## Time difference of 1.712391 mins
## 
## $accuracy
## [1] 0.9150355
Timing(digit_QRLS)
## $time
## Time difference of 9.534597 mins
## 
## $accuracy
## [1] 0.9150355
Timing(digit_SVDLS)
## $time
## Time difference of 15.29439 mins
## 
## $accuracy
## [1] 0.9150355
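
One note on these run-times: each of the three functions refactors the ten training blocks from scratch for every test vector. Since the blocks never change, the factorizations could be computed once and reused. The sketch below (not part of the timings above; digit_QRLS_fast is a hypothetical name, and vnorm again comes from the sourced helpers) shows this idea for the QR variant:

# precompute the ten complement bases Q.hat once
Ind    <- seq(1, 1000, by = 100)
Q.hats <- lapply(Ind, function(k) {
  out   <- qr(train_data_1000[, k:(k + 99)])
  Q.bar <- qr.Q(out, complete = TRUE)
  Q.bar[, 101:256]
})

digit_QRLS_fast <- function(b) {
  errs <- sapply(Q.hats, function(Qh) vnorm(t(Qh) %*% b))
  which.min(errs) - 1
}

digit_QRLS_fast(test_data[, 1])   # should again return 0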

Part IV: Singular Value Decomposition

Now we use the SVD for image recognition by constructing a “best basis” for the images of each digit. For each digit \(i\), we use the corresponding block of train_data_1000 to compute the best \(J\)-dimensional basis \(U_J^{(i)}\) for that digit, namely its first \(J\) left singular vectors. The ten bases are then bound together column-wise into \(U_{sampled}\). (Note: after experimenting with a few different choices, \(J=17\) was chosen as it gives good classification rates.) Thus, \(U_{sampled}\) is of size \(256 \times 170\).

# matrix to construct 
U_sampled <- matrix(data=NA,nrow=256,ncol=170)

# indices for the test data and U_sampled matrix  
Ind <- seq(1,1000,by=100)
U.ind <- seq(1,170,by=17)

for(i in (1:10)) { 
  # get the data 
  j <- Ind[i]
  range <- c(j:(j+99))
  A <- train_data_1000[,range]
  # compute J-dimensional basis with SVD
  UJi <- svd(A)$u[,(1:17)]
  # append to U_sampled
  j <- U.ind[i]
  range <- c(j:(j+16))
  U_sampled[,(range)] <- UJi
}
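
As a quick sanity check (a small sketch), each 17-column block of U_sampled consists of left singular vectors and so should be orthonormal up to machine precision:

B <- U_sampled[, 1:17]            # the basis for digit 0
max(abs(t(B) %*% B - diag(17)))   # should be on the order of machine epsilon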

Using the bases constructed above, we now write a function called digit_SVD which takes as input a target vector \(b\) and a matrix of stacked bases such as \(U_{sampled}\). We orthogonally project \(b\) onto each basis \(U_J^{(i)}\) using the projection matrix

\[ \mathbf{P} = \sum_{j = 1}^{J} u_ju_j^{\top}, \]

where \(u_1, \ldots, u_J\) denote the columns of \(U_J^{(i)}\). Thus, our projection onto each basis is given by \(\mathbf{P}b\), and the magnitude of the residual (i.e. the error) is given by

\[ ||(I-\mathbf{P})b||_2 = ||b - \mathbf{P}b||_2. \]

The target vector \(b\) is again classified as the digit \(i\) with the smallest error. Here is the code to do this:

digit_SVD = function(b, UJ){
  bestError = 1000000
  digitrec = 100
  # indices for UJ
  U.ind <- seq(1,170,by=17)
  for(i in (1:10)) { 
    # get UJi
    j <- U.ind[i]
    range <- c(j:(j+16))
    UJi <- UJ[,(range)] 
    # projection matrix
    P <- matrix(data=0,nrow=256,ncol=256)
    for(k in (1:17)) { 
      P <- P + (UJi[,k] %*% t(UJi[,k]))
    }
    # compute residual 
    r <- vnorm(b - P%*%b)
    # update variables 
    if(r < bestError) { 
      bestError <- r 
      digitrec <- i-1
    }
  }
  return(digitrec)
}

# tests
digit_SVD(test_data[,1],U_sampled)
## [1] 0
digit_SVD(test_data[,4321],U_sampled)
## [1] 4
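
A side note on the implementation: because the columns of each \(U_J^{(i)}\) are orthonormal, the projection matrix can be written compactly as \(\mathbf{P} = U_J^{(i)}\,(U_J^{(i)})^{\top}\), so the residual can also be computed without forming the \(256 \times 256\) matrix \(\mathbf{P}\) explicitly. A minimal equivalent sketch for digit 0:

UJi <- U_sampled[, 1:17]             # basis for digit 0
b   <- test_data[, 1]
vnorm(b - UJi %*% (t(UJi) %*% b))    # same residual that digit_SVD computes for digit 0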

In this case, we can also use the full training data to build the matrix \(U_{J}^{(i)}\) containing the best \(J\)-dimensional basis for each digit \(i\): since each basis keeps only \(J = 17\) directions, using more images per digit does not allow the projection to reproduce every target vector exactly, so the residuals remain informative. We column bind these bases together into \(U_{all}\), again choosing \(J=17\). Thus, \(U_{all}\) is of size \(256 \times 170\).

# matrix to construct 
U_all <- matrix(data=NA,nrow=256,ncol=170)

for(i in (1:10)) { 
  # get the data 
  j <- indices[i]
  jj <- j + diffs[i] - 1
  range <- c(j:jj)
  A <- train_data[,range]
  # compute J-dimensional basis with SVD
  UJi <- svd(A)$u[,(1:17)]
  # append to U_all
  j <- U.ind[i]
  range <- c(j:(j+16))
  U_all[,(range)] <- UJi
}
  
# tests
digit_SVD(test_data[,1],U_all)
## [1] 0
digit_SVD(test_data[,4321],U_all)
## [1] 4

Part V: Accuracy and Runtime

Now we test the accuracy and runtime of the SVD approach using the models built from \(U_{sampled}\) and \(U_{all}\).

Timing_SVD = function(U,datapoints = ncol(test_data)){
  start_time = Sys.time()
  correct = 0
  wrong = 0
  for(i in 1:datapoints){
    RecognizedDigit = digit_SVD(as.matrix(test_data[,i]),U)
    if (test_label[i]==RecognizedDigit){
    correct = correct + 1
    }else{
      wrong = wrong + 1
    }
  }
  accuracy = correct / (correct + wrong)
  end_time = Sys.time()
  return(list(time = end_time - start_time, accuracy = accuracy))
}

Timing_SVD(U_sampled)
## $time
## Time difference of 1.833941 mins
## 
## $accuracy
## [1] 0.9597763
Timing_SVD(U_all)
## $time
## Time difference of 1.81714 mins
## 
## $accuracy
## [1] 0.9597763

Part VI: Conclusion

Image recognition with the SVD bases is slightly more accurate than least squares, with an accuracy of nearly \(96\)%, though both methods appear to be reliable tools for this task. A few notes on run-time: comparing \(U_{all}\) to \(U_{sampled}\), there is no difference in accuracy and a negligible difference in runtime. Comparing either of these to the least squares methods, however, there is a very noticeable difference in run-time. Thus, the SVD approach seems preferable, as it is slightly more accurate and much faster.