
How to Perform Regression with more Predictors than Observations

Published on:
April 10, 2023
Published by:
Professor Ishwar Sethi
This post was originally published by one of our partners on:
https://iksinc.tech/how-to-perform-regression-with-more-predictors-than-observations/

A common scenario in multiple linear regression is to have a large set of observations/examples wherein each example consists of a set of measurements made on a few independent variables, known as predictors, and the corresponding numeric value of the dependent variable, known as the response.  These examples are then used to build a regression model of the following form:


y_i = β_0 + β_1 x_{i1} + β_2 x_{i2} + ⋯ + β_d x_{id} + ϵ_i

The equation states that the response y_i for the ith input vector, or record, is a function of d+1 constants (the βs), the values of the d predictors for the ith observation, and an error term ϵ_i. The term β_0 is known as the bias/intercept. The model is often written in matrix-vector notation as

y = X β + ϵ,

where y is a column vector of N elements, X is an N × (d + 1) matrix of the N observations, and ϵ is a column vector of error terms. Note that I have absorbed the bias term into the coefficient vector β, so each record now consists of d + 1 components with the first component equal to 1.

The model parameters, the βs, are determined by the least squares method to minimize the residual sum of squares (RSS), a measure of the difference between the actual and predicted outputs. When we have more observations than features, this leads to the following unique solution vector of regression coefficients:

β̃ = (X^T X)^{-1} X^T y

The vector of predicted values is then given by the following expression

ỹ = X β̃ = X (X^T X)^{-1} X^T y

This approach to regression using least squares is known as ordinary least squares (OLS) regression. In some regression applications, we also have multiple outputs. In such cases, the term multivariate regression is used, and the above model is expressed as

Y = X B + E,

where Y is the N × K response matrix, assuming we have K outputs. The matrix B is the coefficient matrix of size (d+1) × K; each column of B represents the set of regression coefficients relating one of the outputs to the inputs. The matrix E is the N × K matrix of errors. The least squares estimates of the regression coefficients in this case are given as

B̃ = (X^T X)^{-1} X^T Y
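
To make the closed-form estimate concrete, here is a minimal R sketch on made-up data (the data and variable names are mine, purely for illustration) that computes B̃ directly and checks it against R's built-in lm():

# Closed-form least squares on made-up data, compared against lm()
set.seed(1)
N <- 50; d <- 3; K <- 2
X0 <- matrix(rnorm(N * d), N, d)                  # N observations of d predictors
B_true <- matrix(c(1, -2, 0.5, 0.3, 1, -1), d, K) # true coefficients
Y <- 2 + X0 %*% B_true + matrix(rnorm(N * K, sd = 0.1), N, K)

X <- cbind(1, X0)                                 # absorb the intercept as a column of 1s
B_hat <- solve(t(X) %*% X) %*% t(X) %*% Y         # (X^T X)^{-1} X^T Y
print(B_hat)
print(coef(lm(Y ~ X0)))                           # should agree with B_hat
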
Fewer Observations and More Predictors

While OLS regression is a very popular method, it becomes unworkable when X^T X is singular and its inverse does not exist. This happens when the number of predictors, d, exceeds the number of observations, N. The OLS approach also breaks down when the predictors are highly correlated, so that the columns of X are not linearly independent; in that case the regression coefficients are unreliable. This situation is known as multicollinearity. Both of the above situations arise in many applications in fields such as bioinformatics, chemical and drug analysis, and neuroimaging data analysis.
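
As a quick illustration of the first problem, here is a small R sketch on made-up numbers showing that X^T X loses full rank, and therefore cannot be inverted, as soon as d exceeds N:

# With more predictors than observations, X^T X is rank-deficient
set.seed(2)
N <- 5; d <- 10
X <- matrix(rnorm(N * d), N, d)
XtX <- t(X) %*% X
cat("rank of X^T X:", qr(XtX)$rank, "out of", d, "\n")   # rank is at most N
tryCatch(solve(XtX),
         error = function(e) cat("solve() fails:", conditionMessage(e), "\n"))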

One potential solution for multicollinearity and having fewer observations than predictors is to use principal component analysis (PCA) to create a smaller, uncorrelated set of new features and then regress on those. This approach is known as principal component regression (PCR). A drawback of using PCA is that it aims only at preserving the variance in the predictor matrix X; the selected principal components may therefore not be very relevant for predicting the response variables. A better option is to look for those components of X that are most relevant for Y. This is what the method known as partial least squares (PLS) does. The method was developed in econometrics about fifty years ago and has been used extensively in chemometrics.
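
Before turning to PLS, here is a bare-bones sketch of the PCR idea, i.e. prcomp() followed by ordinary regression on a few leading component scores (made-up data; the pls package used later in this post does this, and more, for you):

# Sketch of PCR: principal components of X, then OLS on the leading scores
set.seed(3)
N <- 20; d <- 50
X <- matrix(rnorm(N * d), N, d)
y <- X[, 1] - 0.5 * X[, 2] + rnorm(N, sd = 0.2)

pc <- prcomp(X, center = TRUE, scale. = TRUE)
scores <- pc$x[, 1:3]                  # keep the first three components
pcr_fit <- lm(y ~ scores)              # regress the response on the scores
print(summary(pcr_fit)$r.squared)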

Basic Idea of Partial Least Squares Regression

In PCA, we decompose X^T X to determine the principal components; in other words, we work with the covariance of the predictors. In PLS, we work with the cross covariance X^T Y to simultaneously decompose X and Y and find components that explain X as well as Y. The N × d matrix X is decomposed/approximated as X = T P^T with T^T T = I, where I is the identity matrix. The number of columns in T, known as the X-score matrix, is the number of extracted components; these components are called latent vectors and serve the same purpose as the eigenvectors in PCA, and their number determines the accuracy of the resulting regression model. The matrix P is known as the X-loading matrix. Instead of regressing Y on X, Y is regressed on T and approximated as Y = T Q^T, where the matrix Q is called the Y-weights matrix. The steps to determine T, P, and Q are described below using an example. Once these matrices are known, the regression coefficients for predicting Y from X can be obtained.

PLS Regression Steps

There are a few slightly different versions of the PLS algorithm. I am going to describe an iterative algorithm, NIPALS, which stands for nonlinear iterative partial least squares, to extract the components (combinations of predictors) in PLS. I will illustrate the steps for performing PLS regression using a small made-up example. The example consists of profiles of three candidates described in terms of their numerical ratings on four traits. This part of the data, three observations and four predictors, forms the matrix X. Each candidate profile is evaluated by two experts who give their scores to each applicant; this forms the response matrix Y. The X and Y matrices are given below, and our objective is to build a regression model to predict the scores given by the two experts.

Id   Trait1   Trait2   Trait3   Trait4   Score1   Score2
 1        7        7        6        8        7        8
 2        5        4        6        5        5        6
 3        8        7        5        3        2        4

Step 1: Initialize two matrices, E0 = X and F0 = Y, and center and normalize each column of E0 and F0.

Step 2: Form the cross-product matrix S = E0^T F0 (i.e. X^T Y computed on the centered and scaled data) and perform its singular value decomposition (SVD). Designate the first left singular vector as w1. The vector w1 forms the first column of a weight matrix W, which is later used to express the regression coefficients relating Y to X. These steps in R, and their results, are shown below.

X = matrix(c(7,5,8,7,4,7,6,6,5,8,5,3), nrow=3, ncol=4)
Y = matrix(c(7,5,2,8,6,4), nrow=3, ncol=2)
E0 = scale(X)
F0 = scale(Y)
S0 = t(E0)%*%F0
s = svd(S0)
w1 = s$u[,1]
print(w1)
[1] -0.27586226 -0.04227804  0.64530686  0.71111999

Step 3: Using w1, calculate t1, the first component/column of the matrix T. Follow this up by calculating the normalized p1 and q1. These steps are shown below.

t1 = E0%*%w1
print(t1)
           [,1]
[1,]  1.0414819
[2,]  0.6281868
[3,] -1.6696687
p1 = t(E0)%*%t1/colSums(t(t1)%*%t1)   # normalized p1
q1 = t(F0)%*%t1/colSums(t(t1)%*%t1)   # normalized q1
print(p1)
           [,1]
[1,] -0.4489104
[2,] -0.2549863
[3,]  0.6777327
[4,]  0.6019191
print(q1)
          [,1]
[1,] 0.6604169
[2,] 0.6353619

Step 4: Having extracted the first latent vector and corresponding loading vectors, the matrices E0 and F0 are deflated by subtracting information related to the first latent vector. This produces deflated matrices E1 and F1 as shown in the calculations below.

E1 = E0 - t1%*%t(p1)
F1 = F0 - t1%*%t(q1)

Step 5: Calculate the cross-product matrix of E1 and F1 as was done in Step 2. With this new cross-product matrix, repeat Steps 3 and 4 and save the resulting w, t, p, and q vectors as the next columns of the matrices W, T, P, and Q, respectively. This yields the second component. We continue these steps until the deflated matrices are essentially zero or the desired number of components has been extracted, at which point the algorithm stops.

Carrying out the steps as shown above, we get the following result.

E1 = E0 - t1%*%t(p1)
F1 = F0 - t1%*%t(q1)
S1 = t(E1)%*%F1
s1 = svd(S1)
w2 = s1$u[,1]
print(w2)
[1] -0.5827933 -0.7163611  0.1092040 -0.3677679
t2 = E1%*%w2
print(t2)
           [,1]
[1,] -1.1766607
[2,]  1.3882964
[3,] -0.2116356
p2 = t(E1)%*%t2/colSums(t(t2)%*%t2)
q2 = t(F1)%*%t2/colSums(t(t2)%*%t2)
print(p2)
           [,1]
[1,] -0.5827933
[2,] -0.7163611
[3,]  0.1092040
[4,] -0.3677679
print(q2)
           [,1]
[1,] -0.2034234
[2,] -0.2874933
E2 = E1 - t2%*%t(p2)
F2 = F1 - t2%*%t(q2)
S2 = t(E2)%*%F2
print(S2)
              [,1]          [,2]
[1,] -3.296958e-18 -4.659511e-18
[2,]  6.426732e-17  9.082743e-17
[3,]  6.593917e-18  9.319021e-18
[4,]  4.998050e-17  7.063622e-17

We notice that the deflated matrices are now essentially zero, so no more components need to be extracted. Now that we have determined the matrices T and Q, we can predict Y as Ỹ = T Q^T. This calculation is shown below.

T = cbind(t1, t2)    # X-score matrix
Q = cbind(q1, q2)    # Y-weights matrix
Y_pred = T%*%t(Q)
print(Y_pred)
           [,1]          [,2]
[1,]  0.9271726  1.000000e+00
[2,]  0.1324532 -1.665335e-16
[3,] -1.0596259 -1.000000e+00

Well! These predicted numbers do not look right. But wait; let's look at the centered and normalized Y, i.e. the matrix F0.

print(F0)
           [,1] [,2]
[1,]  0.9271726    1
[2,]  0.1324532    0
[3,] -1.0596259   -1
attr(,"scaled:center")
[1] 4.666667 6.000000
attr(,"scaled:scale")
[1] 2.516611 2.000000

So what we have are predicted values that match the actual values well, but in the centered and scaled space. However, our interest is in predicting Y from X, so we need a relationship that regresses Y directly against X. This requires some algebraic manipulation wherein we define a new matrix

R = W (P^T W)^{-1}

and use it to relate X and Y directly in the familiar regression equation

Ỹ = X B_PLS, where B_PLS = R Q^T.

Since the matrices we are working with were derived from normalized data, an adjustment for the scaling is also needed. Finally, we need to adjust for the centering, i.e. the intercept term, before we can get the final prediction of Y in terms of X. This is done by subtracting the column means of the intermediate predictions from the original column means of Y, and adding the result to the respective columns to obtain the final predictions. Thus, we can express the final model as

Ỹ = X B_PLS,

where the matrix X now has an additional column of all 1s and B_PLS includes the intercept. Calculations for these adjustments are shown below.

W = cbind(w1, w2)                                     # weight matrix
P = cbind(p1, p2)                                     # X-loading matrix
R = W%*%solve(t(P)%*%W)                               # calculate R matrix
B1 = R%*%t(Q)                                         # B matrix before adjustment
B = diag(1/apply(X,2,sd))%*%B1%*%diag(apply(Y,2,sd))  # final B matrix
Y_int_pred = X%*%B                                    # intermediate result
print(Y_int_pred)
         [,1]     [,2]
[1,] 16.52966 14.07664
[2,] 14.52966 12.07664
[3,] 11.52966 10.07664
intercept = apply(Y,2,mean) - apply(Y_int_pred,2,mean)
print(intercept)
[1] -9.529659 -6.076640
YPred = matrix(c(Y_int_pred[,1]+intercept[1], Y_int_pred[,2]+intercept[2]), nrow=3, ncol=2)
print(YPred)
     [,1] [,2]
[1,]    7    8
[2,]    5    6
[3,]    2    4

Finally, we have the prediction, which happens to be 100% accurate in this case. Of course, the prediction is being made on the same data that was used to derive the model.
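
For convenience, the step-by-step calculations above can be collected into one small helper function. This is only a sketch of the same NIPALS iterations, reusing the X and Y matrices defined earlier; the function and variable names are mine, not taken from any package:

# Sketch: NIPALS-style PLS regression, wrapping the per-component steps above
pls_nipals <- function(X, Y, ncomp) {
  E <- scale(X); F <- scale(Y)
  W <- T <- P <- Q <- NULL
  for (a in 1:ncomp) {
    w   <- svd(t(E) %*% F)$u[, 1]      # weight vector from the cross-product SVD
    t_a <- E %*% w                     # score (latent) vector
    p   <- t(E) %*% t_a / sum(t_a^2)   # X-loading
    q   <- t(F) %*% t_a / sum(t_a^2)   # Y-weight
    E <- E - t_a %*% t(p)              # deflate X
    F <- F - t_a %*% t(q)              # deflate Y
    W <- cbind(W, w); T <- cbind(T, t_a)
    P <- cbind(P, p); Q <- cbind(Q, q)
  }
  R  <- W %*% solve(t(P) %*% W)
  B1 <- R %*% t(Q)                                        # coefficients in the scaled space
  B  <- diag(1/apply(X, 2, sd)) %*% B1 %*% diag(apply(Y, 2, sd))
  list(B = B, intercept = colMeans(Y) - colMeans(X %*% B))
}

fit <- pls_nipals(X, Y, ncomp = 2)
print(sweep(X %*% fit$B, 2, fit$intercept, "+"))   # should reproduce YPred above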

PCR vs. PLS

Since PCR, i.e. principal component analysis followed by OLS regression, is also applicable when we have more predictors than observations or highly correlated predictors, it's a good idea to compare PCR and PLS on a real data set. This is what we are going to do in this section using the pls package in R. There is another R package, plsdepot, that can be used instead. A PLS regression implementation in Python is also available in the scikit-learn library.

I will use the yarn data set available in the pls package. The data set contains near-infrared (NIR) spectra of 28 PET yarns (PET is a type of polyester), measured at 268 wavelengths, which serve as the predictors; the response is the yarn density. A plot of the spectra for all 28 observations is shown below.

Although the package can use cross-validation to select the number of components in the model, I will specify the number of components as 8. That way, we will get results for one through eight components, which will help us compare PLS and PCR. The package offers different cross-validation (CV) options; I will use leave-one-out (LOO) cross-validation since the number of examples is small. The following code shows the steps I followed to build the PLS regression model.

library(pls)
data("yarn")
str(yarn)
'data.frame':	28 obs. of  3 variables:
 $ NIR    : num [1:28, 1:268] 3.07 3.07 3.08 3.08 3.1 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : NULL
 $ density: num  100 80.2 79.5 60.8 60 ...
 $ train  : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
yarn.pls <- plsr(density ~ NIR, 8, data = yarn, validation = "LOO")
summary(yarn.pls)
Data:   X dimension: 28 268
        Y dimension: 28 1
Fit method: kernelpls
Number of components considered: 8

VALIDATION: RMSEP
Cross-validated using 28 leave-one-out segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
CV           27.46    4.600    3.900    2.090   0.7686   0.5004   0.4425   0.2966   0.2643
adjCV        27.46    4.454    3.973    2.084   0.7570   0.4967   0.4398   0.2926   0.2610

TRAINING: % variance explained
         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
X          46.83    98.38    99.46    99.67    99.85    99.97    99.98    99.99
density    98.12    98.25    99.64    99.97    99.99    99.99   100.00   100.00

We note that there is not much difference between the CV and adjusted CV (adjCV) error estimates. The model summary shows that the first two components explain over 98% of the variance in both the predictors and the response. This is also clear when we look at the RMSEP (root mean square error of prediction) for different numbers of components. With four components, the RMSEP on the training data is 0.41936, as shown below.

RMSEP(yarn.pls, newdata = yarn)
(Intercept)      1 comps      2 comps      3 comps      4 comps      5 comps      6 comps      7 comps      8 comps
   26.47630      3.62801      3.50180      1.59352      0.41936      0.29223      0.25136      0.10932      0.08832

Now, let's fit a PCR model with the same number of components.

yarn.pcr <- pcr(density ~ NIR, 8, data = yarn, validation = "LOO")
summary(yarn.pcr)
Data:   X dimension: 28 268
        Y dimension: 28 1
Fit method: svdpc
Number of components considered: 8

VALIDATION: RMSEP
Cross-validated using 28 leave-one-out segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
CV           27.46    30.82    4.276    2.558    2.612    1.643   0.4891   0.5288   0.3033
adjCV        27.46    31.43    4.264    2.547    2.618    1.609   0.4852   0.5258   0.2994

TRAINING: % variance explained
         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
X          52.17    98.60    99.47    99.70    99.88    99.97    99.98    99.99
density     5.50    98.15    99.40    99.58    99.95    99.99    99.99   100.00
RMSEP(yarn.pcr, newdata = yarn)
(Intercept)      1 comps      2 comps      3 comps      4 comps      5 comps      6 comps      7 comps      8 comps
    26.4763      25.7380       3.6016       2.0592       1.7244       0.5972       0.2864       0.2582       0.1262

Looking at the PCR model, we find that the RMSEP for four components is 1.7244. This means PCR requires more components than PLS to achieve a similar level of performance. Even with eight components, PLS gives a slightly lower error than PCR.
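
A convenient way to see this comparison at a glance is to plot the cross-validated RMSEP curves of the two fitted models side by side; here is a short sketch using the plot method that the pls package provides for RMSEP objects:

# Cross-validated RMSEP versus number of components for the two models
par(mfrow = c(1, 2))
plot(RMSEP(yarn.pls), main = "PLS")
plot(RMSEP(yarn.pcr), main = "PCR")
par(mfrow = c(1, 1))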

Summary

Let me summarize what I have discussed thus far.

  • Both PCR and PLS are able to perform regression when you have highly correlated predictors or more predictors than the number of examples.
  • PLS requires fewer components (latent vectors) than PCR to achieve the same level of prediction error. This is because PLS is based on the cross covariance between the predictors and the response, while PCR is based only on the variance of the predictors.
  • Another way to look at the difference between PLS and PCR is that PLS can be viewed as a supervised method of dimensionality reduction for regression while PCR is an unsupervised method for dimensionality reduction.
  • PLS can fit a single model to a multiple-response regression problem, unlike ordinary regression where a separate model is fitted for each response variable.
  • Finally, there is another family of approaches known as shrinkage methods; the best-known example is ridge regression. In shrinkage methods, none of the predictors are discarded and no predictor combinations are formed. Rather, an additional penalty term is introduced that constrains the magnitudes of the regression coefficients. PCR, PLS, and ridge regression behave similarly, although ridge regression is often preferred because it is easier to interpret. A minimal ridge regression sketch follows this list.
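
As promised above, here is a minimal ridge regression sketch on the same yarn data using the glmnet package (alpha = 0 selects the ridge penalty). This is only meant to show the mechanics of a shrinkage fit, not a tuned, apples-to-apples comparison with PLS or PCR:

# Ridge regression on the yarn data; alpha = 0 gives the ridge (L2) penalty
library(glmnet)
x <- yarn$NIR                                              # 28 x 268 predictor matrix
y <- yarn$density
cv_ridge <- cv.glmnet(x, y, alpha = 0, nfolds = nrow(x))   # leave-one-out CV
print(cv_ridge$lambda.min)                                 # penalty strength chosen by CV
ridge_pred <- predict(cv_ridge, newx = x, s = "lambda.min")
print(sqrt(mean((y - ridge_pred)^2)))                      # RMSE on the training data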

My simplified example is based on the following paper that you may want to consult for details.
