Class 4 – ML Log: Pseudo-inverse estimator and Gradient Descent approaches to Linear Regression


Actuaries beginning to get acquainted with AI/ML often find it a bit daunting, for a couple of reasons. In day-to-day work, actuaries use the results of statistical analyses, such as standard rate tables, and spend less time deriving the regression models themselves; over time, the rigor of that knowledge gets rusty. The other reason is that the AI/ML field focuses sharply on a specific, usually large, dataset. In contrast, the actuarial field deals with a diverse set of factors: demographic events such as death or illness, man-made or natural risk events, operational expenses, forward interest rates, and so on. The actuarial focus is more on complex interactions within well-defined datasets and on financial outcomes.

Actuaries learn linear regression early on; the CS1 subject covers correlation analysis to assess the strength of the relationship between variables. In contrast to the ML/AI structure, CS1 starts by defining uncorrelated error terms ei at the outset. Then, without explicitly mentioning a loss function, the least squares method is used to derive the regression parameters in terms of variances and covariances. This approach works for a small set of known parameters but becomes tedious for datasets with very many features; image classification, for example, can easily have tens of thousands of pixels as inputs.

This blog details the alternative approach adopted by the AI/ML field. More specifically, the general formulation of linear regression includes basis functions that help model non-linear relations (linear regression refers to linearity only with respect to the regression coefficients). It is crucial to understand the matrix formulation of the problem: the matrix operations can involve a densely packed sample set, with each observation carrying a large number of features and parameters to train.


This is part of the learning log from the 3rd module, Machine Learning Basics for Real-World, from the certification course on Digital Health and Imaging. Links to the module learnings are below:


Learning Summary – Class 4; 2 July 2022

  • Linear Regression – One dimensional 
  • Getting the ‘Best’ – Enter the loss function
  • Linear Regression as Optimization Problem
  • Scaling up: General formulation of Linear Regression
  • Linear Regression training model representation in matrices
  • Learning through cost optimization – squared error
  • Pseudo-inverse estimator – the need for regularization
  • (Alternative) Gradient Descent Solution for Least Squares

Linear Regression

Linear regression has been around as a statistical model for a long time, and many variations have been tried, often under entirely different names. Linear regression assumes a linear relationship between the input variables (x) and the single output variable (y). The method is referred to as simple linear regression when there is a single input variable (x).

As we saw before, a loss function is used as a guide to train a machine learning model from sample data. The most common method for linear regression has been ordinary least squares.  

Linear Regression – One dimensional

We have N data samples of input and response pairs; the data given to us is (x1, y1), (x2, y2), . . . , (xN, yN). The input data dimension is just 1. To fit a linear regression model, we assume that the relationship between input and response is linear (hopefully!).

Our problem statement is:

Given data {(x1, y1), (x2, y2), . . . , (xN, yN)}, find a straight line that best fits this set of points.

We can rephrase this as: given the set F of all straight lines, choose the straight line that best fits this set of points.

We can restrict F to the set of all straight lines passing through the origin. This is reasonable because, with some pre-processing (for example, centering the data), we can transform the data so that a line through the origin is enough.

Hence, define F as F = {fw(x) = wx : w ∈ ℝ}.

In other words, F is parameterized by w; our aim is to learn w from the given data.

Getting the ‘Best’ – Enter the loss function

As seen in the last class, the general machine learning approach uses a loss function to determine the empirical loss. A loss function takes two inputs, the response given by the model and the corresponding ground truth, and estimates the cost of the decisions made from the model.

The least squared loss function is defined as:

\mathbf{L\left ( f_{w} \right ) = \sum_{n=1}^{N}\left ( y_{n} - f_{w}\left ( x_{n} \right ) \right )^{2}}

  • (yn − fw(xn))² is the per-sample loss
  • L(fw) is the total loss over the N samples
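
As a tiny illustration of this total loss (my own sketch in NumPy, not from the course material), for the one-dimensional model fw(x) = wx:

```python
import numpy as np

def squared_loss(w, x, y):
    """Total loss L(f_w) = sum_n (y_n - w * x_n)^2 for the 1-D model f_w(x) = w * x."""
    return np.sum((y - w * x) ** 2)

# e.g. squared_loss(2.0, x, y) evaluates a candidate slope w = 2.0 on data arrays x, y
```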

Linear Regression as Optimization Problem

With the loss function in place, our problem statement becomes:

Find f in F that minimizes L(f); or equivalently

Find w ∈ ℝ that minimizes empirical risk L(fw).

A solution to this problem is given by dL/dw = 0; the least squared loss is a neat convex, differentiable function, so a closed-form solution exists.

First, we calculate the derivative of L with respect to w and then set it to zero (the gradient is 0 at the minimum of a convex function).
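
Worked out for the one-dimensional model fw(x) = wx (my reconstruction of this step; it follows directly from the loss defined above):

\mathbf{\frac{dL}{dw} = -2\sum_{n=1}^{N} x_{n}\left ( y_{n} - w x_{n} \right ) = 0 \;\;\Rightarrow\;\; w^{*} = \frac{\sum_{n=1}^{N} x_{n} y_{n}}{\sum_{n=1}^{N} x_{n}^{2}}}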

Hence, we get, in terms of the observed data, the solution that minimizes the least squared error. Note that this is on the observed sample only; we don’t know how it will perform on the test data or on any new observations.
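
A tiny NumPy check of this estimator on made-up numbers (my own sketch, not from the course):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])      # roughly y = 2x with a little noise

w = np.sum(x * y) / np.sum(x ** 2)      # w* = sum(x_n * y_n) / sum(x_n^2)
print(w)                                 # approximately 2
```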

Scaling up: General formulation of Linear Regression

In real life, x will have many more features than one. In these cases, we can extend the line concept to a hyperplane. This is where thinking in terms of matrices helps: matrices present equations with very many variables in an elegant, compact form.

We can frame the problem as:

Given training data D = {(xn, yn)}, n = 1, . . . , N, where xn ∈ ℝD is the input in D dimensions (or features) and yn ∈ ℝ is the response in one dimension, i.e. a single real value.

A straightforward adaptation of the single-feature case to a D-dimensional input x is F = {fw(x) = wTx : w ∈ ℝD}, where wT = [w1, . . . , wD]. Another common representation is to denote the set of functions F as a set of hypotheses, or H.

This simple model is linear in both the parameters and the input variables, which limits its ability to adapt to nonlinear relationships. Linearity is often a reasonable assumption when many inputs influence the output, but in general it is unlikely that the true underlying function is linear. Similarly, assuming that classification boundaries are linear hyperplanes is often unreasonable.

The above model can be augmented by replacing the input variables with nonlinear basis functions of the input variables. For example, we can expand H to be the set of functions that are linear combinations of a set of basis functions

φ1 , . . . , φM : ℝD → ℝ

Note that each φi is a mapping from the D-dimensional input to a real number, i.e. one dimension.

I initially struggled with this concept, especially comprehending the change in the input dimensions. Some examples from here and here will help to clarify.

Example 1 – Consider the input vector (1, x1, x2); we want to include the nonlinear interaction term x1x2 in the model.

Our solution could be: f(x) = 1 − 2x1 − 2x2 + 4x1x2 = φ1(x) − 2φ2(x) − 2φ3(x) + 4φ4(x)

with φ1(x) = 1, φ2(x) = x1, φ3(x) = x2, and φ4(x) = x1x2

Example 2: A polynomial basis (up to degree 2) for a three-dimensional input (x1, x2, x3) could be: {1, x1, x2, x3, x1x2, x1x3, x2x3, x1², x2², x3²}

Example 3: Consider a real-valued input x ∈ ℝ (input in 1-D).

Define the basis functions: φ1(x) = x, φ2(x) = x², …, φM(x) = x^M.

Then h ∈ H is of the form h(x) = w0 + w1x + w2x² + … + wMx^M, which means that H is the set of all polynomials in x with degree no greater than M.

The basic idea is simple: in addition to the original inputs, we add new inputs that are calculated as deterministic functions of the existing ones, and treat them just like the original inputs.
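
A minimal NumPy sketch of this idea for Example 1 (my own illustration; the numbers are made up):

```python
import numpy as np

# Toy data: three samples, each with two raw features x1 and x2 (made-up values).
X_raw = np.array([[0.5, 1.0],
                  [1.5, -0.5],
                  [2.0, 0.25]])

def basis_expand(X):
    """Map each sample (x1, x2) to (phi1, ..., phi4) = (1, x1, x2, x1*x2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])

Phi = basis_expand(X_raw)                 # design matrix in the new basis, shape (3, 4)
w = np.array([1.0, -2.0, -2.0, 4.0])      # the coefficients from f(x) in Example 1
y_hat = Phi @ w                           # f(x) = 1 - 2*x1 - 2*x2 + 4*x1*x2 per sample
print(y_hat)
```

The model is still linear in w; only the representation of the input has changed.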

Independent of the choice of basis functions, the regression parameters are calculated using the well-known equations for linear regression.

Mathematically, we define the feature mapping φ: ℝD → ℝM by φ(x) = [φ1(x), . . . , φM(x)]T.

So H is the set of linear functions of φ(x). To summarize, H = {h(x) = wTφ(x) : w ∈ ℝM},

where φ is a fixed nonlinear vector-valued function and w parameterizes a learned linear function fw. Then h = fw ◦ φ.

Our linear model is (please note I am reverting to f here from H, to be consistent with my course notes):

\mathbf{f\left ( x \right ) = b + \sum_{j=1}^{M} w_{j}\phi_{j}\left ( x \right )}

where wj are the model parameters and φj are the basis functions (they change the representation of x),

or

y = b + wTφ(x), where wT = [w1, . . . , wM] and φT = [φ1, . . . , φM]


Heads-up: this is the part I struggled with most; it always left a feeling of not knowing enough to be confident. I had been putting off revisiting my rusty linear algebra, especially matrix operations. Most machine learning implementations rely on the matrix representation for convenience, and in neural networks especially, one needs to be comfortable with it.


Linear Regression training model representation in matrices:

Let us simplify the model by setting φi(x) = xi (the identity mapping) and M = D. You might recognize this as the straightforward adaptation of linear regression mentioned above, from a 1-dimensional input to a D-dimensional input.

Model: y = b + wTx for an observation xn ∈ ℝD

Suppose we have N observations in our training data (or samples), each with D dimensions (or features). Then we can represent the above model, applied to all N samples, compactly through matrices.

Note the dimensions of the X matrix: N observations (rows), each with D features (columns). The ground truth is packed into a column vector with N rows, one for each observed outcome. So we get the set of predicted values in a compact form:

Y = XW + b

Note b: here it is a column vector with the bias repeated for the N predictions. It is convenient to fold b into the first term; effectively, it can be absorbed as an extra dimension of the samples (a column of ones), leading to D+1 columns.

Now our problem is solving the linear equation system: given Y and X, find W that satisfies Y = XW.
A solution is not guaranteed; a W that satisfies yn = wTxn for all n = 1, 2, . . . , N may not exist.
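
A small NumPy sketch of this packing (my own illustration with made-up shapes), including the trick of absorbing b as an extra column of ones:

```python
import numpy as np

N, D = 5, 3                              # made-up sizes: 5 observations, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(N, D))              # each row is one observation x_n
w = rng.normal(size=D)                   # one weight per feature
b = 0.5                                  # bias / intercept

Y = X @ w + b                            # predictions for all N samples at once

# Absorb the bias: prepend a column of ones, giving D+1 columns and W = [b, w1, ..., wD]
X_aug = np.hstack([np.ones((N, 1)), X])
W = np.concatenate([[b], w])
assert np.allclose(Y, X_aug @ W)
```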

Learning through cost optimization – squared error

As we saw earlier, the regression parameters are learned by optimizing a cost function. Since we do not know the underlying distribution P, we try to minimize the average loss over the observations, called the empirical loss. Using the least squares loss function, our objective is:

Given data \mathbf{\left \{ \left ( x_{n}, y_{n} \right ) \right \}_{n=1}^{N}}, find w such that \mathbf{L_{emp} = \sum_{n=1}^{N} \left ( y_{n}- w^{T}x_{n} \right )^{2}} is minimum.

The squared error loss function is convex and differentiable. We can find w by differentiating Lemp with respect to w and setting the gradient to zero.
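
In matrix form (my shorthand for the derivation; the references below walk through it step by step), setting the gradient to zero gives the normal equations:

\mathbf{\nabla_{W} L_{emp} = -2X^{T}\left ( Y - XW \right ) = 0 \;\;\Rightarrow\;\; X^{T}XW = X^{T}Y}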

This leads to our final solution:

\mathbf{W^{*} = \left ( X^{T}X \right )^{-1}X^{T}Y}

Honestly, this is the juncture where I start thinking about my priorities in life, like taking my dog out or checking whether my younger son is picking up his maths all right; come on, look at that equation! How inspiring! This time too, I got stuck for about two days, trying to figure out how this half-line formula could contain so much data.

Two days of trawling through resources did pay off. I think I have finally come to terms with this formula! If you struggle like me, here are quick references. 

  1. You need to see all the steps together to appreciate how much gets packed into this result; this video has a great explanation with simple visual connections, and a good refresher on matrix operations too.
  2. This one explains the derivation of the formula above; I learned that this approach is called the pseudo-inverse estimator.
  3. This course note has been such a relief. However hard I tried with the other sequences, there was still a feeling of disconnect; this note works out a bivariate regression in full, and it couldn’t have been a better explanation! If you wish, you can access more course notes here.
  4. Additional reference – this note takes a set of values and works out the matrices.

The above solution is a one-step learning solution, the pseudo-inverse estimator. The pseudo-inverse refers to (XTX)-1XT, which is computed from the input matrix X alone; applying it to Y recovers the optimal parameter vector. Note that this is a closed-form analytical solution.

XTX is a d × d matrix (where d is the dimension of the data); as the number of features grows large, it can be computationally very expensive to invert.
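
A NumPy sketch of the pseudo-inverse estimator (my own illustration with made-up data; in practice a solver such as np.linalg.lstsq or np.linalg.pinv is preferred over forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 100, 4
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])   # bias absorbed as a ones column
true_W = np.array([0.5, 2.0, -1.0, 0.0, 3.0])
Y = X @ true_W + 0.1 * rng.normal(size=N)                    # noisy made-up observations

# Closed-form solution W* = (X^T X)^{-1} X^T Y
W_hat = np.linalg.inv(X.T @ X) @ X.T @ Y

# Numerically safer equivalents
W_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
W_pinv = np.linalg.pinv(X) @ Y
print(np.allclose(W_hat, W_lstsq), np.allclose(W_hat, W_pinv))
```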

The need for regularization

Another drawback of this closed-form solution is that it uses all of the information in the training data, so the weights wi in W = [b, w1, . . . , wd] can become very large. Large wi values mean the model becomes very complicated and hence overfits the data. We therefore need to penalize large wi values through regularization.

With L2 (squared-norm) regularization, our objective becomes:

\mathbf{L_{reg} = \sum_{n=1}^{N}\left ( y_{n} - w^{T}x_{n} \right )^{2} + \lambda \left \| w \right \|^{2}}

And the closed-form solution is:

\mathbf{W^{*} = \left ( X^{T}X + \lambda I_{d} \right )^{-1}X^{T}Y}

This regularization helps because the individual components of w remain smaller; in other words, the model does not learn to give any single feature too much importance. Regularization is especially important when N is small and D is very large.

One more advantage of regularization: if XTX is not invertible (and hence the closed-form solution is not available), adding λId makes (XTX + λId) invertible.
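
A sketch of the regularized closed form (my own illustration; lam is the hyperparameter λ, and in practice the bias column is often left out of the penalty):

```python
import numpy as np

def ridge_closed_form(X, Y, lam):
    """W* = (X^T X + lam * I_d)^(-1) X^T Y  -- L2-regularized least squares."""
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)
    # Solving the linear system is cheaper and more stable than inverting A explicitly.
    return np.linalg.solve(A, X.T @ Y)

# Example (using X, Y from the previous sketch): W_ridge = ridge_closed_form(X, Y, lam=0.1)
```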

Gradient Descent Solution for Least Squares

In the case of high-dimensional data, it is prohibitively expensive to compute the inverse of XTX. Instead, we turn to a gradient descent method. In this method, we start with a random set of parameter values, then find the slope (gradient) of the empirical loss function with respect to each of the parameters. The direction of the slope tells us (hopefully) what adjustment needs to be made to reduce the loss.

The risk, in general, is that we may get stuck in a local optimum in a multidimensional space (though, as noted below, the squared loss here is convex, so this is not an issue for plain linear regression).

1. Start with an initial value w = w(0).

2. Update w by moving against the gradient of the loss function L (Lemp or Lreg).

For the squared error loss, with or without the L2 penalty, this becomes:

\mathbf{w^{\left ( t+1 \right )} = w^{\left ( t \right )} - \eta \nabla L = w^{\left ( t \right )} + 2\eta \left ( X^{T}\left ( Y - Xw^{\left ( t \right )} \right ) - \lambda w^{\left ( t \right )} \right )}

where η is the learning rate (step size); set λ = 0 for the unregularized case.

3. Repeat until convergence.

The squared loss function in linear regression is convex; if we add an L2 regularizer, the resulting function is strictly convex.
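
A NumPy sketch of this loop for the (optionally L2-regularized) squared loss; the learning rate eta and the iteration budget are hypothetical choices, not values from the course:

```python
import numpy as np

def gd_least_squares(X, Y, lam=0.0, eta=1e-4, n_iters=10000):
    """Gradient descent on L(W) = sum_n (y_n - W^T x_n)^2 + lam * ||W||^2."""
    W = np.zeros(X.shape[1])                                # step 1: initial value W^(0)
    for _ in range(n_iters):
        grad = -2.0 * X.T @ (Y - X @ W) + 2.0 * lam * W     # gradient of the loss at W
        W -= eta * grad                                     # step 2: move against the gradient
    return W                                                # step 3: stop after a fixed budget

# e.g. W_gd = gd_least_squares(X, Y, lam=0.1)   (X, Y as in the earlier sketches)
# With a small enough eta this approaches the closed-form solution above when lam = 0.
```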


Disclaimer:

I currently work full-time at Swiss Re, Bengaluru. The blogs and articles on this website www.balajos.com are the personal posts of myself, Balachandra Joshi, and contain only my personal views, thoughts, and opinions. They are not endorsed by Swiss Re (or any of my former employers), nor do they constitute any official communication of Swiss Re.

Also, please note that the opinions, views, comprehensions, impressions, deductions, etc., are my own takes on the vast resources I have been lucky to encounter. No individuals or entities, including the Indian Institute of Science and NSE Talent Sprint, who have shown me where to research, or the actuarial professional bodies that provide me continuous professional growth support, are responsible for any of these views; these musings do not by any stretch of imagination represent their official stands; they may not subscribe to, support, or confirm any of these views, and they cannot be held liable in any vicarious way. All of this information is shared in the public space purely to spread knowledge, without any commercial advantage.