Even though **actuaries** practice supervised learning, the profession feels **disconnected from the emerging AI/ML field**. Actuaries are trained to learn from data; it is supervised learning in the sense that burning costs and profitability are the primary drivers of the learning. This learning comes from specific datasets, the characteristics of which are relatively well understood. In AI/ML problems, the challenges aren’t this well defined. Moreover, unlike actuarial work, where many interacting factors come into play, the learning in ML comes only from the dataset at hand.

This part of the blog outlines supervised learning and explains the popular regression methods. More importantly, **learning by optimization** is detailed, where optimization is through a cost function. The concepts of regression, loss function, and regularization aren’t new (learned in CS subjects). What is different is the articulation of empirical loss and beginning to comprehend the curse of dimensionality.

Read on for a **rigorous definition of supervised learning** from a dataset, caveats, and challenges that ML practitioners need to keep in mind. Young actuaries must be aware of **these paradigms** as we will be stepping out of our well-defined workspace into a connected, data-driven world.

This is part of the learning log from the 3rd module, Machine Learning Basics for Real-World, from the certification course on Digital Health and Imaging. Links to the module learnings are below:

- Class 1 – Machine Learning – An actuary’s learning log
- Class 2 – ML Log – Learning Methods & Bayesian Decision insights
- Class 3 – ML Log – Rigorous explanation of Supervised Learning Approach
- Class 4 – ML Log – Pseudo-inverse estimator and gradient descent approaches to Linear Regression
- Class 5 & 6 – ML Log – Likelihoods and Bayesian Philosophy in Machine Learning

**Learning Summary – Class 3; 2 July 2022**

- Supervised Learning – Stating supervised learning suitably for maths and Mathematical Setting
- The objective of supervised learning
- Regression – note on linearity
- Popular regression methods – Simple/multiple linear regression, Polynomial regression, Bayesian linear regression, Support vector regression, Gaussian process regression
- Classification – Popular classification techniques
- Supervised Learning: Formal Definition – Dimension and the curse of dimensionality
- Foundational aspects of machine learning
- Loss Function
- Learning as an optimization – Empirical Risk
- Key takeaways – Learning as the optimization – What have we learned?

**Supervised Learning**:

We saw supervised learning involves developing models that learn a mapping between input examples and the target variable. The training data includes the input vectors and their corresponding result/ground truth vectors. In this case, learning is discovering an algorithm or a mapping that predicts the results on new examples beyond the training set.

Most MOOC courses and lighter introductions to AI/ML go a bit lighter on mathematical rigor. For example, you most likely will see a linear regression line with distances above and below marked very early and jump to squares of those distances to be minimized. This approach is more straightforward, but as I learn through this course, I realize that good mathematical rigor is necessary to get comfortable with the multidimensional world of mappings. The same goes for linear algebra; I found it hard to think of large data being packed and manipulated in matrices. My other growing interest is quantum computing; linear algebra feels even more intuitive there with probabilistic state-space and derived gates from those spaces.

**Stating supervised learning suitably for maths**:

Input for supervised learning is a feature vector ***x***, consisting of observed values of features, or measurable characteristics (numerical like the temperature or categorical like yes/no). For each input, an observed and recorded target feature or response is noted for a phenomenon of interest. This response is also called ground truth and is represented by ***y***. We are interested in this phenomenon to predict *y* given new observations of *x*.

- If *y* takes only two values (like tumor positive/tumor negative) or at most finitely many values (types of flowers), it is a *classification* problem.
- If *y* takes any real number (like temperature or stock price), it is a *regression* problem.
- The aim is to build a system *f* (or a function) in such a way that, given *x*, it predicts *y* as accurately as possible.

**Supervised Learning: Mathematical Setting**

**Input**: A set of labelled training examples *D = { (x _{1}, y_{1}), (x_{2}, y_{2}), . . . , (x_{N} , y_{N} ) }*

- Each *x*_{n} can be an image or a document or a time series.
- Each *x*_{n} itself is a *D*-dimensional vector.
- That is, each *x*_{n} is of the form *x*_{n} = (x_{n1}, x_{n2}, . . . , x_{nD}), where *x*_{n1}, x_{n2}, . . . , x_{nD} are called features of *x*_{n}.

Each ***x_{n}*** is represented either as *a vector* or *a column matrix*. Note the terminology: a row vector and a column matrix are equivalent ways of representing a series of numbers. I overlooked this notation initially; it took some time to comprehend further matrix operations.

**Output**: ***y_{n}*** denotes a label or ground-truth or response.
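To make the notation concrete, here is a minimal NumPy sketch of a labelled training set. The feature values and labels are made up for illustration; only the shapes matter (N examples, D features each, one label per example).

```python
import numpy as np

# Hypothetical toy dataset: N = 4 examples, each with D = 3 features.
X = np.array([
    [36.6, 1.0, 0.0],
    [38.2, 0.0, 1.0],
    [37.1, 1.0, 1.0],
    [39.0, 0.0, 0.0],
])                            # shape (N, D): one row per example x_n
y = np.array([0, 1, 0, 1])    # shape (N,): one label y_n per example

N, D = X.shape
x_2 = X[1]                    # the second example as a vector of length D
x_2_col = x_2.reshape(D, 1)   # the same numbers as a D x 1 column matrix

print(N, D, x_2.shape, x_2_col.shape)
```

Stacking the examples as rows of one matrix is a common convention, not the only one; it is what makes the later matrix operations compact.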

**The objective of supervised learning**

Stated mathematically, supervised learning is to discover or learn a function ***f_{θ}*** that:

- Closely **mimics** the examples in the training set, i.e., has low training error *(f_{θ}(x_{n}) ≈ y_{n})*.
- Generalizes to unseen examples, i.e., has low test error.

***θ*** refers to the **learnable parameters** of the function *f_{θ}*. The distinction between a parameter and a variable is subtle; see here for a shorter explanation.

A function in math is a rule giving a unique output for every input x. Mapping or transformation denotes a function in math. The set of all the values that can be input into the function is defined as the domain. The co-domain is the set in which the outputs are allowed to lie, and the range is the subset of the co-domain made up of the values that actually come out as output.

**Regression**

The objective of Regression is to learn a function mapping input features ***x*** to a **scalar** target ***y***. In practice, a model is first selected by the researcher, and a method (e.g., ordinary least squares) is then used to estimate the best parameters of that model.

A more straightforward form of ***f_{θ}*** assumes that it is linear in *θ*.

Examples are predicting:

- The temperature in a room based on other physical measurements
- location of gaze using an image of an eye
- the remaining life expectancy of a person based on current health records
- return on investment based on the market status

**Note on linearity**:

A linear equation is an equation of the form ***a_{1}x_{1} + … + a_{n}x_{n} + b = 0***, where *x*_{1}, …, *x*_{n} are variables (or unknowns), and *a*_{1}, …, *a*_{n} are the coefficients (or parameters), which are often real numbers.

In the case of two variables, each solution is the coordinates (Cartesian) of a point of the plane (Euclidean). The solutions of a linear equation form a line in the plane; conversely, every line can be viewed as the set of all solutions of a linear equation in two variables, hence the term linear for describing this type of equation.

More generally, the solutions of a linear equation in ***n*** variables form a hyperplane (a subspace of dimension *n − 1*) in the Euclidean space of dimension n.

**Popular regression methods**

If a single independent variable is used linearly, it is **Simple Linear Regression**. If more than one independent variable is used in linear form, it is called **Multiple Linear Regression**.
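As a rough illustration of multiple linear regression, the sketch below fits two independent variables with ordinary least squares using `numpy.linalg.lstsq`. The data is simulated, and the coefficients, noise level, and variable names are assumptions made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (assumed for illustration): y = 0.5 + 2*x1 - 1*x2 + noise.
N = 200
X = rng.normal(size=(N, 2))
true_w = np.array([2.0, -1.0])
true_b = 0.5
y = X @ true_w + true_b + 0.1 * rng.normal(size=N)

# Prepend a column of ones so the intercept is learned as one more parameter.
X_design = np.column_stack([np.ones(N), X])

# Ordinary least squares: minimizes the sum of squared residuals.
theta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print("intercept and coefficients:", theta)
```

Dropping the second column of `X` turns the same code into simple linear regression.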

**Polynomial regression:** In linear regression, relations concerning multiple values are of interest. For weight loss, hours spent at the gym, sugar intake, and so on can be considered. In the case of *polynomial* regression, the impact of multiple different powers of one variable is considered (***x, x^{2}, x^{3}***, and so on, where *x* is the age in predicting mortality that most actuaries are familiar with).

Polynomial regression models are prone to overfitting. With parameters, one can fit anything. As John von Neumann reportedly said: *“with four parameters, I can fit an elephant; with five, I can make him wiggle his trunk.”*
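The overfitting tendency can be seen in a small experiment. This is a sketch on simulated data (a noisy sine curve, chosen arbitrarily): as the polynomial degree grows, the training error can only go down, while the error on fresh data need not follow.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    # Assumed ground truth for the simulation: a sine curve plus noise.
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(np.pi * x) + 0.2 * rng.normal(size=n)
    return x, y

x_train, y_train = make_data(12)    # small training set
x_test, y_test = make_data(200)     # fresh data from the same process

def train_test_mse(degree):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    train = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train, test

results = {d: train_test_mse(d) for d in (1, 3, 9)}
for d, (train, test) in results.items():
    print(f"degree {d}: train MSE {train:.4f}, test MSE {test:.4f}")
```

Comparing the two columns for the highest degree is the point of the exercise: a low training error on its own says little about new data.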

**Bayesian linear regression** formulates linear regression using probability distributions rather than point estimates. The response, ***y***, is assumed to be drawn from a probability distribution instead of being a single value. The aim is to determine the posterior distribution for the model parameters.
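For one standard special case the posterior has a closed form. The sketch below assumes a zero-mean Gaussian prior on the weights with precision `alpha` and Gaussian noise with known precision `beta`; these modelling choices, the simulated data, and all names are assumptions for illustration, not something the course prescribes.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data: intercept plus one feature, with assumed "true" weights.
N = 50
X = np.column_stack([np.ones(N), rng.uniform(-1, 1, size=N)])
true_w = np.array([0.3, -0.7])
noise_std = 0.2
y = X @ true_w + noise_std * rng.normal(size=N)

alpha = 1.0                   # prior precision: w ~ N(0, I / alpha)
beta = 1.0 / noise_std ** 2   # noise precision, assumed known

# Conjugate Gaussian update: posterior over w is N(m, S).
S_inv = alpha * np.eye(X.shape[1]) + beta * X.T @ X
S = np.linalg.inv(S_inv)      # posterior covariance
m = beta * S @ X.T @ y        # posterior mean

print("posterior mean:", m)
print("posterior std :", np.sqrt(np.diag(S)))
```

The output is a distribution over the parameters: the mean plays the role of the point estimate, and the standard deviations say how uncertain each weight still is.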

**Support vector regression** allows the flexibility to set how much error is acceptable in the regression model and finds an appropriate line (or hyperplane) to fit the data.
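In the usual formulation of support vector regression, the acceptable error is expressed through the epsilon-insensitive loss: residuals smaller than a tolerance `epsilon` cost nothing. A minimal sketch of that loss, with made-up numbers:

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.5):
    # Errors inside the tube of half-width epsilon are ignored;
    # beyond it, the loss grows linearly with the excess.
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.2, 2.0, 4.0, 2.0])

losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.5)
print(losses)  # [0.  0.  0.5 1.5]
```

This shows only the loss, not a full SVR fit, which also involves a regularization term and an optimizer.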

**Gaussian process regression**: Gaussian process regression is nonparametric (i.e., not limited by a functional form). GPR calculates the probability distribution over all admissible functions that fit the data instead of a specific function; it is a newer and happening Bayesian approach to regression. It has several benefits, including working well on small datasets and providing uncertainty measurements on the predictions.

**Classification**

The objective of Classification is to learn a function that maps input features ***x*** to one of the ***k*** classes. The classes may be (and usually are) unordered.

Classification Examples:

- Classifying images based on objects being depicted
- Classifying market conditions as favorable or unfavorable
- Classifying pixels based on membership to object/background for segmentation
- Predicting the next word based on a sequence of observed words

**Popular classification techniques**

**Logistic regression** is used for binary classification problems and is named for the function it uses, the logistic function. The logistic function (or the sigmoid function) was developed to describe the properties of population growth in ecology, rising quickly and maxing out at the environment’s carrying capacity. It is an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never precisely at those limits.

The logistic function returns a real number, just as regression does; it becomes a classification technique only when a decision threshold is brought into the picture, for example, predicting true if the value is greater than 0.6. The choice of threshold is driven largely by the precision and recall one needs.
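The two steps, squashing a real number into (0, 1) and then applying a threshold, can be sketched as follows. The input scores are arbitrary; 0.6 is the example threshold from the text.

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps any real number into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def classify(z, threshold=0.5):
    # Classification only happens here, when the threshold is applied.
    return sigmoid(z) > threshold

scores = np.array([-3.0, 0.0, 0.3, 2.0])   # arbitrary real-valued model outputs
probs = sigmoid(scores)

labels_05 = classify(scores, threshold=0.5)
labels_06 = classify(scores, threshold=0.6)

print(np.round(probs, 3))
print(labels_05)
print(labels_06)
```

The score 0.3 maps to a probability between 0.5 and 0.6, so it is labelled true under the first threshold and false under the second, which is exactly where the precision/recall trade-off comes in.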

**Logistic regression** can be classified as:

- **binomial**: Only two possible values of the target variable: 0/1 – representing “win” vs “loss”, “pass” vs “fail”, “dead” vs “alive”, etc.
- **multinomial**: Three or more possible values of the target variable that are not ordered (i.e., types have no quantitative significance) like “disease A” vs. “disease B” vs. “disease C”.
- **ordinal**: Target variables with ordered categories. For example, a test score can be categorized as: “very poor”, “poor”, “good”, and “very good”. Each category is given a score like 0, 1, 2, 3.

Other methods are *Random forests,* *Bayesian logistic regression, Support vector machines, Gaussian process classification, and Neural networks*.

**Supervised Learning: Formal Definition**

Problem: Given the data *{(x_{n}, y_{n})}^{N}_{n=1}*, the aim is to find a function *f : X' → Y'* that approximates the relation between *X* and *Y*.

**X** and **Y** are random variables; ***X'*** and ***Y'*** denote the sets from where X and Y take values.

The data ***{ (x_{n}, y_{n}) }^{N}_{n=1}*** is the set of observations of some natural phenomenon. Since the data can be noisy, *random variables* are used to include possible errors and variations.

For example, ***x_{1}, x_{2}, . . . , x_{N}*** denote medical images, and *y*_{1}, y_{2}, . . . , y_{N} represent the ground-truth diagnosis, say −1 or +1. The scanner itself may introduce noise, and doctors can make some mistakes in their diagnosis.

**Dimension** is the size of the input data, i.e., of *x*_{n}; we denote this by ***D***. We write *x*_{n} = (x_{n1}, . . . , x_{nD}) ∈ ℝ^{D}.

In some applications the dimension of each sample can vary; for example, sentences in a text or protein sequence data.

What about the response y? The dimension of ***y*** is much less than that of *x*.

The **curse of dimensionality**: The phrase, attributed to Richard Bellman, expresses the difficulty of using brute force (or grid search) to optimize a function with too many input variables.

In machine learning, a large number of features runs the risk of overfitting the model, resulting in poor performance on new data. With many features, observations become harder to cluster: they appear nearly equidistant, and no meaningful clusters can be formed. Also, the sample data cannot keep up at higher dimensions and thus becomes sparse, losing statistical significance.
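The "everything looks equidistant" effect can be checked numerically. The sketch below draws uniform random points (an assumption made purely for the demonstration) and measures the relative contrast between the farthest and nearest neighbour of a query point as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(3)

def relative_contrast(dim, n_points=500):
    # Random points and one random query point in the unit cube [0, 1]^dim.
    points = rng.uniform(size=(n_points, dim))
    query = rng.uniform(size=dim)
    dists = np.linalg.norm(points - query, axis=1)
    # (farthest - nearest) / nearest: near zero means "all equally far".
    return (dists.max() - dists.min()) / dists.min()

for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

In low dimensions the nearest point is far closer than the farthest one; in high dimensions the ratio shrinks, which is why distance-based clustering loses its grip.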

Imaging problems usually will have very high dimensional data. For example, a 0.48-megapixel grayscale picture with 800 x 600 pixels is represented as a column vector of 480,000 entries, each representing an intensity value between 0 and 255.
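The arithmetic of that example is easy to reproduce. The sketch uses a blank placeholder image, since only the shape matters here.

```python
import numpy as np

height, width = 600, 800
# Placeholder grayscale image: uint8 stores intensities from 0 to 255.
image = np.zeros((height, width), dtype=np.uint8)

x = image.reshape(-1, 1)   # stack all pixels into one column vector

print(x.shape)  # (480000, 1)
```

A single such image is one point in a 480,000-dimensional space, so even thousands of images cover that space very sparsely.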

**Foundational aspects of machine learning**

The assumption behind the statistical approach to Machine Learning is that data is sampled from an underlying probability distribution.

We have obtained N samples ***x_{1}, . . . , x_{N}***, each with D dimensions (refer above). We assume that there is a hypothetical underlying distribution ***P*** from which these samples are drawn.

The ML problem is that *we do not know this distribution*.

Some machine learning algorithms estimate this distribution assuming some underlying distribution (**generative** approach). Other approaches try to solve problems without estimating this distribution (**discriminative** approach).

Both approaches predict the conditional probability P(prediction | features) but learn from different probabilities. The discriminative approach models the decision boundary between the classes, while the generative approach explicitly models the underlying distribution of each class. The difference is that the generative model learns the joint probability distribution ***p(x,y)*** and uses Bayes' Theorem to predict the conditional probability; in contrast, the discriminative model directly learns the conditional probability distribution ***p(y|x)***.

**Loss Function**

What is the guiding mechanism that ensures acceptable “learning”? First, we need some way to assess how good our predictions are for new input to our learned model. Loss functions define what a good prediction is and isn’t; choosing the proper loss function dictates how good the estimator will be.

The statistical models are evaluated on their performance – how accurate the model’s decisions are. Loss functions measure how far an estimated value is from its true value by mapping the decisions to their associated costs. Loss functions depend on the problems and the goals to be met.

Some of the loss functions are: *Mean Absolute Error (MAE), Mean Squared Error (MSE), Mean Bias Error (MBE, which takes the actual difference between the target and the predicted value, not the absolute difference), Mean Squared Logarithmic Error (MSLE), Huber Loss, Binary/Categorical Cross Entropy Loss, Hinge Loss, Kullback Leibler Divergence Loss.*

Let ***l(y, f(x))*** denote the loss when ***x*** is mapped to ***f(x)***, while the actual value is ***y***.

To reiterate:

- *l* and *f* are specific to the problem and the method; choosing them is the machine learning practitioner’s call.
- For example, *l(.)* can be a squared loss and *f(x)* a linear function, i.e., *f = w^{T}x*.
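A few of these losses are short enough to write out. This is a minimal sketch with made-up targets and predictions; the Huber threshold `delta` and the weight vector `w` are arbitrary illustrative values.

```python
import numpy as np

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def huber(y, y_hat, delta=1.0):
    # Quadratic for small residuals, linear for large ones.
    r = np.abs(y - y_hat)
    return np.mean(np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta)))

def f_linear(w, x):
    return w @ x                    # f(x) = w^T x

def squared_loss(y, prediction):
    return (y - prediction) ** 2    # l(y, f(x)) for a single example

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.5, 2.0, 5.0])
print(mae(y, y_hat), mse(y, y_hat), huber(y, y_hat))

w = np.array([1.0, 2.0])
x = np.array([3.0, 0.5])
print(squared_loss(5.0, f_linear(w, x)))  # (5 - 4)^2 = 1.0
```

Note how the one large residual dominates the MSE far more than the MAE or the Huber loss; that sensitivity to outliers is one reason the choice of loss depends on the problem.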

**Learning as an optimization**

Our objective is to learn by optimization of the cost function.

**Function optimization** tries to find a set of inputs to a target objective function that results in the minimum or maximum of the function. As the unknown function may have very many inputs and is often non-differentiable and noisy it usually is a difficult challenge.

**Objective**: Given a loss function *l*, the machine learning aim is to find ***f*** such that

*L(f) = E_{(x,y)∼P} [ l(Y, f(X)) ]*

is minimum.

- *L* is the true loss or *expected* loss or Risk.
- Here *X* and *Y* are random variables; our assumption is that these are generated from a joint distribution *P(X, Y)*.

**Empirical Risk**

But as we do not know the underlying distribution ***P*** (if we knew it, there would be no problem to learn!), we cannot compute the true loss. The expected loss *L* cannot be determined without knowing ***P***.

What we have is the set of observations or samples drawn from P. If we minimize the ***average*** loss over these observations, called ***Empirical Loss***, we hope to minimize the true risk too.

Hence we restate our objective as follows:

Find a set of parameters for the learned function to minimize the empirical loss.

We need reasonably many samples so that the Empirical Risk is close to the True Risk, or that the empirical loss is closer to the true loss.

The risk here is that the parameters that minimize the empirical risk may well overfit the model, as our guide is the average loss over only the observed samples. Our model will predict new observations closely, with minimum error cost, only if the samples are numerous enough to represent the underlying distribution well.
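A simulation makes the link between empirical and true risk tangible. In the sketch below the distribution P is invented, so the true risk of a fixed candidate function can be worked out by hand; in a real problem P is unknown and only the empirical side is available.

```python
import numpy as np

rng = np.random.default_rng(4)

def sample(n):
    # Made-up joint distribution P(X, Y): X ~ N(0, 1), Y = 2X + N(0, 1) noise.
    x = rng.normal(size=n)
    y = 2.0 * x + rng.normal(size=n)
    return x, y

def f(x):
    return 1.5 * x    # a fixed candidate function, deliberately not the best one

def empirical_risk(n):
    # Average squared loss over n observed samples.
    x, y = sample(n)
    return np.mean((y - f(x)) ** 2)

# True risk under squared loss: E[(0.5*X + noise)^2] = 0.25 + 1 = 1.25.
TRUE_RISK = 0.5 ** 2 + 1.0

for n in (10, 1_000, 100_000):
    print(n, empirical_risk(n), "true risk:", TRUE_RISK)
```

With few samples the average can land well away from 1.25; with many it settles close to it, which is the sense in which we need "reasonably many samples".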

**Generalizing Capacity**

Generalization capacity is the ability of the trained model to predict from new data. If a model has been trained too well on training data, it may not be able to generalize, making inaccurate predictions when given new data; this is called overfitting.

Overfitting happens because the model is trying too hard to capture the noise in the training dataset. As mentioned above, the input data or the set of observed samples will have some noise, random variations, or even plain errors. In the process of finding the least empirical loss, the model may try to fit the parameters to learn the noise or mistakes too. The overfitting takes the form of a complex learned function ***f***. In regression models, this complexity shows up as larger coefficients.

Regularization tries to avoid overfitting by keeping coefficients close to zero. Intuitively, smaller coefficients mean that the function the model represents is simpler and less unsteady. In other words, the regularization prevents the model from trying too hard to accommodate random variations in data, which would require complex formations.

Regularization can be as simple as shrinking or penalizing large coefficients — often called weight decay. L1 and L2 regularization are two widely used methods. More information is here and here.

In general:

**f\*** is the function with a set of parameters that minimizes the empirical loss plus the regularization term ***λR(f)***:

*f\* = argmin_{f} (1/N) Σ_{n=1}^{N} l(y_{n}, f(x_{n})) + λR(f)*

***λ*** controls how much regularization one needs, and ***R*** measures the complexity of ***f***.

This is regularized risk minimization, trying to achieve

- Small empirical error on training data, and at the same time,
- *f* needs to be simple.

There is a trade-off between these two goals; ***λ*** is a hyperparameter that controls the balance between them.
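For a linear model with squared loss and an L2 penalty (ridge regression), the regularized empirical risk has a closed-form minimizer, which makes the effect of λ easy to see. The simulated data, the choice R(f) = ||w||², and the λ values below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated data with assumed "true" weights, two of which are zero.
N, D = 30, 5
X = rng.normal(size=(N, D))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + 0.5 * rng.normal(size=N)

def regularized_risk(w, X, y, lam):
    # Empirical risk (mean squared loss) plus lambda * R(f), with R(f) = ||w||^2.
    return np.mean((X @ w - y) ** 2) + lam * np.sum(w ** 2)

def ridge_fit(X, y, lam):
    # Setting the gradient of regularized_risk to zero gives
    # (X^T X + N * lam * I) w = X^T y.
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

lambdas = (0.0, 0.1, 10.0)
weights = [ridge_fit(X, y, lam) for lam in lambdas]
norms = [np.linalg.norm(w) for w in weights]

for lam, w, norm in zip(lambdas, weights, norms):
    print(f"lambda={lam}: ||w||={norm:.3f}, w={np.round(w, 3)}")
```

As λ increases, the coefficients are pulled toward zero: the fit to the training data gets worse, but the function gets simpler, which is the trade-off described above.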

**Key takeaways – Learning as the optimization**

The Machine Learning problem is the optimization problem “find ** f** such that . . . “

How do we approach the problem given a set of sampled data?

- We choose *f* depending upon the problem to solve – the choice of *f* cannot be from an arbitrary set.
- We fix *F*: the set of all possible functions that describe the relation between *X* and *Y*, given training data *{(x_{n}, y_{n})}^{N}_{n=1}*.
- For example, if *F* is the set of all linear functions, then we call it linear regression.
- Next, we choose a loss function, and our objective is to find the parameters of ***f*** such that the regularized empirical risk, as represented by the above equation, is minimum.

**What have we learned?**

- **Beware!** There are some underlying assumptions and approximations one needs to be comfortable with.
- There is no rule book. ML practitioners must make some decisions while designing the algorithms and methods or try different approaches before deciding on a suitable one.
- The challenge is that our algorithms should work well on unseen data and continue to do so.

How do we evaluate the performance of ML algorithms? The evaluation depends upon the problem to be solved.

**Disclaimer**:

*I currently work full-time at Swiss Re, Bengaluru. The blogs and articles on this website **www.balajos.com** are the personal posts of myself, Balachandra Joshi, and only contain my personal views, thoughts, and opinions. It is not endorsed by Swiss Re (or any of my formal employers), nor does it constitute any official communication of Swiss Re.*

*Also, please note that the opinions, views, comprehensions, impressions, deductions, etc., are my takes on the vast resources I am lucky to have encountered. No individuals or entities, including the Indian Institute of Science and NSE Talent Sprint who have shown me where to research, or the actuarial professional bodies that provide me continuous professional growth support, are responsible for any of these views; these musings do not by any stretch of imagination represent their official stands; and they may not subscribe to, support, or confirm any of these views and hence cannot be held liable in any vicarious way. All the information in the public space is shared to spread knowledge without any commercial advantage.*