Actuaries get acquainted with likelihoods and Bayesian philosophy early, in the subject **CS1**. Most actuarial problems estimate the financial consequences of future risk events and invariably start with observing risk-event samples.

In classical statistics, the parameter set ***θ*** is fixed but unknown, and it is estimated under the long-run relative frequency interpretation of probability. Under this approach, measures such as confidence intervals describe how the estimation procedure behaves over repeated samples; they do not express a degree of belief about which parameter value is most probable given the observations. A frequentist would not accept a parameter as a random variable because randomness, for a frequentist, is associated with variation in replicated observations (and a parameter is neither observed nor can it vary).

This luxury of having long-run relative frequency is rarely available to actuaries. Also, the best-estimate outcomes of future events that form the foundation of actuarial work need more robust assumptions. The Bayesian approach overcomes the rigidity by treating the parameters ***θ*** as random variables. A Bayesian sees a parameter as a random variable because, for a Bayesian, randomness is “lack of knowledge” or “uncertainty in judgment”. This philosophy accommodates unseen data, observable but un-replicable events, or unobservable things in parameters.

Bayesian philosophy is so ingrained in our actuarial life that concepts like the credibility factor or the actuarial control cycle reflect this adaptive thinking. Understanding how these two philosophies are handled in a more data-driven way in Machine Learning will help actuaries bring some of these developing techniques back into actuarial practice.

This blog summarizes class learnings – rephrasing the linear regression into the **likelihood** and **Bayesian** settings, and demonstrating that under certain assumptions they yield the same solutions. Logistic regression is defined and shown to produce similar comparisons. Finally, **optimization** as a process is introduced, closing with the concept of the *perceptron algorithm*.

This is part of the learning log from the 3rd module, Machine Learning Basics for Real-World, from the certification course on Digital Health and Imaging. Links to the module learnings are below:

- Class 1 – Machine Learning – An actuary’s learning log
- Class 2 – ML Log – Learning Methods & Bayesian Decision insights
- Class 3 – ML Log – Rigorous explanation of Supervised Learning Approach
- Class 4 – ML Log – Pseudo-inverse estimator and gradient descent approaches to Linear Regression
- Class 5 & 6 – ML Log – Likelihoods and Bayesian Philosophy in Machine Learning

**Learning Summary – Class 5 & 6; 9 July 2022**

**Section 1 – Probabilistic way to estimate parameters**

- Ordinary Least Square approach (OLS) v/s Maximum Likelihood Estimation (MLE)
- The mathematical setting of MLE problem
- Maximum Likelihood Estimator
- Bayesian framework – Maximum a Posteriori (MAP) Estimate

**Section 2 – Regression in the probabilistic setting**

- Linear Regression – MLE estimation with Gaussian errors
- Linear Regression – Bayesian point estimate with a Prior
- Logistic Regression – Sigmoid function
- Logistic Regression in MLE approach
- Multiclass Logistic or Softmax Regression

**Section 3 – Optimization**

- Gradient Descent
- Logistic Regression through gradient descent
- Stochastic gradient descent
- Hyperplane based classification – The Perceptron Algorithm (Rosenblatt, 1958)

*Section 1 – Probabilistic way to estimate parameters*

**Probabilistic view of Linear Regression – Maximum Likelihood Estimation (MLE)**

**Regression Recap**: Linear Regression is a model that maps one or more numerical inputs to a numerical output. The model is defined in terms of parameters or coefficients (beta). The model can also be described using linear algebra, with a vector for the coefficients (beta), a matrix for the input data (***X***), and a vector for the output (***Y***). The betas describe the underlying model or distribution; the regression is about finding the betas from the randomly observed samples from the distribution.

Note that the linearity is only with respect to the betas; the dependent variable can relate linearly, quadratically, cubically, etc., to the features. We covered this in the part on basis function expansion.

**Estimating the regression betas**

Two common ways to estimate the parameters are Ordinary Least Squares (**OLS**) optimization and Maximum Likelihood Estimation (**MLE**). We covered **OLS** in the last part; it also provides a closed-form analytical solution. As the name suggests, **MLE** is a method for estimating the parameters of an *assumed probability distribution*. The likelihood function measures the goodness of fit of the assumed model on the sampled data for given values of parameters. The parameters are estimated by maximizing the likelihood function, so that the observed data is as probable as possible under the model.

Both are optimization procedures that involve searching for different model parameters. Various optimization algorithms like the BFGS algorithm (or variants) and general optimization methods like stochastic gradient descent are used. The linear regression model is a particular case where an analytical solution also exists, as shown in the last part. A deep (very deep!) analysis of the connection between OLS and MLE highlights that the normality and independent and identically distributed error assumptions are the “bridge” between **OLS** and **MLE**. More specifically, if we have a linear additive model, and our ***n*** error terms are all normally distributed and *i.i.d.*, then the Maximum Likelihood Estimator mathematically reduces to the **OLS** estimator.

The **MLE** is the parameter value for which the observed data is **most likely**. Assuming that the observed samples are independent, the likelihood is simply the product of the individual probabilities of the observed values. To evaluate this joint probability, a probability model is specified depending on the kind of data, such as the binomial, Poisson, exponential, or normal distribution. The **MLE** is obtained by varying the parameter of the distribution model until the highest likelihood is found. We can do this through a grid search or, more analytically, by setting the derivative of the likelihood with respect to the parameter to zero. The parameter value obtained this way is called the **MLE**.

The problem with the joint likelihood is that it is extremely small. With many data points, the likelihood is a product of many probabilities, each less than 1. Such values, very close to zero, are hard for computers to represent. Hence, we use the **log-likelihood** (which also turns the ugly product of probabilities into a simple sum). The position of the maximum does not change under such a monotone transformation. In the special case where the normal distribution is taken as the probability model, the log-likelihood is, up to an additive constant, proportional to the negative sum of the squared residuals.
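To make the underflow point concrete, here is a minimal sketch in Python. The data (2,000 identical observations at 1.0) and the standard normal model are made-up assumptions chosen only to force the effect; `normal_pdf` is a helper written for this illustration.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Made-up sample: 2,000 observations, all equal to 1.0
data = [1.0] * 2000

# Direct product of the individual densities (each is about 0.242)
likelihood = 1.0
for x in data:
    likelihood *= normal_pdf(x)

# Sum of the log densities instead of the product
log_likelihood = sum(math.log(normal_pdf(x)) for x in data)

print(likelihood)      # the product underflows to 0.0 in double precision
print(log_likelihood)  # a finite negative number (about -2837.9)
```

The product is useless for comparing parameter values once it collapses to zero, while the sum of logs stays well within floating-point range.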

See here for a good visualization of how likelihood aligns with the best regression line.

**Mathematical setting of the regression problem**

$X = \{x_1, x_2, \dots, x_N\}$, where $x_n \in \mathbb{R}^d$, is the data generated from $x_n \sim P(x \mid \theta)$.

In the statistical approach to machine learning, we assume that there is an underlying probability distribution from which the data is sampled. Hence ***θ*** denotes the parameters of the distribution (for example, if samples are taken from Normally distributed data, then $x_n \sim \mathcal{N}(x \mid \mu, \sigma)$, where $\theta = (\mu, \sigma)$).

**Assumption**: The data in ***X*** is generated *i.i.d.* (independent and identically distributed).

**Aim**: Learn ***θ*** given the data $X = \{x_1, x_2, \dots, x_N\}$.

**The Likelihood**:

As we assume an underlying distribution, we can estimate the **joint probability** of the observed data as a function of the parameters of the statistical model. This joint probability density function, viewed as a function of the parameters, is called the likelihood. For each specific parameter value ***θ*** in the parameter space, the likelihood function $p(X \mid \theta)$ therefore assigns a probabilistic prediction to the observed data ***X***.

Two random variables ***X, Y*** are independent if $P(X, Y) = P(X)P(Y)$; hence, with the *i.i.d.* assumption, the likelihood becomes just the multiplication of probability functions for each of the sampled observations.

**Hence the Solution:**

Learn ***P*** so that the likelihood of $X = \{x_1, x_2, \dots, x_N\}$, which are sampled from ***P***, is maximum.

This is equivalent to estimating ***θ*** so that the likelihood is maximum. Because of the *i.i.d.* assumption:

**Likelihood**:

$$P(X \mid \theta) = P(x_1, x_2, \dots, x_N \mid \theta) = P(x_1 \mid \theta)\, P(x_2 \mid \theta) \cdots P(x_N \mid \theta) = \prod_{n=1}^{N} P(x_n \mid \theta)$$

Hence, the likelihood solution finds the value of ***θ*** that makes the observed data **most probable**. Furthermore, taking logs of the likelihood helps as the log-likelihood function becomes simpler. This log transformation works because it is a monotonic transformation that preserves the maximum.

Hence, the solution is: find ***θ*** that maximizes the log-likelihood function.

$$L = \log P(X \mid \theta) = \log P(x_1 \mid \theta) + \log P(x_2 \mid \theta) + \dots + \log P(x_N \mid \theta)$$
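As a small illustration of maximizing the log-likelihood, the sketch below estimates the mean of a normal distribution by grid search and compares it with the closed-form answer (the sample mean). The simulated data, the known standard deviation of 1, and the grid range are all assumptions made for this example.

```python
import math
import random

random.seed(0)
true_mu, sigma = 3.0, 1.0
# Simulated sample; in practice this would be the observed data
data = [random.gauss(true_mu, sigma) for _ in range(500)]

def log_likelihood(mu, xs, sigma=1.0):
    """Sum of log N(x | mu, sigma^2) over the sample xs."""
    const = -0.5 * math.log(2.0 * math.pi * sigma ** 2)
    return sum(const - 0.5 * ((x - mu) / sigma) ** 2 for x in xs)

# Grid search: candidate means from 0.00 to 6.00 in steps of 0.01
grid = [i / 100.0 for i in range(0, 601)]
mu_grid = max(grid, key=lambda mu: log_likelihood(mu, data))

# Closed form: setting the derivative of the log-likelihood to zero
# gives the sample mean
mu_closed_form = sum(data) / len(data)

print(mu_grid, mu_closed_form)
```

The grid answer can only be as precise as the grid spacing, which is one reason the analytical route is preferred whenever the derivative can be solved.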

You can find some examples of likelihood estimators for often used distributions here and here (the second one has examples with data).

**Bayesian framework – Maximum a Posteriori (MAP) Estimate**

In Linear Regression, we followed the frequentist approach, where the unknown quantity ***θ*** is assumed to be a fixed (non-random) quantity that is to be estimated by the observed data. In the Bayesian framework, we treat the unknown quantity, ***Θ***, as a random variable. More specifically, we have some initial guesses about the distribution of ***Θ***. This distribution is called the prior distribution. After observing some data, we update the distribution of ***Θ***, based on the observed data.

We have **Bayes Rule**:

$$\underbrace{P(\theta \mid X)}_{\text{Posterior}} = \frac{\overbrace{P(X \mid \theta)}^{\text{Likelihood}} \; \overbrace{P(\theta)}^{\text{Prior}}}{\underbrace{P(X)}_{\text{Evidence}}}$$

***θ*** is a Random Variable; we have some knowledge of ***θ*** as $P(\theta)$. We will see later that $P(\theta)$ acts as a regularizer.

Compared to **MLE**, where we maximized the likelihood, the idea behind **MAP** is to maximize the posterior distribution of ***θ***, or $P(\theta \mid X)$. Hence, we define the **MAP** estimate $\theta^{*}_{MAP}$ of ***θ*** as the parameter that maximizes the posterior distribution of ***θ*** given the data.

This presentation has examples of MAP estimates when the underlying pattern is assumed to be from different distributions.

Some things to be noted:

- As the number of samples goes to infinity, **MLE** and **MAP** become equal. This is because the prior is important when we have less data, but as we collect more and more data, the evidence overwhelms the prior.
- When $P(\theta)$ is a uniform distribution, **MAP** reduces to **MLE**. This is to be expected, as a uniform prior doesn’t provide any helpful information.
- Which is better? It depends. If you are a frequentist who believes one should make a judgment only from what is observed and don’t like putting any prior belief on anything, then MLE provides a way to estimate the parameter as a fixed quantity. If you are a Bayesian, you would prefer using prior knowledge and choose to estimate the parameter as a random variable, selecting the mode as the best representative. But with enough data, the prior washes out, and the evidence prevails.
- Both **MLE** and **MAP** return a single fixed value; it is good to note that full Bayesian inference returns a probability density (or mass) function. In some cases, a point estimate such as **MAP** may not be enough; Bayesian inference provides much more information in those cases. However, the Bayesian estimate also has a drawback: the complexity of its integral computation.
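The first two notes can be checked on a coin-flip style example. The sketch below is an illustration, not part of the class material: it assumes a Bernoulli likelihood with a Beta(a, b) prior, for which the MAP estimate is the mode of the Beta posterior. The function names and the numbers are hypothetical.

```python
def mle_bernoulli(successes, trials):
    """MLE of the success probability: the observed proportion."""
    return successes / trials

def map_bernoulli(successes, trials, a, b):
    """MAP estimate under a Beta(a, b) prior: the mode of the
    Beta(a + successes, b + trials - successes) posterior.
    Assumes a >= 1 and b >= 1 so that the mode formula applies."""
    return (successes + a - 1) / (trials + a + b - 2)

# With a uniform prior, Beta(1, 1), MAP coincides with MLE
print(mle_bernoulli(3, 10), map_bernoulli(3, 10, a=1, b=1))

# With an informative prior, Beta(2, 8), the gap shrinks as data accumulates
gaps = []
for n in (10, 100, 10000):
    k = 3 * n // 10  # 30% observed successes at every sample size
    gaps.append(abs(mle_bernoulli(k, n) - map_bernoulli(k, n, a=2, b=8)))
print(gaps)
```

The prior pulls the estimate noticeably at n = 10 and barely at all at n = 10,000, which is the "prior washes out" behaviour described above.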

**Section 2 – Regression in the probabilistic setting**

**Linear Regression – MLE estimation with Gaussian errors**

In a probabilistic setting of the linear regression problem, we assume that each response in the observed training sample is generated by a linear model plus Gaussian noise, or ***Y = W^{T}X + E***. The probabilistic nature comes from the Gaussian error term.

Here:

- We have ***N*** training samples $\{(x_n, y_n),\ n = 1, 2, \dots, N\}$, drawn *i.i.d.*, with $x_n \in \mathbb{R}^d$ (d features) and $y_n \in \mathbb{R}$.
- The errors $\epsilon_n \sim \mathcal{N}(0, \sigma^2)$ have mean ***0*** and variance $\sigma^2$ – the difference between the observed and the estimated value is purely due to random Gaussian noise.
- $y_n \sim \mathcal{N}(w^T x_n, \sigma^2)$. Note the mean: for each observed sample $x_n$, we expect the response $y_n$ to have the mean $w^T x_n$.

Given this set-up, the probability distribution of ** Y** becomes (Remember, the below equation is in terms of matrices!):

Since these are independent and identical, the likelihood becomes the product of Normal distribution probability densities, or when taken as a log, becomes a sum.

Maximizing the likelihood is equivalent to minimizing the negative log likelihood; in other words, the solution is the set of parameters ** W** that leads to the negative-log-likelihood minimum.
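Written out, the steps above can be sketched as follows. The notation is an assumption made for this sketch: $X$ is the $N \times d$ matrix whose rows are $x_n^T$, $y$ is the vector of responses, $I_N$ is the identity matrix, and $X^T X$ is taken to be invertible.

```latex
% Likelihood: product of N independent Gaussians, or one multivariate Gaussian
p(y \mid X, w) = \prod_{n=1}^{N} \mathcal{N}\left(y_n \mid w^T x_n, \sigma^2\right)
               = \mathcal{N}\left(y \mid Xw, \sigma^2 I_N\right)

% Negative log-likelihood: the product becomes a sum
\mathrm{NLL}(w) = -\sum_{n=1}^{N} \log \mathcal{N}\left(y_n \mid w^T x_n, \sigma^2\right)
                = \frac{N}{2}\log\left(2\pi\sigma^2\right)
                  + \frac{1}{2\sigma^2}\sum_{n=1}^{N}\left(y_n - w^T x_n\right)^2

% Only the last term depends on w, so minimizing NLL is least squares
\hat{w}_{MLE} = \arg\min_{w} \sum_{n=1}^{N}\left(y_n - w^T x_n\right)^2
              = \left(X^T X\right)^{-1} X^T y
```

The first term of the negative log-likelihood does not involve $w$, which is why the Gaussian noise assumption leads back to the OLS estimator.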

**Linear Regression – Bayesian point estimate with a Prior**

The Bayesian approach provides a way to incorporate our existing knowledge about the distribution from which the samples are drawn. In linear regression, the prior also acts as a helpful regularizer. This feature can be seen by comparing the optimal solutions under OLS and MAP.

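We introduce a prior for the parameter ***w*** (as ***w*** is the set of parameters that defines the underlying probability distribution). In **MAP**, the parameters are random variables. A zero-mean Gaussian prior with precision ***λ***, the usual choice behind the ridge penalty and the one assumed here, is:

$$p(w) = \mathcal{N}\left(w \mid 0, \lambda^{-1} I\right)$$

where ***λ*** is a strictly positive scalar that quantifies how strongly we believe that ***w*** should be close to zero, i.e., it controls the strength of the regularisation. This prior effectively produces an **L2** penalty, also known as “L2 shrinkage”, “ridge penalty”, or “ridge prior” (the name “ridge prior” comes from the fact that the prior covariance matrix consists of a ridge of constant height along the main diagonal).

With this assumption, we have a multivariate normal distribution for the set of samples observed. In its general form, the multivariate normal density is:

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$$

A lot is happening in this equation. Instead of modeling each $x_1, x_2, \dots, x_n$ separately, we are modeling $x \in \mathbb{R}^n$ at a go. The symbol $\Sigma$ represents the covariance matrix (it is not a summation sign). Refer to this video by Andrew Ng for an easy explanation.

Using Bayes' theorem, the log of the posterior probability of the parameter set ***w***, given the observed data ***D*** as above, becomes:

$$\log P(w \mid D) = \log P(D \mid w) + \log P(w) - \log P(D)$$

And hence the MAP estimate is the set of parameters ***w*** that maximizes the log-posterior. Since $\log P(D)$ does not depend on ***w***, this amounts to maximizing the log-likelihood plus the log-prior:

$$w_{MAP} = \arg\min_{w}\ \frac{1}{2\sigma^2}\sum_{n=1}^{N}\left(y_n - w^T x_n\right)^2 + \frac{\lambda}{2}\, w^T w$$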

As you can see, this is the same as having L2 regularization with the OLS objective we saw earlier.
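The shrinkage effect of the prior can be seen on a one-feature example. The sketch below is an assumption-laden illustration: simulated data, no intercept, and a penalty weight `lam` that plays the role of the prior strength after absorbing the noise variance (multiplying the MAP objective by a constant does not move its minimizer).

```python
import random

random.seed(1)
# Simulated data: y = 2x plus Gaussian noise, single feature, no intercept
xs = [random.uniform(-1.0, 1.0) for _ in range(20)]
ys = [2.0 * x + random.gauss(0.0, 0.5) for x in xs]

def fit(xs, ys, lam=0.0):
    """Minimizer of sum((y - w*x)^2) + lam * w^2.
    lam = 0 gives the OLS / MLE solution; lam > 0 gives the ridge / MAP one."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def penalised_loss(w, xs, ys, lam):
    """Squared error plus the L2 penalty, i.e. the objective fit() minimizes."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) + lam * w * w

w_ols = fit(xs, ys)
w_map = fit(xs, ys, lam=5.0)
print(w_ols, w_map)  # the MAP weight is pulled towards zero
```

Setting `lam` to zero recovers OLS, while larger values shrink the weight further towards the prior mean of zero.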

**Logistic Regression**

Logistic regression is a method for binary classification problems, another technique borrowed by machine learning from the field of statistics. Logistic regression ***IS*** a regression model building on a linear model to predict. The regression output, however, is ‘squeezed’ between 0 and 1 by the use of the sigmoid function (see below); hence it can be leveraged as a probability measure of an outcome. Logistic regression becomes a classification technique *only when* a decision threshold is added to predict.

The threshold value setting is crucial when predicting the classes, depending on the classification problem. Ideally, precision and recall should be 1, but this is seldom the case. For example, in a cancer diagnosis application, classification should err on the safe side; the objective is to minimize false negatives. In this case, the setting will be Low Precision/High Recall. On the contrary, in the case of deciding the reaction of a customer to a personalized advert, a positive response needs to be more accurate for optimal sales effort; hence the setting will be towards High Precision/Low Recall.

Types of Logistic Regression:

- **Binomial**: target variable can have only two possible types: “0” or “1”, which may represent “win” vs. “loss”, “pass” vs. “fail”, “dead” vs. “alive”, etc.
- **Multinomial**: target variable can have three or more possible types which are not ordered (i.e., types have no quantitative significance) like “disease A” vs. “disease B” vs. “disease C”.
- **Ordinal**: target variables are ordered categories like “very poor”, “poor”, “good”, “very good”, or equivalent numerical scores like 0, 1, 2, 3.

**Problem Set Up**

Let us consider the binomial case (two-class classification): Instead of the exact labels, we attempt to estimate **the probabilities** of the labels. Then, with ***x***, ***y***, and ***w*** defined as earlier, the problem statement becomes:

Predict:

- $P(y_n = 1 \mid x_n, w) = \mu_n$
- $P(y_n = 0 \mid x_n, w) = 1 - \mu_n$

**Sigmoid function**

$$\mu_n = \sigma(w^T x_n) = \frac{1}{1 + \exp(-w^T x_n)}$$

Here ***σ*** is the sigmoid or logistic function; the model first computes a real-valued score $w^T x$, and non-linearly squashes it into $(0, 1)$ to turn this into a probability score $\mu_n$.

Logistic regression is named for the function used at the method’s core, the logistic function. The logistic or sigmoid function was developed to describe properties of population growth in ecology, rising quickly and maxing out at the carrying capacity of the environment. It’s an S-shaped curve that can take any real-valued number and map it into a value between 0 and 1, but never exactly at those limits.

The Decision Boundary depends on the real-valued score:

If

w^{T}x > 0 ⇒ P(y_{n}= 1|x_{n}, w) > P(y_{n}= 0 |x_{n}, w)

w^{T}x < 0 ⇒ P(y_{n}= 1|x_{n}, w) < P(y_{n}= 0 |x_{n}, w)
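The score, the squashing step, and the decision boundary can be sketched in a few lines of Python. The weights and inputs below are hypothetical values chosen for illustration, and the 0.5 threshold corresponds to the rule above (score greater than zero means class 1).

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow
    for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(w, x):
    """P(y = 1 | x, w): the sigmoid of the real-valued score w^T x."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return sigmoid(score)

def predict_label(w, x, threshold=0.5):
    """Turn the probability into a class label with a decision threshold."""
    return 1 if predict_proba(w, x) >= threshold else 0

w = [1.0, -1.0]                       # hypothetical weights
print(predict_proba(w, [2.0, 1.0]))   # score = +1, probability above 0.5
print(predict_proba(w, [1.0, 2.0]))   # score = -1, probability below 0.5
print(predict_label(w, [2.0, 1.0]), predict_label(w, [1.0, 2.0]))
```

Raising or lowering `threshold` is how the precision/recall trade-off discussed earlier would be tuned in practice.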

**Determination of w:**

Like in linear regression, we will need a loss function as a guide to determine the solution or the parameter that minimizes the losses. Unfortunately, the usual suspect, the squared loss used in ordinary least squares (**OLS**), turns out to be non-convex in ***w*** here and is not easy to optimize. The complexity arises because of the non-linearity of the sigmoid function.

Squared Loss:

$$L(y_n, f(x_n)) = (y_n - f(x_n))^2 = \left(y_n - \sigma(w^T x_n)\right)^2$$

Instead, a helpful option is to use cross-entropy loss as a guide to learning ** w**.

The concept of cross-entropy is from the field of Information Theory, introduced by Claude Shannon in 1948. The basic intuition behind information theory is that learning that an *unlikely event* has occurred is *more informative* than learning that a likely event has occurred. Hence:

- Low Probability Event: High Information (surprising)
- High Probability Event: Low Information (unsurprising)

The amount of information in an event can be estimated using the probability of the event (Shannon information/self-information/information), and Entropy is the expected information over all outcomes. For example, for ***p(x)***, the probability distribution of a random variable ***X***, Entropy is defined as:

$$H(X) = -\sum_{x} p(x) \log_2 p(x)$$

Note that the log base is 2 because the units of the information measure are in bits (binary digits). This measure can be interpreted as the average number of bits required to represent the event. The negative sign ensures that the result is always positive or zero (*log p(x)* **< 0** for probabilities less than 1). Information will be zero when the probability of an event is 1.0 (a certainty).

*Back to logistic regression* – borrowing from entropy, we can get a sense of **representation-ability** between the estimate and the ground truth. If the predicted probability of a class is very different from the ground truth, the value of cross-entropy loss is high.

In particular, the *cross-entropy loss* or **log loss function** is used as a cost function for logistic regression models (or models with softmax output for multinomial logistic regression or neural network) to estimate the parameters.

Hence, for a **particular sample** $x_n$ and its estimation, the loss becomes:

$$l(y_n, f(x_n)) = -y_n \log(\mu_n) - (1 - y_n)\log(1 - \mu_n)$$

$$= -y_n \log\left(\sigma(w^T x_n)\right) - (1 - y_n)\log\left(1 - \sigma(w^T x_n)\right)$$

Summing over all the observations and adding an **L2** regularization, the loss function over the sample space becomes:

$$L(w) = -\sum_{n=1}^{N}\left[y_n w^T x_n - \log\left(1 + \exp(w^T x_n)\right)\right] + \lambda\, w^T w$$

The terms like $y_n w^T x_n$ in the above equation can be derived by substituting $\mu_n = \sigma(w^T x_n)$ and expanding the terms (hint: the term $y_n \log(1 + \exp(w^T x_n))$ cancels out!).

Hence the problem becomes finding a set of parameters ***w*** such that the above loss function is minimized.
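The substitution step can be verified numerically. The sketch below compares the per-sample cross-entropy written in terms of the probability with the same loss written in terms of the score; the function names and test values are illustrative assumptions, and the regularization term is left out.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(y, mu):
    """Per-sample loss in terms of the predicted probability mu."""
    return -y * math.log(mu) - (1 - y) * math.log(1 - mu)

def cross_entropy_from_score(y, s):
    """The same loss in terms of the score s = w^T x, after substituting
    mu = sigmoid(s) and simplifying: -y*s + log(1 + exp(s))."""
    return -y * s + math.log1p(math.exp(s))

# The two forms agree for both labels and a range of scores
for y in (0, 1):
    for s in (-2.0, 0.5, 3.0):
        print(y, s, cross_entropy(y, sigmoid(s)), cross_entropy_from_score(y, s))

# A confident wrong prediction is penalized far more than a confident right one
print(cross_entropy(1, 0.01), cross_entropy(1, 0.9))
```

The score form is also the more convenient one for taking derivatives with respect to the weights.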

**Logistic Regression – MLE approach**

Given $D = \{(x_1, y_1), \dots, (x_N, y_N)\}$, $P(y_n = 1 \mid x_n, w) = \mu_n$ and $P(y_n = 0 \mid x_n, w) = 1 - \mu_n$, the log-likelihood becomes:

$$\log P(D \mid w) = \sum_{n=1}^{N}\left[y_n \log \mu_n + (1 - y_n)\log(1 - \mu_n)\right]$$

Substituting for $\mu_n$ and adding an **L2** regularization term, the solution is the same as obtained by the cross-entropy method above.

The solution using the **MAP** approach with a Bayesian prior also turns out to be the cross-entropy loss minimization with **L2** regularization. The ***w*** prior taken is the same as previously covered in the last blog, leading to multivariate Gaussian distribution likelihood estimations.

**Multiclass Logistic or Softmax Regression**

Softmax Regression (also known as Multinomial Logistic, Maximum Entropy Classifier, or just Multi-class Logistic Regression) is a generalization of logistic regression that is used for multi-class classification (under the assumption that the classes are mutually exclusive). The softmax function replaces the sigmoid logistic function:

$$P(y = k \mid x, W) = \frac{\exp(w_k^T x)}{\sum_{j=1}^{K} \exp(w_j^T x)}$$

For $x = x_n$, the above formula is the probability that the $n^{th}$ sample belongs to the $k^{th}$ class. The sum of the probabilities over the ***K*** classes is equal to ***1***. The class ***k*** with the largest $w_k^T x_n$ dominates the probability.

This link has an excellent visual representation of logistic and soft-max regression and example python implementation.
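A minimal sketch of the softmax function follows, taking a list of class scores (the values $w_k^T x_n$) as input. The scores are made-up numbers, and subtracting the maximum before exponentiating is a common numerical-stability trick rather than something stated in the class notes.

```python
import math

def softmax(scores):
    """Map a list of real-valued class scores to probabilities that sum to 1.
    Subtracting the maximum score first leaves the result unchanged but
    avoids overflow in exp()."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])   # hypothetical scores for three classes
print(probs, sum(probs))           # the class with the largest score dominates

# With two classes and scores [s, 0], softmax reduces to the sigmoid of s
two_class = softmax([1.5, 0.0])
print(two_class[0], 1.0 / (1.0 + math.exp(-1.5)))
```

The two-class check shows why softmax is described as a generalization of the logistic function.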

*Section 3 – Optimization*

In most cases, machine learning involves learning and generalizing an algorithm from historical data, with the objective of making predictions on new data. We can describe this problem as approximating a function that maps examples of inputs to examples of outputs by selecting *a function* from a well-defined class of functions (a line or a hyperplane) such that the loss function is minimized for the set of training data. This process is close to **function approximation** (to select a function among a well-defined class that closely matches a target function in a task-specific way). In reality, this process can be complicated when the sample is small and contains noise and distortions, or when a suitable structure is nonlinear or not differentiable.

**Function optimization** is often simpler than *function approximation*. In the simplest case, an optimization problem consists of maximizing or minimizing a real function by systematically choosing input values from within an allowed set and computing the value of the function.

Approximating a function can be solved by framing the problem as function optimization. Specifically, in ML, for a parameterized *mapping function* (e.g., a weighted sum of inputs), an optimization algorithm is used to find the *values of the parameters* (model coefficients) that minimize the mapping function’s error.

At the core of nearly all machine learning algorithms is an optimization algorithm. Often the optimization is at multiple levels, including:

- Choosing the hyperparameters of a model.
- Choosing the transforms to apply to the data before modeling
- Choosing the modeling pipeline to use as the final model.

Optimization is the key; read here for a fuller description of the process. Two factors decide the fate of any method: what kind of optimization problem we are led to, and which optimization methods are available to us.

Several methods are available for optimization; among these, gradient descent methods are the most popular. Gradient descent methods are used in Linear Regression, Logistic Regression (it is used for classification, but instead of labels, it gives us class probabilities), Support Vector Machines, Neural Networks, etc.

The idea of the Gradient Descent Algorithm:

- Start at some random point (of course, final results will depend on this).
- Take steps based on the gradient vector at the current position until convergence.
- Gradient vectors give us the direction and rate of fastest increase at any point; to minimize, we step in the opposite direction.
- At any point ***x***, if the gradient is nonzero, then the direction of the gradient is the direction in which the function increases most quickly from ***x***.
- The magnitude of the gradient is the rate of increase in that direction.
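The steps above can be sketched on a simple two-parameter function. Everything here is an illustrative assumption: the quadratic $f(w) = (w_1 - 3)^2 + 2(w_2 + 1)^2$ with its minimum at (3, -1), the starting point, and the two step sizes.

```python
def grad(w):
    """Gradient of f(w) = (w[0] - 3)^2 + 2 * (w[1] + 1)^2."""
    return [2.0 * (w[0] - 3.0), 4.0 * (w[1] + 1.0)]

def gradient_descent(w0, eta, steps=200, tol=1e-10):
    """Repeatedly step against the gradient, stopping early once the
    squared gradient norm falls below tol."""
    w = list(w0)
    for _ in range(steps):
        g = grad(w)
        if sum(gi * gi for gi in g) < tol:
            break
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

converged = gradient_descent([0.0, 0.0], eta=0.1)
print(converged)   # close to the minimum at [3, -1]

# A step size that is too large overshoots further on every iteration
diverged = gradient_descent([0.0, 0.0], eta=0.6)
print(diverged)    # the second coordinate blows up
```

The second run shows why the step size matters: the same update rule diverges along the steeper direction when the step is too large.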

**Logistic Regression through gradient descent**

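This lecture note provides an excellent summary of LR, cross-entropy loss, and the gradient descent method.

The solution to logistic regression is to minimize the below cost function:

$$L(w) = -\sum_{n=1}^{N}\left[y_n w^T x_n - \log\left(1 + \exp(w^T x_n)\right)\right] + \lambda\, w^T w$$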

Since there is no closed-form solution, we resort to an iterative gradient descent method. This loss function is conveniently convex, with a single minimum; there are no local minima to get stuck in, so gradient descent starting from any point is guaranteed to find the minimum.

Gradient descent aims to find the optimal weights to minimize the loss function. The loss function ***L*** is parameterized by the weights $w \in \mathbb{R}^D$ (or equivalently, we are now looking at the parameter space; the samples have been observed as a fact). So, the goal is to find the set of weights that minimizes the loss function, averaged over all examples.

The gradient descent method finds a minimum of a function by figuring out in which direction (in the space of the parameters ***w***) the function’s slope rises the most steeply and moves in the opposite direction.

For the moment ignoring the regularization (along with the learning rate, the regularization affects the adjustments to parameters at each iteration), the **gradient** of the loss function is:

$$g = \nabla_w L(w) = -\sum_{n=1}^{N} (y_n - \mu_n)\, x_n = X^T(\mu - y)$$

Note the transition of the variable order from the sum $\sum_n (y_n - \mu_n)\, x_n$ to the matrix representation $X^T(\mu - y)$, where ***X*** is the $N \times D$ matrix with rows $x_n^T$; the leading minus sign is absorbed by swapping the order inside the bracket.

To make sense of this, note that the final multiplication is between $D \times N$ and $N \times 1$ matrices, leading to a $D \times 1$ column matrix. Since $w \in \mathbb{R}^D$, we will need derivatives with respect to the ***D*** components of ***w***.

This gradient of the loss function is a vector pointing in the direction of the greatest increase in the loss function. The gradient is a multi-variable generalization of the slope; we can informally think of the gradient as the slope. Like the single weight case, gradient descent tells us to go in the opposite direction of the greatest increase to find the minimum of the loss function.

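**Gradient descent process**:

- Initialize $w^{(1)} \in \mathbb{R}^D$ randomly.
- Iterate until convergence: $w^{(t+1)} = w^{(t)} - \eta_t\, g^{(t)}$, where $g^{(t)}$ is the gradient at $w^{(t)}$ and $\eta_t$ is the learning rate.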
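A minimal sketch of this process for logistic regression is shown below, using only the standard library. The six data points are made up, regularization is ignored, the learning rate is fixed, and the weights start at zero rather than at a random point to keep the run reproducible; all of these are simplifying assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy, made-up data: a constant 1.0 for the bias term plus one feature
X = [[1.0, -2.0], [1.0, -1.0], [1.0, -0.5], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 0, 1, 1]

def loss(w):
    """Cross-entropy loss summed over the sample (no regularization)."""
    total = 0.0
    for x_n, y_n in zip(X, y):
        s = sum(wi * xi for wi, xi in zip(w, x_n))
        total += -y_n * s + math.log1p(math.exp(s))
    return total

def gradient(w):
    """Sum over samples of (mu_n - y_n) * x_n, i.e. X^T (mu - y)."""
    g = [0.0] * len(w)
    for x_n, y_n in zip(X, y):
        mu_n = sigmoid(sum(wi * xi for wi, xi in zip(w, x_n)))
        for d in range(len(w)):
            g[d] += (mu_n - y_n) * x_n[d]
    return g

def fit(eta=0.1, iterations=500):
    w = [0.0, 0.0]
    for _ in range(iterations):
        g = gradient(w)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

w_hat = fit()
initial_loss = loss([0.0, 0.0])
final_loss = loss(w_hat)
print(w_hat, initial_loss, final_loss)
```

Because the data are not perfectly separable, the weights settle at finite values instead of growing without bound.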

**Learning rate η**:

The change to the parameters made in each iterative step is the learning rate ***η*** times the gradient. A higher (faster) learning rate means we should move ***w*** more on each step. Compared to the slope in the case with a single weight, we don’t just want to move left or right; we want to know where in the ***D***-dimensional space (of the ***D*** parameters that make up ***w***) we should move.

The gradient is such a vector; it expresses the directional components of the sharpest slope along each of the ***D*** dimensions. For two dimensions, $D = 2$, the gradient might be a vector with two orthogonal components, each of which tells us how much the ground slopes in each of the dimensions.

The learning rate ***η*** is a *hyperparameter* that must be adjusted. If it’s too high, the learner will take steps that are too large, overshooting the minimum of the loss function. If it’s too low, the learner will take steps that are too small, and take too long to get to the minimum. It is common to start with a higher learning rate and then slowly decrease it, so that it is a function of the iteration ***k*** of training; the notation $\eta_k$ can be used to mean the value of the learning rate at iteration ***k***.

A couple of ways to find the optimal value of ***η***:

- Choose the optimal step size $\eta_t$ at each iteration using line search (as mentioned above).
- Add momentum to the update: $w^{(t+1)} = w^{(t)} - \eta_t\, g^{(t)} + \alpha_t \left(w^{(t)} - w^{(t-1)}\right)$ (this builds inertia in a direction in the search space, overcomes the oscillations of noisy gradients, and coasts across flat spots of the search space).
- Use second-order methods like the Newton method to exploit the curvature of the loss function. (But we need to compute the Hessian matrix.)

This lecture note provides a detailed explanation of these methods.

**Stochastic gradient descent**:

Calculating the gradient in each iteration requires all the data. Although using the whole dataset gives a less noisy, less random path to the minimum, the problem arises when our dataset gets big. When N is large, this may be infeasible or computationally very expensive.

The word ***stochastic*** means a system or process linked with a random probability. Instead of all data, in SGD, a few samples are selected randomly. These few samples together are called a *batch*. (In the gradient descent approach mentioned above, the entire data set is a batch.)

In its purest form, only a single sample (batch size of one) is taken to perform each iteration in SGD. The data is randomly shuffled and a sample is selected for each iteration, reducing the computations enormously. It is also common to sample a small number of data points instead of just one point at each step (mini-batch gradient descent). Mini-batch tries to strike a balance between the goodness of gradient descent and the speed of SGD.
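The shuffle-and-batch loop can be sketched as follows. To keep the example short it fits a straight line with squared loss rather than a logistic model; the noise-free simulated data, the batch size of 8, and the fixed learning rate are all assumptions made for illustration.

```python
import random

random.seed(42)
# Simulated, noise-free data from the line y = 2x + 1
xs = [random.uniform(-1.0, 1.0) for _ in range(200)]
ys = [2.0 * x + 1.0 for x in xs]

def sgd(xs, ys, eta=0.1, batch_size=8, epochs=50):
    """Mini-batch SGD for the model y_hat = w * x + b with squared loss.
    Each update uses the average gradient over one small random batch."""
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        random.shuffle(indices)  # new random order every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            gw = sum((w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            gb = sum((w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= eta * gw
            b -= eta * gb
    return w, b

w_sgd, b_sgd = sgd(xs, ys)
print(w_sgd, b_sgd)  # close to the true slope 2 and intercept 1
```

Setting `batch_size=1` gives the single-sample version, while setting it to the full sample size recovers ordinary gradient descent.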

**Hyperplane based classification – The Perceptron Algorithm (Rosenblatt, 1958)**

With some modification, the stochastic gradient descent process leads to interesting constructs.

Let us approximate the gradient using a randomly chosen data point $(x_n, y_n)$. Without summing over the entire sample dataset, the summation in the update term now becomes just the difference between the sigmoid output and the observed value, multiplied by $x_n$:

$$w^{(t+1)} = w^{(t)} - \eta_t\, x_n \left(\mu_n^{(t)} - y_n\right)$$

We can further replace the predicted label probability $\mu_n^{(t)}$ by the predicted binary label $\hat{y}_n^{(t)}$, where:

- $\hat{y}_n^{(t)} = 1$ if $w^{(t)T} x_n \ge 0$
- $\hat{y}_n^{(t)} = 0$ if $w^{(t)T} x_n < 0$

Hence the update rule becomes:

$$w^{(t+1)} = w^{(t)} - \eta_t\, x_n \left(\hat{y}_n^{(t)} - y_n\right)$$

This modification leads to a *mistake-driven update rule*; $w^{(t)}$ gets updated only when there is a misclassification, because $\hat{y}_n^{(t)} - y_n = 0$ otherwise. We can further simplify the system by changing the class labels to {-1, +1}, or effectively $y_n \in \{-1, +1\}$. On a mistake, $\hat{y}_n^{(t)} - y_n = -2 y_n$; hence the update $w^{(t+1)} = w^{(t)} + 2\eta_t\, y_n x_n$ happens only when there is misclassification.

This effectively is a *perceptron learning algorithm*, which is a *hyperplane-based* learning algorithm.

The aim is to learn a linear hyperplane to separate two classes.

- Mistake-driven online (stochastic, or batch-size = 1) learning algorithm.
- Guaranteed to find a separating hyperplane if the data is linearly separable.
- Initialize weights to zero: $w_{old} = [0, \dots, 0]$, $b_{old} = 0$.
- Iterate with random samples, updating the weights by the above formula only in case of an incorrect prediction.
- If the data is linearly separable, the perceptron algorithm converges.
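The points above can be sketched as a short Python function. The six data points are made up and linearly separable, the samples are visited in a fixed order rather than randomly to keep the run reproducible, and the factor 2 in the update is absorbed into the learning rate `eta`; these are simplifications for illustration.

```python
def predict(w, b, x):
    """Sign of the score w^T x + b, as a label in {-1, +1}."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

def train_perceptron(points, labels, eta=1.0, max_epochs=100):
    """Mistake-driven updates: change w and b only on a misclassified sample.
    Stops as soon as a full pass over the data makes no mistakes."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # wrong side of (or exactly on) the hyperplane
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
                mistakes += 1
        if mistakes == 0:
            break
    return w, b

# Made-up, linearly separable data with labels in {-1, +1}
points = [(2.0, 1.0), (3.0, 2.0), (2.0, 3.0), (-1.0, -1.0), (-2.0, -1.0), (-1.0, -3.0)]
labels = [1, 1, 1, -1, -1, -1]

w, b = train_perceptron(points, labels)
print(w, b)
print([predict(w, b, x) for x in points])
```

On data that is not linearly separable, the loop would simply run until `max_epochs` without ever completing a mistake-free pass.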

But in practice, most often, the data is not linearly separable. Then

- Make the data linearly separable using kernel methods (leads to Support Vector Machines, which ruled machine learning for decades)
- (Or) Use multilayer perceptron (leads to Deep Learning!)

Please see these lecture notes for intuitive and detailed explanations of the Perceptron Algorithm and Hyperplane based Classification.

**Disclaimer**:

*I currently work full-time at Swiss Re, Bengaluru. The blogs and articles on this website **www.balajos.com** are the personal posts of myself, Balachandra Joshi, and only contain my personal views, thoughts, and opinions. It is not endorsed by Swiss Re (or any of my formal employers), nor does it constitute any official communication of Swiss Re.*

*Also, please note that the opinions, views, comprehensions, impressions, deductions, etc., are my takes on the vast resources I am lucky to have encountered. No individuals or entities, including the Indian Institute of Science and NSE Talent Sprint who have shown me where to research, or the actuarial professional bodies that provide me continuous professional growth support, are responsible for any of these views; and these musings do not by any stretch of imagination represent their official stands; and they may not subscribe/support/confirm any of these views and hence cannot be held liable in any vicarious way. All the information in the public space is shared to share the knowledge without any commercial advantages.*