
Class 2 – ML Log – Learning Methods & Bayesian Decision insights


How do young actuaries get ready for the brave new data world? One sure way is to keep up the learning pace even after qualification. After all, it is a choice: either you equip yourself to ride the disruption wave everyone seems to be speaking about (but has no clue how), or you go under the wave with the hoi polloi.

The traditional way of learning from data in the actuarial field carries the statistical rigour of its era, not long past, but with limited capability to wrangle data as we can today. The A/E of experience studies in life & health insurance and run-off triangles in general insurance are the main (or, given the legacy data systems, likely the only) learning feedback loops. As the sector starts interacting digitally with other enabling platforms, the way we learn from data will evolve too.

This blog covers the learning approaches and Bayesian decision theory; note the very recent AI/ML learning paradigms. Actuaries aren't new to Bayes decision theory, but we haven't been consciously leveraging its capabilities. The actuarial control cycle and our pricing approaches do reflect the active feedback loop, but the underlying wonder of the Bayesian method gets missed in the processes and practice standards.

Read on for a perspective different from that of the actuarial courses, specifically on how decisions are made based on the likelihood ratio.

This is part of the learning log from the 3rd module, Machine Learning Basics for Real-World, from the certification course on Digital Health and Imaging. Links to the module learnings are below:


Learning Summary – Class 2; 26 June 2022:

Part 1 – What is learning in Machine Learning? – Classification of Learning Approaches

  1. Learn by exploring data: Supervised learning includes regression (Linear Regression, Regression Trees, Non-Linear Regression, Bayesian Linear Regression, and Polynomial Regression) and classification (Random Forest, Decision Trees, Logistic Regression, and Support Vector Machines). Unsupervised learning includes clustering, association/density estimation, visualization, and projection (Principal Component Analysis).
  2. Learn from data; in more challenging circumstances – Semi-supervised Learning, Domain adaptation, Active Learning
  3. Learn by interacting with an environment – Reinforcement learning, Multi-armed Bandits
  4. Very recent challenging AI paradigms – Zero/One/Few-shot Learning, Transfer Learning, Multi-agent reinforcement learning 

Part – 2: Bayesian Decision Theory

  1. Review of probability terms
  2. Bayesian Decision Theory
  3. Error Analysis
  4. Generalizing the Bayesian Decision Theory
  5. A two-category Classification and selection threshold
  6. Loss functions – zero-one threshold

The second class on Sunday afternoon was overwhelming. The professor explained the learning methods briefly. Since I was already acquainted with some of these methods, I had to revise to gain clarity and link those hazy ideas into this new structure. The Bayesian decision method triggered almost another panic attack; probability theory is really elusive. If one takes a long break from the theory, it breeds probabilistic doubts in the mind about how much was actually retained!


What is learning in Machine Learning?

It is hard to define learning precisely in a universal sense. We can learn to solve some problems deterministically where a hardcoded logical solution is feasible, like sorting an array or finding an area given the perimeter description. However, the real world is usually a more complex interaction of many simpler processes, needing an iterative algorithm that relies on past observations and gets better as more data is observed. With increased computing capabilities, AI/ML methods enable more such algorithms to be conceived and developed.

Actuaries, in a sense, work in this iterative way by default. The pricing of risks and the management of in-force risk portfolios have adapted to the actuarial feedback-loop approach. The portfolios are subject to complex interactions of influences ranging from demographic and economic factors to man-made and natural hazards. It isn't surprising to see many common mathematical and statistical concepts between the actuarial and emerging AI/ML fields.

Classification of Learning Approaches 

In machine learning, learning means acquiring skills or knowledge from experience or, equivalently, synthesizing useful concepts from historical data. Some typical classifications describe whole subfields of the study comprising many different types of algorithms, such as supervised learning. One way to classify the learning methods is by how we interact with the data to solve prediction problems.

1 Learn by exploring data 

Supervised learning involves models that learn a mapping between input examples and the target variable. The training data includes the input vectors and their corresponding result/ground truth vectors. In this case, learning is discovering an algorithm or a mapping that predicts the results on new examples beyond the training set.

Learning is supervised in the sense that the target y is provided; the target labels in the training data guide the algorithm towards better predictions. Supervised learning can be split into classification, which involves predicting a class label, and regression, which involves predicting a numerical label. A problem may have one or more input variables, and the input variables may be of any data type.

Regression algorithms are used when the output is a continuous variable whose relationship with the inputs we want to learn, for example, in weather forecasting or predicting market trends. Some popular regression algorithms under supervised learning are Linear Regression, Regression Trees, Non-Linear Regression, Bayesian Linear Regression, and Polynomial Regression.

Classification algorithms are used for categorical outputs, such as the two-class cases Yes-No, Male-Female, or True-False. Random Forest, Decision Trees, Logistic Regression, and Support Vector Machines are some examples of classification algorithms. Excellent examples of classification problems are separating spam from regular emails and recognizing handwritten digits 0-9.

Some algorithms may be specifically designed for both types of problems with minor modifications (such as artificial neural networks).
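To make the split concrete, here is a minimal sketch of the two tasks on synthetic data, assuming scikit-learn is available; the data and model choices are my own illustration, not from the lecture:

```python
# A minimal sketch contrasting the two supervised tasks on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))            # two input features

# Regression: the target is a continuous value.
y_reg = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
reg = LinearRegression().fit(X, y_reg)
print("predicted value:", reg.predict(X[:1]))

# Classification: the target is a class label (here 0 or 1).
y_cls = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = LogisticRegression().fit(X, y_cls)
print("predicted label:", clf.predict(X[:1]))
```

The only difference to the caller is the type of target handed to fit(); the algorithms underneath are what change.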

Advantages of Supervised learning: 

  • The algorithms learn from prior experience (labeled examples), 
  • we get to know the classes of objects, 
  • it has helped solve problems such as fraud detection and spam filtering.

Disadvantages of supervised learning: 

  • Not suitable for handling very complex tasks, 
  • prediction on new data may be poor if the test data differs from the training dataset, 
  • training requires a lot of computation time, 
  • knowledge about the classes of objects is needed upfront.

Unsupervised learning learns from the input data alone, without outputs or target variables; the algorithms must make sense of the data without knowing the ground truth. The input to unsupervised learning models is likely to be unstructured and may contain noise, missing values, or unknown data. The two main classes of unsupervised learning are clustering and association/density estimation.

Clustering groups objects into clusters such that objects in a group have the most similarities to each other and few or no similarities with the objects of another group. Clustering is about finding statistical commonalities between the data objects.

Association is a rule-based technique that attempts to find useful relations between variables in a large database in the form of probability distributions. For example, the algorithms track items that tend to occur together, such as people buying bread also being likely to purchase butter or jam.

Visualization involves graphing/plotting data in different ways; an example is creating a scatter plot matrix for each pair of variables in the dataset. Projection methods reduce the dimensionality of the data by creating lower-dimensional representations. Principal Component Analysis uses the projection method to summarize a dataset in terms of eigenvalues and eigenvectors, with linear dependencies removed.

Some popular unsupervised learning algorithms are K-means clustering, KNN (k-nearest neighbors), Hierarchical clustering, Anomaly detection, Neural Networks, Principal Component Analysis, Independent Component Analysis, the Apriori algorithm, and Singular Value Decomposition.
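As a small illustration of clustering (again my own sketch, with synthetic blobs and scikit-learn's KMeans), the algorithm recovers the groups without ever seeing a label:

```python
# K-means finds groups without labels; the three "true" blobs are synthetic.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", np.bincount(kmeans.labels_))
print("cluster centres:\n", kmeans.cluster_centers_)
```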

Advantages of Unsupervised Learning: 

  • Can be used for more complex tasks, 
  • is preferable when labels are scarce, as unlabelled data is easier to obtain than labeled data.

Disadvantages of Unsupervised Learning: 

  • Intrinsically more complex, 
  • The results may be less accurate as algorithms do not know the exact output in advance.

2 Learn from data; in more challenging circumstances

Semi-supervised learning occupies the middle ground between supervised and unsupervised learning algorithms. The training data has very few labeled and many unlabelled examples; this way, the learning effectively uses all available data. Even the labels may not be perfectly accurate truths. The abundant unlabelled data provides clues to discover groups or patterns; the supervised methods can then label the unlabelled examples, or the learning can be used to make predictions.

Problems from fields with little labeled data, like computer vision (image data), natural language processing (text data), and automatic speech recognition (audio data), benefit from semi-supervised learning; they cannot be easily handled using standard supervised learning methods.
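Here is a hedged sketch of the idea using scikit-learn's SelfTrainingClassifier on synthetic data; the library's convention of marking unlabelled points with -1 is the only API detail assumed:

```python
# Semi-supervised self-training: labels exist for only 10% of the data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))
y_true = (X[:, 0] > 0).astype(int)

y = np.full(300, -1)                              # -1 marks "unlabelled"
labelled = rng.choice(300, size=30, replace=False)
y[labelled] = y_true[labelled]                    # reveal only 30 labels

# The base classifier labels the points it is confident about, retrains,
# and repeats until no more confident pseudo-labels can be added.
model = SelfTrainingClassifier(LogisticRegression()).fit(X, y)
print("accuracy on all points:", (model.predict(X) == y_true).mean())
```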

Machine learning performance depends on the dataset a model is trained on; imperfections in the data affect the models. One type of problem is domain shift, where a model trained on one dataset may not be able to perform the same task on even a slightly different dataset. For example, a model trained to detect dogs in outdoor settings might fail to detect dogs indoors because the backgrounds are different.

Domain adaptation applies an algorithm trained in one source domain to a different (but related) target domain. The source and target domains have the same feature space (but different distributions). Domain adaptation is a subcategory of transfer learning; transfer learning includes broader cases where the target domain’s feature space differs. 

A domain is the combination of an input space X, an output space Y, and an associated probability distribution p; two domains are different if they differ in any of these characteristics. The common approaches to domain adaptation follow an adversarial approach: adversarial machine learning aims to trick machine learning models by providing deceptive input, including the generation and detection of adversarial examples.

Active learning prioritizes the data that needs to be labeled to have the highest impact on training a supervised model. The learning algorithm can interactively query for labels on new input data points to improve the model fit; in the statistics literature, this is referred to as optimal experimental design. Active learning is used when the amount of data is too large to label fully and labeling takes effort and cost.
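A minimal uncertainty-sampling loop, my own sketch rather than any library's API: the model repeatedly asks an "oracle" to label the point it is least sure about:

```python
# Active learning by uncertainty sampling on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 2))
y_oracle = (X[:, 0] + X[:, 1] > 0).astype(int)    # labels we "pay" to reveal

# Seed with one labelled example from each class.
labelled = [int(np.argmax(y_oracle == 0)), int(np.argmax(y_oracle == 1))]

for _ in range(20):                               # budget: 20 more labels
    clf = LogisticRegression().fit(X[labelled], y_oracle[labelled])
    proba = clf.predict_proba(X)[:, 1]
    uncertainty = np.abs(proba - 0.5)             # 0 = most uncertain
    uncertainty[labelled] = np.inf                # skip already-labelled points
    labelled.append(int(np.argmin(uncertainty)))  # query the oracle

print("labelled only", len(labelled), "of 500 points;",
      "accuracy:", (clf.predict(X) == y_oracle).mean())
```

The design choice is simply where to spend the labelling budget: on the points closest to the current decision boundary rather than on random ones.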

3 Learn by interacting with an environment 

Reinforcement learning, supervised learning, and unsupervised learning are the three basic ML paradigms. Reinforcement learning enables an agent to learn in an interactive environment by trial and error, using feedback from its own actions and experiences. Instead of a fixed training dataset, there is a goal or set of goals that the learning agent is required to achieve, actions it may perform, and the related performance feedback. Basic reinforcement learning is modeled as a Markov decision process.

Some examples are Q-learning, temporal-difference learning, and deep reinforcement learning. Due to its generality, reinforcement learning is studied in many disciplines, such as game theory, control theory, operations research, information theory, simulation-based optimization, multi-agent systems, swarm intelligence, and statistics. 

Multi-armed bandit problems are some of the simplest reinforcement learning problems. A player is faced with k slot machines or bandits (so called because, historically, casinos set them up so that players got a raw deal!), each with a different reward distribution. The player tries to maximize the cumulative reward over a series of trials. The trade-off faced at each trial is between betting on the machine known so far to have the highest expected payoff and trying out the other machines that may provide even better gains.
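An epsilon-greedy strategy is the classic baseline for this trade-off; the sketch below is my own illustration with invented payout rates:

```python
# Epsilon-greedy bandit: mostly exploit the best arm, occasionally explore.
import numpy as np

rng = np.random.default_rng(4)
true_means = [0.2, 0.5, 0.7]          # hidden payout rates of 3 "machines"
counts = np.zeros(3)                  # pulls per arm
values = np.zeros(3)                  # running mean reward per arm
epsilon, total = 0.1, 0.0

for t in range(10_000):
    if rng.random() < epsilon:
        arm = rng.integers(3)         # explore a random arm
    else:
        arm = int(np.argmax(values))  # exploit the best estimate so far
    reward = float(rng.random() < true_means[arm])       # Bernoulli payout
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
    total += reward

print("estimated means:", values.round(3), "cumulative reward:", total)
```

A larger epsilon explores more (better estimates, lower immediate reward); a smaller one exploits more, which is exactly the trade-off described above.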

4 Very recent challenging AI paradigms 

Zero/One/Few-shot Learning: Few-shot learning (FSL) classifies new data using supervised information from only a few training samples. FSL is a young area that needs more research and refinement, but it has been successful in computer vision tasks, for example, categorizing bone illnesses from x-ray photos where enough images aren't available. Zero-shot learning tries to classify unseen classes without any training examples, relying only on a general idea of an object: its appearance, properties, and functionality.

In general, input data volume comes at a cost: collection time, pre-processing, computation, and so on. By learning from fewer examples, companies can reduce their data analysis/machine learning (ML) costs.

In transfer learning, a model trained on one task is re-purposed for a second, related task. Reusing a previously learned model on a new problem is particularly popular in deep learning as a way to train deep neural networks with a small amount of data.

In deep neural networks trained on images, the early layers of the network learn low-level features like detecting edges, colors, variations of intensity, etc. These low-level features appear not to be specific to a particular dataset or task, and are hence transferable between tasks. As a result, it is now uncommon to see whole convolutional neural networks trained from scratch; pre-trained models trained on large image collections for a similar task are used instead.
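A hedged PyTorch sketch of that recipe, assuming torchvision is installed; the two-class head and the choice to freeze everything else are my illustration:

```python
# Transfer learning: reuse a pre-trained backbone, train only a new head.
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")  # pre-trained backbone

# Freeze the early layers: their low-level features transfer as-is.
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer with a new head for our (hypothetical) 2-class
# task; only this layer will be trained on the small new dataset.
model.fc = nn.Linear(model.fc.in_features, 2)
```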

Multi-agent reinforcement learning studies the behavior of multiple agents that coexist in a shared environment. Each agent is motivated by its own rewards and acts to advance its own interests; these interests may be opposed to those of other agents, resulting in complex group dynamics. Multi-agent reinforcement learning is closely related to game theory, especially repeated games. More importantly, its study combines the pursuit of reward-maximizing algorithms with a more sociological set of concepts.


Bayesian Decision Theory

Bayesian decision theory is a statistical approach to the problem of pattern classification. It is considered the ideal pattern classifier because its decision rule automatically minimizes its loss function. The approach quantifies the trade-offs between various classification decisions using probability and minimizes the costs of such decisions. The underlying assumption is that the decision can be made in probabilistic terms; this requires that all the relevant probability values are known. 

Review of probability terms

Some terms need reiterating to understand the Bayesian way. I found looking at the frequency table below (a survey of 500 respondents, split by gender and preferred sport) helpful in connecting the concepts.

Joint probability calculates the probability of two events occurring together at the same time:

P(A and B) or P(A, B) or P(A ∩ B).

So, for example, the probability of finding a female football lover in the table is 0.05 (25 out of the 500 surveyed). Note this is specific to this group that answered queries on their preferences.

Marginal probability is the one usually referred to as the probability of an event; it is the probability of an event irrespective of any other event/circumstance. Interestingly, these are the probabilities found in the margins of the table above. For example, the probability of being a male in the surveyed group above is 0.54, irrespective of the sports preference. Similarly, the probability of finding a cricket fan is 0.39, regardless of gender. It is denoted by P(A) and read as “probability of A“.

Conditional Probability is a bit tricky to handle. It is the probability of one event occurring given that another event has happened (by assumption, presumption, assertion, or evidence). Alternatively, it is the probability of occurrence of an event A when another event B has already occurred. It is denoted by P(A|B) and read as “probability of A given B”.

From our table, how do we estimate the probability that a person is a football lover when we know the respondent is female? The fact that we know the respondent is a woman reduces the cohort we are looking at: instead of all 500 respondents, we now look only at the responses from the 230 women to decide the chances that a particular woman respondent is a football lover.

So, we have 25 football lovers out of 230 women; hence the required probability is 25/230, or about 0.11 (note this differs from the joint probability of 0.05 in the table). To get the overall perspective of the survey, let us divide the numerator and denominator by the total number of responses. This mathematical trick provides an interesting and intuitive link to the overall survey.

P(\text{football lover} \mid \text{female}) = \frac{25}{230} = \frac{25/500}{230/500} = \frac{P(\text{football lover and female})}{P(\text{female})}

Hence a formal definition of conditional probability is:

P(A|B) = P(A and B) / P(B)

This insight results in two useful formulas we need for navigating the Bayes world. 

The first one: multiplying both sides of the equation above by P(B) gives P(A and B) = P(A|B) P(B)

From this equation, you can also see that to calculate P(A and B), instead of counting the joint occurrences directly, we can use the product of two probabilities: P(A|B) P(B) is the probability of A occurring given that event B has occurred, multiplied by the probability of event B happening. Pay close attention: you can interchange A and B.

Let us consider the table again. First, note how you can treat the column as the event that happened first (you know the gender first and then check the sport preferred). Equivalently, you can ascertain the sport preferred first, then check the proportion of males. Both yield the same proportion of the total surveyed; as you can see, the row or column totals cancel out in the equation.

P(Male and Cricket Lover) = P(Cricket Lover | Male) * P(Male)
= 120 / 270 * 270 / 500 = 120/500
Or equivalently
P(Male and Cricket Lover) = P(Male | Cricket Lover) * P(Cricket Lover)
= 120 / 195 * 195 / 500 = 120/500
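All the numbers quoted in this section can be rechecked in a few lines of Python; everything below is recomputed from the counts already used above (the female cricket count follows by subtraction):

```python
# Recomputing the survey probabilities from the counts in the text:
# 500 total, 270 men / 230 women, 195 cricket fans, 120 male cricket fans,
# 25 female football fans.
total, male, female = 500, 270, 230
cricket, male_cricket, female_football = 195, 120, 25
female_cricket = cricket - male_cricket            # 75, by subtraction

print(female_football / total)    # joint P(female & football)    = 0.05
print(male / total)               # marginal P(male)               = 0.54
print(cricket / total)            # marginal P(cricket)            = 0.39
print(female_football / female)   # conditional P(football|female) ~ 0.11

# The two factorisations of the joint agree: both equal 120/500 = 0.24.
print((male_cricket / male) * (male / total))        # P(cricket|male) P(male)
print((male_cricket / cricket) * (cricket / total))  # P(male|cricket) P(cricket)

# Law of total probability: P(cricket) from the male and female parts = 0.39.
print((male_cricket / male) * (male / total)
      + (female_cricket / female) * (female / total))
```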

This intuition leads us straight to Bayes’ Theorem:

P(A and B) = P(A|B) P(B) = P(B|A) P(A), and hence

P(A|B) = P(B|A) P(A) / P(B)

If this looks like mathematical jugglery and you feel lost, don’t worry, you are not alone! The next sections hopefully will relieve that uneasy feeling.

The second insight we get from the conditional probability equation above is the law of total probability

P(A) = P(B1 and A) + P(B2 and A)

The total probability of event A is the sum of two possibilities. Either the events B1 and A occur together, OR events B2 and A occur together. Here we assume that B1 and B2 are:

  • Mutually exclusive, only one of them can be true, and
  • Collectively exhaustive, one of them must be true.

Going back to our table, the probability that a person loves football (A) is the sum of the proportions of males (B1) and females (B2) who love football, the proportions taken over the total survey number.

Or the probability that the gender is male (A) can be seen as the sum of the proportions of the cricket (B1), football (B2), and other (B3) male sports lovers, again taken over the total survey number.

A subtle difference between being likable and predictable (difference between Likelihood and Probability!)

The distinction between probability and likelihood is fundamentally important: probability attaches to possible results; likelihood attaches to hypotheses.

Possible results/outcomes are mutually exclusive and exhaustive. For example, predicting the outcome of each of 10 tosses of a coin has only 11 possible results (0 to 10 correct predictions). The actual result will always be one and only one of the possible results. Thus, the probabilities attached to the possible results must sum to 1. 

But hypotheses do not have these properties. The set of hypotheses to which we attach likelihoods is limited only by our capacity to dream them up. In practice, we can rarely be confident that we have imagined all the possible hypotheses. All we can do is judge which hypothesis is more 'likely' than another.

Take the example of an actuarial exam (we will refer to this more later): we can make two hypotheses, one assuming that it takes an average of 200 hours of study to pass an SP-level exam, the other assuming that it takes 400 hours. Now if, out of 100 students who passed, 60 state that they studied about 450 hours (can't really trust them, you know, but still!), then our second hypothesis appears more likely.


Now to Bayesian Decision Theory:

Adapted from Bayesian Decision Theory (hacettepe.edu.tr)

As mentioned earlier, Bayesian decision theory is a statistical approach to pattern classification. It tries to quantify the trade-offs between classifications using the probabilities and costs of decisions. A fundamental assumption is that all the probabilities needed for decision-making are known.

Let us consider the simplest two-class classification problem. For two classes ω1 and ω2, let the prior probabilities for an unknown new observation be:

  • P(ω1) : the new observation belongs to class 1
  • P(ω2) : the new observation belongs to class 2
  • P(ω1) + P(ω2) = 1 

The textbook example considers a fisherman trying to classify his catch between sea bass (ω1) and salmon (ω2). He knows these chances from experience: in a given season there is a higher probability of catching sea bass, or in a particular area the probability of getting salmon is higher. It helps to specify the structure of the problem:

  • State of nature ω: two classes or states fully represent the state space; they are exclusive & exhaustive, e.g., ω1 for sea bass and ω2 for salmon in our problem. 
  • Priors P(ω1) and P(ω2): these probabilities reflect our prior knowledge of how likely we are to encounter a specific state of nature before we actually observe it, e.g., how likely we expect to catch a sea bass versus a salmon.

Decision rule from priors only: when no feature of the new object is available, our decision rule is:

Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2

The above decision rule appears reasonable if all our information were only priors!

  • But we will always choose the same fish (or always predict failure in the exam, no more actuaries!)
  • If the priors have equal probabilities (or uniform), then we will make poor decisions
  • But under the given assumptions, no other rule can do better!

We can also look at errors; remember, the objective of the Bayes decision rule is to minimize error. We always choose the state that has the maximum probability, so with only two classes, the probability of error is the smaller of the two priors. Hence

P(error) = min[ P(ω1), P(ω2) ]
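As a tiny sketch of this rule (the 0.7/0.3 priors are my own illustrative numbers, not from the lecture):

```python
# Priors-only decision: with no features, always pick the larger prior;
# the probability of error is then the smaller prior.
priors = {"sea bass": 0.7, "salmon": 0.3}   # assumed numbers for the demo
decision = max(priors, key=priors.get)      # always the same choice!
p_error = min(priors.values())
print(decision, p_error)                    # sea bass 0.3
```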

So far, so good; now let us see how the Bayes theorem gets into the picture. We saw features and feature spaces in the first class. A feature is an observable variable of the state; for example, the length, width, location of the dorsal fin, or weight are all observable variables for the fish problem above. Now suppose we have access to one of these features and have observed frequency information: for example, we have noted down the weight of each fish and whether it is a sea bass or a salmon.

We can put these observations into a histogram by counting them in suitable bins and then draw an approximation of their probability distributions. The probability distribution curves represent the continuous approximation of the frequency tables. Since the observations are made for each class (of fish), these are class-conditional probability densities.

Suppose the class-conditional distributions are as below. (Remember, this is additional information we have observed; it has nothing to do with our priors yet.)

Now we introduce a new term: the likelihoods. The class-conditional probabilities are called likelihoods because they represent how likely a fish weighing, say, 12 units on the x-axis is to be a sea bass or a salmon. Note that this is purely based on our observations and has nothing to do with the prior beliefs.

These likelihoods provide another rule for deciding:

Decide ω1 if P(x|ω1) > P(x|ω2) ; ω2 otherwise

But this likelihood rule doesn't consider any of the prior knowledge (the priors) used above. The Bayesian formulation helps combine prior knowledge and class-conditional probabilities into a single rule. Let us start from the joint probability of finding a pattern that is in category ωj and has the feature value x. As we have seen above, we can write the joint probability density p(ωj, x) in two ways:

P(ωj |x) p(x) = p(x|ωj ) P(ωj )

What the first/left part of the equation really says is that P(ωj|x) is the appropriate probability distribution that would have produced the observed pattern x. If we were prophetic, this would have been our prior! But we are neither psychic nor prophetic; we started with something we thought was right, the priors P(ωj). The difference is balanced by p(x|ωj) on the right-hand side of the equation.

Rearranging gives the Bayes formula:

P(ωj|x) = p(x|ωj) P(ωj) / p(x)

From the law of total probability, we can get the probability density function of x as p(x) = p(x|ω1)P(ω1) + p(x|ω2)P(ω2); ω1 and ω2 are exclusive and exhaustive, and we can split our observations into any such partition. This p(x) also turns out to be the marginal distribution of x, which is the probability of the data under any hypothesis.

The Bayesian world gives the terms of this equation specific names:

Posterior = [ likelihood * prior ] / evidence 

Bayes formula shows that by observing the value of x, we can convert the prior probability P(ωj ) to the posterior probability P(ωj |x); the probability of the state of nature being ωj given that feature value x has been measured. Another way to think of the Bayes theorem is that we can update the probability of a hypothesis, H, considering some data, D. 

p(Hypothesis | Data) = [p( Data | Hypothesis) * p(Hypothesis)] / p(Data) 

Now we can frame our Decision Rule based on posterior distributions: 

Decide ω1 if P(ω1|x) > P(ω2|x); otherwise decide ω2

The evidence factor p(x) is common to both sides; it is there to guarantee that the posterior is a probability distribution (sums to 1). When comparing the posteriors, only the proportional parts matter:

Hence our Bayesian decision rule: 

Decide ω1 if p(x|ω1) P(ω1) > p(x|ω2) P(ω2); otherwise decide ω2

The impact of the priors can be seen in the picture below. In this case, given that a pattern is measured to have feature value x = 14, the probability that it is in category ω2 is roughly 0.08, and that it is in ω1 is 0.92. At every x, the posteriors sum to 1.0.
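To make the update concrete, here is a hedged numerical sketch; the Gaussian class-conditional densities and the priors are assumptions of mine for illustration, so the numbers will not match the slide's 0.92/0.08 split:

```python
# Bayes update at a single feature value x (illustrative numbers only).
from scipy.stats import norm

p_w1, p_w2 = 2 / 3, 1 / 3                         # assumed priors

def like1(x):                                     # p(x|w1), assumed Gaussian
    return norm.pdf(x, loc=11, scale=1.0)

def like2(x):                                     # p(x|w2), assumed Gaussian
    return norm.pdf(x, loc=14, scale=1.0)

x = 14.0
evidence = like1(x) * p_w1 + like2(x) * p_w2      # p(x), law of total probability
post1 = like1(x) * p_w1 / evidence                # P(w1|x)
post2 = like2(x) * p_w2 / evidence                # P(w2|x)
print(post1 + post2)                              # posteriors always sum to 1
print("decide w1" if post1 > post2 else "decide w2")
```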

If this seems complicated to comprehend, please see here for an alternative explanation; the highly skewed P(ω1) affects the posterior for ω2

Error Analysis

Bayes decision theory is fundamentally about minimizing errors. Whenever we make a decision for a particular x, the probability of error is

P(error|x) = P(ω1|x) if we decide ω2 or = P(ω2|x) if we decide ω1 

Clearly, for a given x, we minimize the probability of error by choosing the class with the higher posterior. But what about the average probability of error? If we ensure that P(error|x) is as small as possible for every x, then the average over all x must also be as small as possible. Thus we have justified that the Bayes decision rule minimizes the probability of error.


Generalizing the Bayesian Decision Theory 

The ideas above can be generalized in four ways:

  • by allowing more than one feature: x ∈ ℝ^D is a feature vector, 
  • by allowing more than two states of nature: {ω1, ω2, . . . , ωc}, a finite set of classes, 
  • by allowing actions other than merely deciding the state of nature: {α1, α2, . . . , αa}, a finite set of actions. We now allow the possibility of rejection, i.e., no decision in close cases, when such actions are not too costly, 
  • by introducing a loss function more general than the probability of error: λ(αi|ωj), i = 1, 2, . . . , a and j = 1, 2, . . . , c, denotes the loss incurred for taking action αi when the state of nature is ωj.

Bayesian decision theory assumes we fully know:

  • P(x|ωj ), j = 1, 2, . . . , c : class conditional probability density function or likelihood 
  • P(ωj ), j = 1, 2, . . . , c : prior probabilities, and
  • Posterior probabilities P(ωj|x), j = 1, 2, . . . , c, which can be calculated using the Bayes formula mentioned above, where now the evidence is p(x) = Σj p(x|ωj) P(ωj), summed over j = 1 to c.

Suppose that we observe a particular x and that we take action αi.

If the actual state of nature is ωj, the loss is λ(αi|ωj). Since P(ωj|x) is the probability that the actual state of nature is ωj, the expected loss associated with taking action αi is

R(αi|x) = Σj λ(αi|ωj) P(ωj|x), summed over j = 1 to c.

The expected loss is called a risk, and R(αi |x) is called conditional risk. For a particular observation x, we can minimize our expected loss by selecting the action that minimizes the conditional risk.
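A small numerical sketch of this rule (the loss matrix is assumed; the posteriors echo the 0.92/0.08 example above):

```python
# Conditional risk: expected loss of each action given the posteriors.
import numpy as np

posteriors = np.array([0.92, 0.08])   # P(w1|x), P(w2|x)
loss = np.array([[0.0, 2.0],          # lambda(a1|w1), lambda(a1|w2)
                 [1.0, 0.0]])         # lambda(a2|w1), lambda(a2|w2)

risk = loss @ posteriors              # R(ai|x) = sum_j lambda(ai|wj) P(wj|x)
best = int(np.argmin(risk))           # Bayes rule: take the minimum-risk action
print("risks:", risk, "-> take action a%d" % (best + 1))
```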

Decision Rule in generalized setup

In this generalized setup, our problem is to find a decision rule against P(ωj ) that minimizes the overall risk. A general decision rule is a function α(x) that tells us which action (from α1, … , αa) to take for every possible observation. 

Clearly, if α(x) is chosen so that the conditional risk R(α(x)|x) is as small as possible for every x, then the overall risk will be minimized; this justifies the following statement of the Bayes decision rule: 

To minimize the overall risk, compute the conditional risk R(αi|x) (as defined above) for i = 1,…, a and select the action αi for which R( αi|x) is minimum.

The resulting minimum overall risk is called the Bayes risk, denoted R*, and is the best performance that can be achieved.

A two-category Classification and selection threshold:

Consider a two-class state of nature. Let

  • α1 – action deciding that the true state of nature is ω1, and
  • α2 – action deciding that the true state of nature is ω2
  • λij = λ(αi|ωj), the loss incurred for deciding ωi when the true state of nature is ωj.

The conditional risk for any observation x then becomes

  • R(α1|x) = λ11P(ω1|x) + λ12P(ω2|x) if we choose ω1, and 
  • R(α2|x) = λ21P(ω1|x) + λ22P(ω2|x) if we choose ω2

Then the updated decision rule is 

Decide ω1 if R (α1|x ) < R (α2|x )

or Decide ω1 if λ11 P(ω1|x) + λ12 P(ω2|x) < λ21 P(ω1|x) + λ22 P(ω2|x )

λ21 = λ(α2|ω1) is the loss incurred for being wrong (deciding ω2 when the truth is ω1), and λ11 = λ(α1|ω1) is the loss incurred for being right. Ordinarily, the loss incurred for making an error is greater than the loss incurred for being correct, so both factors λ21 − λ11 and λ12 − λ22 are positive.

Rearranging the rule:

Decide ω1 if (λ21 − λ11) P(ω1|x) > (λ12 − λ22) P(ω2|x)

Using the Bayes theorem we write the previous strategy in terms of prior and likelihood: 

Decide ω1 if:

(λ21 − λ11) P(ω1) P(x|ω1) > (λ12 − λ22) P(ω2) P(x|ω2)

P(x|ω1) / P(x|ω2) > [ (λ12 − λ22) P(ω2) ] / [ (λ21 − λ11) P(ω1) ]

likelihood ratio > quantity independent of x 

ψ(x) > c, where ψ(x) = P(x|ω1) / P(x|ω2)

The Bayes rule can thus be interpreted as deciding ω1 if the likelihood ratio exceeds a threshold value that is independent of x.

The assumption so far is that we know the class-conditional densities. In a practical setting, we learn the likelihoods from the training dataset: the threshold c carries the prior (and loss) information, and ψ(x) acts as a classifier whose parameters are to be learned from the data.
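Putting the pieces together, with the same assumed Gaussian likelihoods, priors, and losses as in the sketches above:

```python
# Likelihood-ratio classifier: decide w1 when psi(x) exceeds the threshold c.
from scipy.stats import norm

p_w1, p_w2 = 2 / 3, 1 / 3                 # assumed priors
l11, l12, l21, l22 = 0.0, 2.0, 1.0, 0.0   # assumed losses

c = ((l12 - l22) * p_w2) / ((l21 - l11) * p_w1)   # threshold, independent of x

def psi(x):                               # likelihood ratio p(x|w1)/p(x|w2)
    return norm.pdf(x, loc=11, scale=1.0) / norm.pdf(x, loc=14, scale=1.0)

for x in (11.0, 12.5, 14.0):
    print(x, "-> decide", "w1" if psi(x) > c else "w2")
```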

Loss functions

Regression tasks use loss functions such as quadratic and linear difference, where there is a natural ordering of the predictions. Predictions that are “more wrong” than others can meaningfully be penalized. The symmetrical or zero-one loss function adopts a more straightforward approach, assigning no loss to a correct decision and assigning a unit loss to any error; thus, all mistakes are equally costly.

With a zero-one loss function, the conditional risk R(αi|x) becomes the sum of the posteriors of all states other than ωi, that is, 1 − P(ωi|x). The Bayes decision rule minimizes risk by selecting the action that minimizes the conditional risk; hence, to minimize the average probability of error, we should choose the i that maximizes the posterior probability P(ωi|x). In other words, for the minimum error rate:

Decide ωi if P(ωi|x ) > P(ωj |x ) for all j not equal to i

This is the same rule seen above with posterior probabilities.

We saw the class-conditional probability densities and the posterior probabilities (the red and black curves) above for the two-class case. The picture below shows the likelihood ratio p(x|ω1)/p(x|ω2) for the same case. The threshold value θa marked comes from the same prior probabilities but with a zero-one loss function.

If we penalize mistakes in classifying ω1 patterns as ω2 more than the reverse, i.e., λ21 > λ12, we obtain the threshold θb. Naturally, the range of x values for which we classify a pattern as ω1 gets smaller.


Disclaimer:

I currently work full-time at Swiss Re, Bengaluru. The blogs and articles on this website www.balajos.com are the personal posts of myself, Balachandra Joshi, and only contain my personal views, thoughts, and opinions. It is not endorsed by Swiss Re (or any of my formal employers), nor does it constitute any official communication of Swiss Re.

Also, please note that the opinions, views, comprehensions, impressions, deductions, etc., are my takes on the vast resources I am lucky to have encountered. No individuals or entities, including the Indian Institute of Science and NSE Talent Sprint, who have shown me where to research, or the actuarial professional bodies that provide me continuous professional growth support, are responsible for any of these views; these musings do not by any stretch of imagination represent their official stands; they may not subscribe to, support, or confirm any of these views and hence cannot be held liable in any vicarious way. All the information is shared in the public space purely to spread knowledge, without any commercial advantage.