
Class 1 – Machine Learning – An actuary’s learning log


Young actuaries are often overwhelmed by established insurance domain practices by the time they pass the specialization subjects close to qualification. Rightly so, as insurance is inherently a complex subject. There is so much to learn and practice that the emerging AI/ML area usually isn’t a priority for most.

But this isn’t helpful; ultimately, insurance is a data-driven industry. Moreover, the explosion of data and computing capability is creating a wave of advances in the statistical approaches used to leverage data in many sectors, some of which, such as medical imaging, sit close to the actuarial field.

This learning log provides a learning path for those interested in continuing to learn and getting ahead of the pack. Actuaries can handle, and will need, more rigorous education in AI/ML, since many of the core concepts are already familiar. The IISc course provided me with a good theoretical grounding and context that I can relate to the insurance domain.

This first blog of the series briefly touches upon hype vs. reality in ML, a classification of data types that differs from the actuarial one, and the general types of models. Most of these are covered in CS1 and CS2; here, a more rigorous vector representation of data is introduced. This vector and matrix representation is where the traditional actuarial treatment of regression and GLMs differs.

Links to the module learnings are below:


Class 1; 26th June 2022:

  1. Beginning on AI/ML Learning
  2. Machine Learning: Hype and Reality
  3. Real-world data and Models
  4. Representation of Data 
  5. Different types of data
  6. Properties of Data 
  7. Models
  8. Some of the most used AI models
  9. Models in broader areas
  10. Data Representation 
  11. Data: In vector space representation 
  12. Vector-Based similarity measures
  13. A simple decision Rule based on Means
  14. Linear Discriminant Analysis
  15. Probability and random variables


Last month, the third module of the ongoing certification program on Digital Health and Imaging, Machine Learning Basics for Real-World, began. I have been looking forward to this module. This third module and the following two, Deep Learning in Digital Health and Deep Learning in Imaging/Vision, are the main reasons I enrolled in the certification.

The IISc faculty have very diverse teaching styles. In the first module, Digital Health – Introduction, Professor Phaneendra covered the overall landscape of Digital Health and provided tons of reference material. This module triggered a good number of insights I have been blogging about; if you wish to know more, you can follow them here.

The second module was Wearable Devices and Physiological Signal Processing. The basics of signal processing and the physiological processes generating digital health markers (ECG, EEG, other health imaging, and the more recent wearable constructs) were totally new to me. Honestly, I struggled through the module and barely scraped through the test to pass it. But the concepts are so fascinating that I am sure I will return to them to explore further.

I have been dabbling in AI/ML for a couple of years. Given that my commute to the office takes at least two hours per day, the pandemic gave me one extra day every week! A simple search throws up an overwhelming number of appealing resources for learning AI/ML. However, a suitable learning path depends upon one’s current maths/linear algebra, coding familiarity, and understanding of data structures. And, of course, on the patience to persist through the initial stages; it does feel like learning an alien language. Everybody’s needs and starting points are different; hence discovering suitable methods and tools is quite a task.

The most common places to start are MOOC platforms like Udemy or Coursera. They are marketed well; it feels like a shopping spree, and one usually ends up purchasing far too many courses. Then there are books and websites; I dabbled through Humble Book Bundles, accumulated Springer books, and many GitHub repositories. Andrew Ng’s course and Machine Learning Mastery by Jason Brownlee were the most insightful.

But at the end of two years, this new knowledge somehow felt shallow. The rigor of a purpose stringing the new learnings together was missing. Hence I am going through a more formal certificate program. The course defines the Digital Health application space, builds the foundations, and develops an understanding of the physiological processes that generate the data before broaching AI/ML concepts. This approach appeals to my methodical learning habit from the actuarial field. AI/ML are just tools; one needs a well-defined space in which to apply them to be confident about the learning outcomes.

Actuaries are likely to have an advantage. With exposure to Bayesian statistics, maximum likelihood estimators, GLMs, and partial differential equations, the field of AI/ML feels familiar. However, new-age actuaries must learn AI/ML to prepare for the upcoming digital transformation, and I hope this structured learning log will help a few of them. The module contains two to three weekend classes of two hours each, plus some reading and coding practice.

This learning log is likely to be super long! I plan to update it week by week, in the order I am learning. You will find many links to the information I have accumulated over time.

1. Beginning on AI/ML Learning

Machine learning can be viewed as a collection of models/methods built on a few foundational probabilistic and statistical principles and applied to solve practical problems. Therefore, a good learning strategy constantly strengthens the foundations, explores relations between different paradigms and methods, and, most importantly, constantly experiments.

2. Machine Learning: Hype and Reality

Today, machine learning is an iceberg of massive proportions sitting at the top of the Hype Cycle, the Peak of Inflated Expectations. Slowly, bits and pieces of it will get chipped away, falling through the Trough of Disillusionment to be refined into usable products.

Compared to expectations, ML is slow to take off: Business leaders are skeptical and are right not to jump on board immediately. After all, an estimated 85% of AI projects won’t ship.

Implementing modern ML algorithms is R&D work. This aspect may catch non-technical collaborators off guard, as the rigor and uncertainty of R&D elude linear progression and rigid timelines. This hardship is accentuated using algorithms from the bleeding edge of ML research — often, the most powerful algorithms are also the most time-consuming to train and use. Modern ML has matured enough to be successful outside research environments, but the best practitioners are cautious. Applied ML engineers should give an honest opinion on how appropriate ML is for a given problem and be ready to temper AI oversell.

3. Real-world data and Models 

Machine learning involves learning from data. Data is generally observed/collected from various phenomena in the real world. 

Usually, simplified models are assumed to represent such a phenomenon based on some functional knowledge. However, do note that all models are wrong, but some are useful! 

Data obtained from the real world is used to estimate the model’s parameters; these models are then used for making predictions or gaining insights into the real world. 

4. Representation of Data  

Data is often represented as vectors in real space in data science or machine learning. The motive is to find out the patterns and relationships between observations. Visualizing data as vectors and using vector algebra to manipulate them solves many data handling challenges, especially in natural language processing, text classification, and text analysis.

In machine learning and pattern recognition, a feature is an individual measurable property or characteristic of a phenomenon. Therefore, choosing informative, discriminating, and independent features is a crucial element of effective algorithms. These measurable properties (for example, age, name, height, weight) are converted into numerical form. The observations are then arranged in a table, where every column of the relational table is a feature. A feature vector represents a particular observation (row) of that table; row ‘n’ is the feature vector for the ‘n’th sample.

Examples of feature vectors include those corresponding to a speech signal, to the attributes of a region used to predict housing prices, to the pixels of an image, or to a word or sentence in natural language text.

The process of converting or transforming a data set into a set of vectors is called vectorization. It is straightforward when the attributes are already numeric; what about textual data? “Word embedding” is the process of representing words or text as vectors. There are many techniques; in count vectorization, the unique words across the collection of texts form a ‘corpus’ (punctuation is usually ignored), and this corpus is then used to vectorize each sentence. Read here for a quick intuition.
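As a quick illustration, here is a minimal count-vectorization sketch using scikit-learn’s CountVectorizer; the sentences are made-up examples, not from the course material.

```python
# A minimal count-vectorization sketch; sentences are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "the claim was settled quickly",
    "the claim was rejected",
    "settled claims close quickly",
]

vectorizer = CountVectorizer()           # builds the corpus vocabulary, ignoring punctuation
X = vectorizer.fit_transform(sentences)  # sparse matrix: one count vector per sentence

print(vectorizer.get_feature_names_out())  # the corpus of unique words (recent scikit-learn versions)
print(X.toarray())                         # each sentence represented as a vector of word counts
```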

“We all are vectors. Many of us are still in search of direction”

5. Different types of data 

ML/AI needs data as numbers to take advantage of algebraic techniques. There are several ways to categorize data.

Structured Data: They are usually stored in relational databases and can be easily searched using SQL queries. 

Numeric/Quantitative data are represented through discrete or continuous numbers. Usually, there is no need to pre-process to convert them into numbers.

Categorical/Qualitative data is represented through words. Therefore, some pre-processing needs to be done to convert them into numbers. 

Categorical data can be ordinal, with an inherent ordering within the categories. For instance, movie ratings of good, average, and bad have a sense of importance attached to them, so the ranking/ordering must be preserved when converting them into numbers. Label encoding orders these levels from worst to best and assigns each a number from 0 to n-1.

If categorical data is nominal, the categories have no particular order or ranking, and the total number of categories is usually finite. If a numbering scheme like label encoding is used, the model will infer a hierarchy that isn’t there, and the results will be inaccurate. Hence one-hot encoding is done, creating a column for each unique category: a value of 1 is assigned if that category is present in the row, and 0 otherwise. One-hot encoding can produce a lot of columns and increase the dimensionality of the data; pooling the categories into smaller related groups helps.
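A small sketch of these two encodings using pandas; the column names and category levels are illustrative assumptions.

```python
# A minimal sketch of label (ordinal) encoding and one-hot encoding.
import pandas as pd

df = pd.DataFrame({
    "rating": ["bad", "average", "good", "average"],   # ordinal categories
    "city":   ["Pune", "Mumbai", "Pune", "Chennai"],   # nominal categories
})

# Label encoding: preserve the ranking bad < average < good as 0..n-1
order = {"bad": 0, "average": 1, "good": 2}
df["rating_encoded"] = df["rating"].map(order)

# One-hot encoding: one column per unique nominal category, 1 if present else 0
df = pd.get_dummies(df, columns=["city"], dtype=int)

print(df)
```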

Unique – Some data like membership id might have a unique value for each sample, and the number of categories is usually large. This data type usually doesn’t contain much predictive capability and is removed during pre-processing.

Relational Data – Tabular data – This is usually the collection of data from multiple data type sources, for example, data collected during surveys. The tabular data consists of multiple features/columns, each of which might have different data types, like numbers, dates, and text. Each feature is converted into a numerical representation.   

Unstructured data: This type of data is usually composed of everything else, including texts, images, videos, speech/audio, time series, etc. 

Sequential Data – Time series data is usually already numeric: a sequence of ordered data points, each with a timestamp. The timestamp is an inherent independent variable of this data.

Sentences – Text data is composed of multiple words occurring such that they make sense as a whole. Text can be converted into numerical representation using many techniques mentioned above.

Spatially Regular Data – Images are grids formed of smaller units known as pixels. For example, a 28×28 image has 784 pixels, 28 in width and 28 in height; with three colour channels (RGB), the shape of each frame is 28x28x3. A video consists of multiple frames/images, and the number of frames per second is called the frame rate. Thirty such frames combine into video data of dimensions 30x28x28x3, where 30 is the number of frames, 28×28 the image size, and 3 the number of channels (RGB). Each pixel records an intensity on a scale of 0 to 255.
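A minimal numpy sketch of these shapes (the pixel values are random placeholders, not real image data):

```python
# Illustrative image/video array shapes; values are random placeholders.
import numpy as np

frame = np.random.randint(0, 256, size=(28, 28, 3), dtype=np.uint8)      # one 28x28 RGB frame
video = np.random.randint(0, 256, size=(30, 28, 28, 3), dtype=np.uint8)  # 30 such frames

print(frame.shape)              # (28, 28, 3): height x width x channels
print(video.shape)              # (30, 28, 28, 3): frames x height x width x channels
print(frame.reshape(-1).shape)  # (2352,): a frame 'flattened' into a feature vector
```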

6. Properties of Data 

In the ML/AI context, data is often characterized by the 5 Vs.

  • Volume: Scale of Data. With the growing world population and technological exposure, colossal data is generated every millisecond.
  • Variety: Different forms of data – healthcare, images, videos, audio clippings.
  • Velocity: Rate of data streaming and generation.
  • Value: Meaningfulness of data in terms of information that researchers can infer.
  • Veracity: Certainty and correctness in the available data.

Data quality is crucial to serving its purpose in a particular context (such as data analysis, for example). Desired data quality characteristics are accuracy, completeness, reliability, relevance, and timeliness.

7. Models 

A model is an abstraction of the real world, an informative representation of an object, person, or system. In ML/AI, a model is a tool or algorithm based on a specific data set through which it can arrive at a decision. An overly complex model is usually of no use; a model should be flexible enough to represent the phenomenon of interest yet still tractable.

Both AI and ML are part of data science, contributing to creating intelligent systems. However, AI is a larger concept associated with building machines that can simulate human behavior and intelligence. On the other hand, ML is a subset within AI associated with providing machines the ability to learn from experience without the need to be programmed explicitly. So, while all ML models are by default AI models, the opposite may not always be true.

(Image source: https://tvst.arvojournals.org/article.aspx?articleid=2762344)

8. Some of the most used AI models

Linear Regression – e.g., Linear Gaussian model

Used extensively in statistics, Linear Regression is a model based on supervised learning, finding relationships between the input and output variables. Once relations are known, the model attempts to predict the value of a dependent variable for a new observation. Linear regression models are widely used in various industries, including banking, healthcare, insurance, etc.

Logistic Regression – Similar in spirit to the linear regression model, logistic regression is the preferred method for solving binary classification problems. It is a statistical model that predicts the class of the dependent variable from a set of given independent variables. Linear Discriminant Analysis (LDA) is a closely related classification method, usually used when the output has two or more well-separated classes. These models are helpful in computer vision, medicine, etc.

Naive Bayes – Naive Bayes is a simple yet effective AI helpful model for solving many complicated problems. It is based on the Bayes Theorem and is mainly applied for classification. The model works on the assumption that the features are independent. Since this assumption is rarely valid, the model is called ‘naive’. This model is used for both binary and multiple-class classifications. Some of its applications include medical data classification and spam filtering. 

Support Vector Machines – SVM is a quick and efficient model for analyzing limited amounts of data, applicable for binary classification problems. Compared to newer technologies such as artificial neural networks, SVM is faster and performs better with a dataset of limited samples – such as in text classification problems. This model is a supervised ML algorithm that can be used for classification, outlier detection, and regression problems.

Decision Tree models arrive at conclusions based on the data from past decisions. A simple, efficient, and extremely popular model, the Decision Tree is named so because the way the data is divided into smaller portions resembles the structure of a tree. This model can be applied to both regression and classification problems.

Learning Vector Quantization or LVQ is a type of Artificial Neural Network that works on the winner-takes-all principle. It processes information by preparing a set of codebook vectors that are then used to classify other unseen vectors. It is used for solving multi-class classification problems.

K-nearest Neighbours or kNN is a simple supervised ML model for solving regression and classification problems. The algorithm assumes that similar things (data) exist near each other. While it is a powerful model, one of its major disadvantages is that prediction slows down as the data volume grows.

Random Forest is a valuable ensemble learning model for solving regression and classification problems. It operates using multiple decision trees and makes the final prediction using the bagging method. To simplify, it builds a ‘forest’ with numerous decision trees, each trained on different data subsets, and merges the results to come up with more accurate predictions.

Deep Neural Networks – Inspired by the human brain’s neural network, these are built from interconnected units known as artificial neurons. A neural network consists of several connected units called nodes, in analogy to neurons. Unfortunately, the analogy ends there; the biological neuron is far more complex.

When a neuron receives a signal, it triggers a process. The signal is passed from one neuron to another based on the input received. A complex network is formed that learns from feedback. The nodes are grouped into layers. Then, a task is solved by processing the various layers between the input and output layers. The greater the number of layers to be processed, the deeper the network, therefore the term, deep learning. 

DNN models find application in several areas, including speech recognition, image recognition, and natural language processing. 
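To make this list concrete, here is a minimal scikit-learn sketch that fits several of the models above on a synthetic dataset; the dataset and hyperparameters are illustrative assumptions, not from the course.

```python
# A rough comparison of common classifiers on a synthetic dataset (illustrative only).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "kNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)                              # train on the training split
    acc = accuracy_score(y_test, model.predict(X_test))      # evaluate on held-out data
    print(f"{name}: {acc:.3f}")
```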

9. Models in broader areas 

Some models map naturally to broader problem areas:

  • Clustering
  • Hidden Markov models – discrete-valued time series
  • Linear dynamical systems – continuous-valued time series
  • Restricted Boltzmann machines – data with latent variables
  • Stochastic block models – networks

Hidden Markov Models (HMMs) are a class of probabilistic graphical models that allow us to predict a sequence of unknown (hidden) variables from a set of observed variables. An example of an HMM is predicting the weather (hidden variable) based on the type of clothes that someone wears (observed). 

It is called a Hidden Markov Model because we are constructing an inference model based on the assumptions of a Markov process. The Markov process assumption is that the “future is independent of the past given the present”. In other words, assuming we know our present state, we do not need any additional historical information to predict the future state.

Hidden Markov models are known for their applications to thermodynamics, statistical mechanics, physics, chemistry, economics, finance, signal processing, information theory, and pattern recognition – such as speech, handwriting, gesture recognition, part-of-speech tagging, musical score following, partial discharges, and bioinformatics.
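A minimal numpy sketch of the weather/clothes example above: given assumed transition, emission, and initial probabilities (all made up here), the forward algorithm computes the likelihood of an observed clothing sequence.

```python
# Forward algorithm for a tiny HMM; all probabilities are illustrative assumptions.
import numpy as np

states = ["Sunny", "Rainy"]        # hidden states
obs_names = ["T-shirt", "Coat"]    # observed symbols

A = np.array([[0.8, 0.2],          # transition: P(next state | current state)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],          # emission: P(observation | state)
              [0.3, 0.7]])
pi = np.array([0.7, 0.3])          # initial state distribution

observations = [0, 1, 1]           # T-shirt, Coat, Coat

# alpha[s] = P(observations so far, current state = s)
alpha = pi * B[:, observations[0]]
for t in range(1, len(observations)):
    alpha = (alpha @ A) * B[:, observations[t]]

print("P(observation sequence) =", alpha.sum())
```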

10. Data Representation

Data from the real world is always “raw”. Most machine learning models work well only when fed appropriate and useful features; hence features have to be learned or extracted.

Learned: The model/algorithms automatically learn the useful features.

Extracted: Hand-crafted features defined by a domain expert.

In AI/ML, models are trained with pre-processed data; the data is converted into numerical form to present a structured feature list. In supervised learning, the learning data comes as a set of input-output pairs,

    \[ \mathbf{\left \{ \left ( x_{n}, y_{n} \right ) \right \}}_{n=1}^{N} \]

whereas unsupervised learning works on just the observations:

    \[ \mathbf{\left \{ x_{n} \right \}}_{n=1}^{N} \]

Each input x_{n} is usually a D-dimensional feature vector, with features 1, 2, 3, …, D. For example, if x_{n} is a 7 × 7 pixel image, it is ‘flattened’ and represented as a vector of 49 pixel intensities.

Output y_{n} can be real-valued (e.g., regression), categorical (e.g., classification), or a structured object (e.g., structured output learning), representing the observed outcome.

In certain applications, such as protein sequences, the input x_{n} need not be a fixed-length vector. The learning task becomes harder and harder as the dimensionality of the data grows.

11. Data: In vector space representation 

Input data for ML/AI models are represented as vectors in multi-dimensional vector space; mentally, constructing visuals beyond three dimensions is challenging. 

A vector can be written as a column matrix and, by transposing it, as a row matrix. All three representations are equivalent:

\left ( 1, 3, 6, 8, 12 \right ) and \begin{bmatrix} 1\\ 3\\ 6\\ 8\\ 12 \end{bmatrix} are the same vector, with transpose \begin{bmatrix}1 & 3 & 6 & 8 & 12 \end{bmatrix}.

Each feature vector \mathbf{x_{n} \in \mathbb{R}^{D \times 1}} is a point in the D dimensional vector space \mathbb{R}^{D}.

By putting data in a vector space, we can incorporate all tools provided by Linear Algebra in our problem solving; more importantly, matrix computations play an essential role in machine learning.

12. Vector-Based similarity measures

Vector space provides us with distance and similarity measures. 

Distance measures play an essential role in machine learning. For example, they provide the foundation for many popular and effective machine learning algorithms like k-nearest neighbors for supervised learning and k-means clustering for unsupervised learning.

A distance measure is an objective score that summarizes the relative difference between two objects in a problem domain. Most commonly, the two objects are rows of data that describe a subject (such as a person, car, or house), or an event (such as a purchase, a claim, or a diagnosis).

In the kNN algorithm, a classification or regression prediction is made for a new example by calculating the distance between the new example (row) and all examples (rows) in the training dataset. The k training examples with the smallest distances are then selected, and a prediction is made by aggregating their outcomes (the mode of the class labels for classification or the mean of the real values for regression).

The most commonly used distance measures in machine learning are Hamming Distance, Euclidean Distance, Manhattan Distance, and Minkowski Distance.

The Euclidean distance formula gives the straight-line distance between two points. The Euclidean distance between x_{n}, x_{m} \in \mathbb{R}^{D} (two observations with D features) is

\mathbf{d\left ( x_{n}, x_{m} \right ) = \left \| x_{n} - x_{m} \right \| = \sqrt{\left ( x_{n} - x_{m} \right )^{T}\left ( x_{n} - x_{m} \right )} = \sqrt{\sum_{d=1}^{D}\left ( x_{nd} - x_{md} \right )^{2}}}

The formula follows from the Pythagorean theorem; consider the 2-D case to visualize it.

The vector transpose is what produces the squares. For example, consider two points in 2-D space:

x = \left (5, 6 \right ) and y = \left ( 3, 2 \right ); Then \left ( x - y \right ) = \left ( 5 - 3, 6 - 2 \right )

Expressed as a column vector, \left ( x - y \right ) = \begin{bmatrix} 5 - 3 \\  6 - 2 \end{bmatrix}, and its transpose is the row vector \left ( x - y \right )^{T} = \begin{bmatrix} 5 - 3 & 6 - 2 \end{bmatrix}.

Hence you can see that \left ( x -y \right )^{T} \left ( x - y \right ) = \left ( 5 - 3 \right )^{2} + \left ( 6 - 2 \right )^{2}
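A quick numpy check of this worked example:

```python
# Verify the 2-D Euclidean distance example above with numpy.
import numpy as np

x = np.array([5.0, 6.0])
y = np.array([3.0, 2.0])

diff = x - y
squared = diff @ diff            # (x - y)^T (x - y) = (5-3)^2 + (6-2)^2 = 20
print(squared)
print(np.sqrt(squared))          # Euclidean distance
print(np.linalg.norm(x - y))     # same result via numpy's norm
```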

Cosine similarity is a measure of similarity between two sequences of numbers. Cosine similarity is the dot product of the vectors divided by the product of their lengths; this is nothing but the cosine of the angle between them.  

The cosine of 0° is 1, and it is less than 1 for any angle in the interval (0,π] radians. It is thus a judgment of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors oriented at 90° relative to each other have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude.

The inner product between x_{n}, x_{m} \in \mathbb{R}^{D} is

\mathbf{\left \langle x_{n}, x_{m} \right \rangle = x_{n}^{T}x_{m} = \sum_{d=1}^{D}x_{nd} x_{md} }

Since x_{n}^{T} is 1 × D and x_{m} is D × 1, the product is a scalar, the dot product shown in the summation. Dividing this inner product by \left \| x_{n} \right \| \left \| x_{m} \right \| gives the cosine similarity.
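A minimal numpy sketch of the inner product and cosine similarity; the two vectors are arbitrary illustrations.

```python
# Inner product and cosine similarity for two example vectors.
import numpy as np

x_n = np.array([1.0, 3.0, 6.0, 8.0, 12.0])
x_m = np.array([2.0, 1.0, 5.0, 7.0, 10.0])

inner = x_n @ x_m                                             # x_n^T x_m
cosine = inner / (np.linalg.norm(x_n) * np.linalg.norm(x_m))  # dot product / product of lengths
print(inner, cosine)
```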

13. A simple decision Rule based on Means

The advantage of matrix representation can be demonstrated by a simple decision rule setup. If we have two groups/clusters of data, a simple rule for a new observation can be to group it with the nearest group. The closeness can be taken as the distance between the means of the two groups.

Consider N labelled training examples \left \{ x_{n}, y_{n} \right \}_{n=1}^{N} from two classes (+ve and -ve), with N_{+} examples from the +ve class and N_{-} examples from the -ve class.

Rule: Assign test sample to class with the closest mean.

Let \mu _{+} and \mu _{-} be the means calculated from the N_{+} and N_{-} examples respectively. Once these are calculated, we no longer need the individual observations. Using the Euclidean distance mentioned above, the squared distances of a new observation x from the two means are:

\left \| \mu_{-} - x\right \|^{2} = \left \| \mu _{-} \right \|^{2} + \left \| x \right \|^{2} - 2\left \langle \mu _{-}, x \right \rangle

\left \| \mu_{+} - x\right \|^{2} = \left \| \mu _{+} \right \|^{2} + \left \| x \right \|^{2} - 2\left \langle \mu _{+}, x \right \rangle

Note \left \langle a, b \right \rangle denotes the inner product of the vectors and

\left \| a \right \|^{2} = \left \langle a, a \right \rangle denotes squared norm of a.

If we define the rule f : \chi \rightarrow \left \{ +1, -1 \right \} as:

\mathbf{f\left ( x \right ) = \left \| \mu _{-} - x \right \|^{2} - \left \| \mu _{+} - x \right \|^{2} = 2\left \langle \mu _{+} - \mu _{-}, x \right \rangle + \left \| \mu _{-} \right \|^{2} - \left \| \mu _{+} \right \|^{2} }

Our decision rule can be: if f(x) > 0, assign x to class +1; otherwise assign it to class -1.

Or y = sign[f(x)]

This simple rule provides a useful starting point for classification; this specific form of decision rule appears in many supervised algorithms.
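Here is a minimal numpy sketch of this nearest-mean rule on made-up 2-D data; it is just an illustration of the formula above, not code from the class.

```python
# Nearest-mean decision rule on synthetic 2-D data (illustrative assumptions).
import numpy as np

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(50, 2))    # +ve class samples
X_neg = rng.normal(loc=[-2.0, -2.0], scale=1.0, size=(50, 2))  # -ve class samples

mu_pos = X_pos.mean(axis=0)   # class means are all we keep from the training data
mu_neg = X_neg.mean(axis=0)

def f(x):
    # f(x) = ||mu_- - x||^2 - ||mu_+ - x||^2
    #      = 2 <mu_+ - mu_-, x> + ||mu_-||^2 - ||mu_+||^2
    return 2 * (mu_pos - mu_neg) @ x + mu_neg @ mu_neg - mu_pos @ mu_pos

x_new = np.array([1.0, 0.5])
print(np.sign(f(x_new)))      # +1 -> assign to the +ve class, -1 -> the -ve class
```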

I have to admit I struggled with conceptualizing that f(x) represents a hyperplane, where w = µ+ − µ− is the direction normal to the hyperplane. I will need a proper revisit of vector space concepts, and a follow-up blog. If you are curious, please read here and here.

This easy-to-implement rule needs a large amount of training data for each class to obtain a decent estimate of the means. Also, if we have class-imbalanced data (classes with skewed proportions), this decision rule will not work well. For nonlinear decision boundaries, the Euclidean distance can be replaced by a nonlinear distance function: kernels use a mapping function to project nonlinear combinations of the original features onto a higher-dimensional space, where the data becomes linearly separable.

14. Linear Discriminant Analysis

Instead of the distance or inner product, more general similarity measures can be used. For example, we can assume underlying probability distributions, with related statistics such as mean and variance, for the different classes. Class-conditional probability distributions are more informative than the distance from the mean: instead of assigning the new observation to the class with the nearest mean, we can assign it to the class under which it has the maximum probability.

Linear Discriminant Analysis (LDA) is one such method. LDA makes some simplifying assumptions about the data:

  • Data is Gaussian, and each variable is shaped like a bell curve when plotted. 
  • Each attribute has the same variance; the values of each variable vary around the mean by the same amount on average.

With these assumptions, the LDA model estimates the mean and variance for each class. Predictions are then made by estimating the probability that a new input belongs to each class and choosing the class with the highest probability.
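A minimal scikit-learn sketch of LDA on synthetic data (the dataset is an illustrative assumption):

```python
# LDA on a synthetic dataset: fit, then predict the class with the highest probability.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=200, n_features=5, random_state=1)

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)                     # estimates class means and (shared) covariance

print(lda.predict(X[:5]))         # predicted class for the first few rows
print(lda.predict_proba(X[:5]))   # per-class probabilities behind those predictions
```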

15. Probability and random variables

Basics of probability, marginal and conditional probability, Bayes Formula, random variables, expected values, and variances are introduced. Since these concepts are introduced to actuaries at an early stage, they are not summarized here.

Bayes’ formula is quite versatile; even after knowing the formulation for a long time, the class of problems it helps solve is impressive. I have been collecting interesting applications separately since the beginning of this IISc course; they are worth summarizing in a separate blog.
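As one standard textbook illustration (not from the course), here is the classic “rare condition, imperfect test” calculation via Bayes’ formula; all the numbers are assumptions.

```python
# P(condition | positive test) via Bayes' formula; the probabilities are assumed.
prevalence = 0.01        # P(condition)
sensitivity = 0.95       # P(positive | condition)
false_positive = 0.05    # P(positive | no condition)

p_positive = sensitivity * prevalence + false_positive * (1 - prevalence)
posterior = sensitivity * prevalence / p_positive   # Bayes' formula

print(round(posterior, 3))   # about 0.161 - far lower than many people's intuition
```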


Disclaimer:

I currently work full-time at Swiss Re, Bengaluru. The blogs and articles on this website www.balajos.com are the personal posts of myself, Balachandra Joshi, and only contain my personal views, thoughts, and opinions. It is not endorsed by Swiss Re (or any of my formal employers), nor does it constitute any official communication of Swiss Re.

Also, please note that the opinions, views, comprehensions, impressions, deductions, etc., are my takes on the vast resources I am lucky to have encountered. No individuals or entities, including the Indian Institute of Science and NSE Talent Sprint who have shown me where to research, or the actuarial professional bodies that support my continuous professional growth, are responsible for any of these views. These musings do not by any stretch of imagination represent their official stands; they may not subscribe to, support, or confirm any of these views, and they cannot be held liable in any vicarious way. All information here is shared in the public space purely to share knowledge, without any commercial advantage.