# Machine Learning Glossary

## Introduction

This post briefly explains popular (and unpopular) concepts in Machine Learning. The idea came from my own struggle to find good-quality explanations of various Machine Learning concepts on the web. Unlike similar posts, this one also links to good-quality resources and to related concepts for a more holistic understanding. Hopefully, it will be helpful to people who are just starting out in Machine Learning as well as to those who need a quick refresher on some concepts.

**Didn’t find what you were looking for? Consider contributing by creating a pull request on this post here**.

## Jump to

A . B . C . D . E . F . G . H . I . J . K . L . M . N . O . P . Q . R . S . T . U . V . W . X . Y . Z

## A

**AUC**: AUC is the **A**rea **U**nder the Receiver Operating Characteristic (ROC) **C**urve. The ROC curve is obtained by varying the classification threshold of a binary classifier and plotting the true positive rate (TPR) against the false positive rate (FPR) at each threshold. It is a popular classification performance metric with several nice properties, such as being independent of the decision threshold and being robust to class imbalance in the data.
- Useful links: Video Explanation of AUC | Probabilistic interpretation of AUC
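
As a minimal sketch of the probabilistic interpretation, AUC can be computed as the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The function name `roc_auc` and the toy data are assumptions made for illustration; this pairwise version is quadratic in the number of examples and is not how libraries usually compute it.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy data (assumed): 3 of the 4 positive/negative pairs are ranked correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

Note that no threshold appears anywhere in the computation, which is why the metric does not depend on the choice of decision threshold.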

## B

**Bagging**: Bagging is a procedure that produces several different training sets of the same size by sampling the original training data with replacement, and then trains a machine learning model on each set. Predictions are produced by majority vote in a classification task and by averaging in a regression task. Bagging helps reduce the variance of models.
- Also see: Random Forest
- Useful links: Video explanation by Udacity | Blog post on Medium
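
The procedure can be sketched in a few lines. The base model below (`train_stump`, a threshold halfway between the two class means of a single feature) is a hypothetical stand-in chosen only to keep the example short, and it assumes class 1 has the larger feature values; any learner could take its place.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) examples from data with replacement."""
    return [rng.choice(data) for _ in data]


def train_stump(sample):
    """Hypothetical weak model on (x, y) pairs: threshold between the class means."""
    xs0 = [x for x, y in sample if y == 0]
    xs1 = [x for x, y in sample if y == 1]
    if not xs0 or not xs1:
        # Degenerate bootstrap sample: always predict the only class seen.
        only = 0 if xs0 else 1
        return lambda x: only
    threshold = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
    return lambda x: int(x > threshold)


def bagging_fit(data, n_models=25, seed=0):
    """Train one model per bootstrap sample."""
    rng = random.Random(seed)
    return [train_stump(bootstrap_sample(data, rng)) for _ in range(n_models)]


def bagging_predict(models, x):
    """Classification: majority vote over the individual models."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


data = [(0.5, 0), (1.0, 0), (1.5, 0), (3.5, 1), (4.0, 1), (4.5, 1)]
models = bagging_fit(data)
print(bagging_predict(models, 0.8), bagging_predict(models, 4.2))
```

For a regression task, `bagging_predict` would average the model outputs instead of voting.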

**Bias Variance Trade-off**: Bias here refers to the difference between the average prediction of a model and the target value the model is trying to predict. Variance refers to the variability in the model’s predictions for a given data point, caused by its sensitivity to small fluctuations in the training set. If a model is too simple and has very few parameters, it may have high bias and low variance. On the other hand, if a model has a large number of parameters, it may have high variance and low bias. Thus, we need to find a good balance between bias and variance so that the model neither overfits nor underfits the data.
- Useful links: Video explanation by Trevor Hastie | Blog post on towardsdatascience

**Boosting**: Boosting is an ensemble method for improving the predictions of any given learning algorithm. The idea is to train weak learners sequentially, each trying to correct its predecessor, and to combine them into a strong learner. A weak learner is defined as a classifier that is only slightly correlated with the true classification (it labels examples only slightly better than random guessing). In contrast, a strong learner is a classifier that is arbitrarily well-correlated with the true classification.
- Also see: Bagging
- Useful links: Lecture by Patrick Winston | Boosting wiki

## C

**Classification**: Classification is the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training set of data containing observations whose category membership is known.
- Also see: Boosting | Decision Trees | K-Nearest Neighbor | Logistic Regression | Random Forest | Naive Bayes Classifier
- Useful links: Classification Wiki

**Correlation**: Correlation is a statistical technique that can show whether and how strongly pairs of variables are related. Pearson’s Correlation Coefficient is used to measure the strength of correlation between two variables.
- Useful links: Blog post on Correlation | Detailed Explanation of Correlation | Blog post by surveysystem

**Curse of Dimensionality**: As the number of features (dimensions) in a model grows, the amount of data needed for the model to generalize well grows exponentially, which in turn increases the storage space and processing time required by the modeling algorithm. In this sense, the value added by an additional dimension becomes much smaller than the overhead it adds to the algorithm.
- Also see: Dimensionality Reduction
- Useful links: Video explanation by Trevor Hastie | Elaborate post on Medium

## D

**Decision Tree**: A Decision Tree can be used to visually and explicitly represent decisions and decision making. Each non-leaf node in the tree represents a decision based on one of the features in the dataset. The leaves of the tree represent the final output after a series of decisions: for classification, the output is the class chosen by majority vote among the training examples in the leaf, and for regression, it is the average value of those examples. The feature used to make the decision at each step is chosen such that the information gain is maximized.
- Also see: Boosting | Random Forest
- Useful links: Video Lecture by Patrick Winston | Blog post on towardsdatascience
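
To illustrate the splitting criterion, here is a small sketch of information gain for a categorical feature, computed as the entropy of the parent node minus the size-weighted entropy of the child nodes. The function names and the toy `outlook`/`windy` features are assumptions made for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(labels, feature_values):
    """Entropy of the parent node minus the weighted entropy of the children."""
    total = len(labels)
    children = {}
    for value, label in zip(feature_values, labels):
        children.setdefault(value, []).append(label)
    weighted = sum(len(ch) / total * entropy(ch) for ch in children.values())
    return entropy(labels) - weighted


labels = ["yes", "yes", "no", "no"]
outlook = ["sunny", "sunny", "rain", "rain"]  # separates the classes perfectly
windy = [True, False, True, False]            # carries no information

print(information_gain(labels, outlook))  # 1.0
print(information_gain(labels, windy))    # 0.0
```

A tree builder following this criterion would split on `outlook` first, since it has the larger gain.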

**Dimensionality Reduction**: The goal of dimensionality reduction methods is to find a low-dimensional representation of the data that retains as much information as possible. This low-dimensional data representation in turn helps in fighting the Curse of Dimensionality.
- Also see: Principal Component Analysis
- Useful links: Video Explanation by Robert Tibshirani | Blog post on towardsdatascience

## E

**Expectation Maximization**: The Expectation-Maximization (EM) algorithm is a way to find maximum likelihood estimates for model parameters when the data is incomplete, has missing data points, or has unobserved (hidden) latent variables. It alternates between an expectation (E) step and a maximization (M) step, iterating until it converges to a (possibly local) maximum of the likelihood function.
- Useful links: Introductory blog post by me | Advanced blog post by me

## F

**False Positive Rate**: The false positive rate is calculated as the ratio between the number of negative events wrongly categorized as positive (false positives) and the total number of actual negative events (regardless of classification).
- Useful links: False Positive Rate Wiki
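
A short sketch of this ratio, assuming labels are encoded as 0 (negative) and 1 (positive); the function name and the toy data are illustrative only.

```python
def false_positive_rate(y_true, y_pred):
    """FP / (FP + TN): share of actual negatives that were predicted positive."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    if fp + tn == 0:
        raise ValueError("no actual negatives in y_true")
    return fp / (fp + tn)


y_true = [0, 0, 0, 0, 1, 1]
y_pred = [1, 0, 0, 0, 1, 0]
print(false_positive_rate(y_true, y_pred))  # 0.25 (1 of 4 negatives flagged)
```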

## G

**Gradient Descent**: Gradient Descent is an optimization technique that minimizes a loss function by computing the gradients of the loss function with respect to the model’s parameters, conditioned on training data. Informally, gradient descent iteratively adjusts the parameters in the direction of the negative gradient, gradually finding the combination that minimizes the loss.
- Useful links: Blog post on towardsdatascience | Blog post on kdnuggets
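
The update rule can be sketched on a one-parameter toy problem. The loss $L(w) = (w - 3)^2$, the learning rate, and the step count are all assumptions picked for illustration; real models have many parameters and a loss that depends on the training data.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient: w <- w - learning_rate * grad(w)."""
    w = start
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w


# Toy loss L(w) = (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3.
w_star = gradient_descent(lambda w: 2 * (w - 3), start=0.0)
print(w_star)  # very close to 3
```

With this loss each step shrinks the distance to the minimum by a constant factor, so the iterate converges quickly; a learning rate that is too large would instead overshoot and diverge.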

## H

**Hinge Loss**: Hinge loss is used in the context of classification problems and is defined as $l(y) = \max(0, 1 - t \cdot y)$, where $t \in \{-1, +1\}$ is the actual output and $y$ is the classifier’s score. Observing the function, we can see that the classifier is penalized unless it classifies a data point correctly with a margin of at least 1 (that is, $t \cdot y \geq 1$). This leads to “maximum-margin” classification where each training data point is as far from the classifier’s decision boundary as possible.
- Also see: Support Vector Machines
- Useful links: Hinge Loss Wiki
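
The formula translates directly into code. In this sketch the true label `t` is assumed to be encoded as -1 or +1 and `y` is the raw classifier score.

```python
def hinge_loss(t, y):
    """Hinge loss max(0, 1 - t * y) for a label t in {-1, +1} and a score y."""
    return max(0.0, 1.0 - t * y)


print(hinge_loss(1, 2.0))   # 0.0: correct, and outside the margin
print(hinge_loss(1, 0.5))   # 0.5: correct, but inside the margin
print(hinge_loss(-1, 0.5))  # 1.5: wrong side of the decision boundary
```

The middle case shows why a correct prediction can still be penalized: the score is on the right side of the boundary but too close to it.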

## I

**Information Gain**: See Kullback–Leibler Divergence.

## J

**Jaccard Similarity**: Jaccard Similarity is a statistic used for comparing the similarity and diversity of finite sample sets. It is defined as the size of the intersection divided by the size of the union of the sample sets $\left(J(A, B) = \frac{|A \cap B|}{|A \cup B|}\right)$.
- Also see: Correlation
- Useful links: Jaccard Similarity Wiki | Explanation with examples
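
A minimal sketch of the definition using Python sets. Returning 1.0 for two empty sets is a convention assumed here, since the formula itself is undefined in that case.

```python
def jaccard_similarity(a, b):
    """|A intersect B| / |A union B| for two finite collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention: two empty sets are identical
    return len(a & b) / len(a | b)


print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))  # 0.5 (2 shared of 4 total)
```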

## K

**K-Nearest Neighbor**: KNN is essentially a classification technique that finds the $K$ data points in the training data which are most similar to an unseen data point, and takes a majority vote among them to classify it. KNN is a non-parametric method, which means that it does not make any assumptions about the underlying data distribution. The performance of KNN methods depends on the data representation and the definition of closeness/similarity.
- Useful links: Video explanation by Trevor Hastie | Blog post on Medium
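
The idea fits in a few lines. This sketch assumes Euclidean distance as the notion of closeness and a brute-force search over the training set; the function name and the toy points are illustrative.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """train is a list of (point, label) pairs; points are equal-length tuples."""
    # Sort all training points by Euclidean distance to the query and keep k.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


train = [
    ((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
    ((6, 6), "b"), ((7, 6), "b"), ((6, 7), "b"),
]
print(knn_predict(train, (1.5, 1.5)))  # a
print(knn_predict(train, (6.5, 6.5)))  # b
```

Swapping `math.dist` for another distance function changes the definition of similarity, which is exactly the design choice the paragraph above says the method is sensitive to.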

**Kullback–Leibler Divergence**: Kullback–Leibler divergence is a measure of how one probability distribution differs from a second, reference probability distribution. A familiar use case: when we replace observed data or a complex distribution with a simpler approximating distribution, we can use KL divergence to measure how much information we lose by choosing the approximation.
- Useful links: Blog post on countbayesie | Blog post on towardsdatascience
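
For discrete distributions the divergence is $D_{KL}(P \| Q) = \sum_i p_i \log \frac{p_i}{q_i}$. The sketch below assumes both distributions are given as equal-length lists of probabilities and that $q_i > 0$ wherever $p_i > 0$; the result is in nats because the natural logarithm is used.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.5]  # "true" distribution (toy example)
q = [0.9, 0.1]  # approximating distribution (toy example)
print(kl_divergence(p, q))
print(kl_divergence(q, p))  # a different value: KL divergence is not symmetric
```

The asymmetry is worth remembering: the information lost when approximating `p` by `q` is generally not the same as when approximating `q` by `p`.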

## L

**Linear Regression**: Linear regression models a linear relationship between a scalar dependent variable (usually called the target) and one or more independent variables (usually called predictors). It can be used for forecasting outcomes once the model parameters are learned using supervision from a relevant dataset. Additionally, the learned model parameters can be used to explain the strength of the relationship between the target and the predictors (a procedure known as linear regression analysis). The model parameters are usually learned by minimizing the mean squared error.
- Useful links: Video playlist from Stanford | Blog post on towardsdatascience
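
For a single predictor, the parameters that minimize the mean squared error have a closed form: the slope is the ratio of the covariance of $x$ and $y$ to the variance of $x$, and the intercept follows from the means. The sketch below covers only this one-predictor case, with an assumed function name and toy data.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y ~ slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]  # generated from y = 2x + 1 without noise
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```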

**Logistic Regression**: Logistic regression models the probability of a binary outcome given some predictor variables which influence the outcome. Like linear regression, it uses a linear function of the predictor variables, but then transforms it into a probability using the logistic function ($\sigma(z) = \frac{1}{1 + e^{-z}}$). The model parameters are usually learned by maximizing the likelihood of the observed data.
- Also see: Maximum Likelihood Estimation
- Useful links: Video explanation by Trevor Hastie | Blog post on towardsdatascience
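
The prediction step can be sketched as follows. The weights and bias here are made-up values rather than learned ones, and the plain `math.exp` call is not guarded against overflow for very large negative inputs.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """P(outcome = 1 | x) for a linear score passed through the sigmoid."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)


weights, bias = [2.0, -1.0], 0.0              # hypothetical parameters
print(predict_proba(weights, bias, [1.0, 2.0]))  # 0.5, since the score z is 0
print(predict_proba(weights, bias, [3.0, 1.0]))  # above 0.5, since z is positive
```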

## M

**Maximum Likelihood Estimation**: Maximum likelihood estimation is a method of estimating the parameters $\theta$ of a statistical model such that the likelihood function $L(\theta; x)$, which is a function of the model parameters given observed data $x$, is maximized. Intuitively, this selects the parameters $\theta$ that make the observed data most probable.
- Useful links: Video explanation by Trevor Hastie | Blog post on towardsdatascience
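
As a worked example, consider coin flips modeled as Bernoulli trials with unknown success probability $\theta$. The sketch below finds the maximizer of the log-likelihood by a simple grid search (an approach chosen only for transparency) and compares it with the known closed-form answer, the sample mean. The data are made up.

```python
import math


def bernoulli_log_likelihood(theta, data):
    """log L(theta; data) for independent 0/1 observations."""
    return sum(math.log(theta) if x == 1 else math.log(1 - theta) for x in data)


data = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]    # 7 successes out of 10 (toy data)
grid = [i / 100 for i in range(1, 100)]  # candidate values 0.01 ... 0.99
theta_hat = max(grid, key=lambda t: bernoulli_log_likelihood(t, data))

print(theta_hat)              # 0.7
print(sum(data) / len(data))  # 0.7, the closed-form maximum likelihood estimate
```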

## N

**Naive Bayes Classifier**: The Naive Bayes Classifier is based on Bayes’ Theorem. It assumes that, given the class, the presence of a particular feature is unrelated to the presence of any other feature, so that all features contribute independently to the class probability.
- Useful links: Video Explanation by Trevor Hastie | Blog post on towardsdatascience

## O

**Ordinal Classification**: Same as Ordinal Regression.

**Ordinal Regression**: Ordinal Regression is used for predicting ordinal outcomes, i.e. outcomes whose values exist on an arbitrary scale where only the relative ordering between different values is significant, based on various predictor variables. This is why it is considered an intermediate problem between regression and classification. Usually, an ordinal regression problem is reduced to multiple binary classification problems with the help of threshold parameters, such that a classifier score falling between two consecutive thresholds corresponds to one of the ordinal outcomes.

## P

**Pearson’s Correlation Coefficient**: The correlation coefficient ($\rho$) ranges from -1 to +1. The closer $\rho$ is to +1 or -1, the more closely the two variables are linearly related; if it is close to 0, the variables have no linear relation with each other. It is defined as $\rho_{X, Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_{X} \sigma_{Y}}$.
- Useful links: Pearson Correlation Wiki
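
The definition can be sketched directly from sample data. The $1/n$ factors in the covariance and in the two standard deviations cancel, so the code works with plain sums; the function name and the toy data are assumptions for illustration.

```python
import math


def pearson_correlation(xs, ys):
    """Cov(X, Y) / (sigma_X * sigma_Y), estimated from paired samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


xs = [1, 2, 3, 4]
print(pearson_correlation(xs, [2, 4, 6, 8]))  # 1.0: perfect positive relation
print(pearson_correlation(xs, [8, 6, 4, 2]))  # -1.0: perfect negative relation
```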

**Precision**: If we are given a set of instances, precision is the fraction of relevant instances (those correctly classified into a certain class $C$) among the retrieved instances (those predicted to belong to class $C$). A perfect precision score of 1.0 means that every result retrieved by a search was relevant, but says nothing about whether all relevant documents were retrieved.
- Also see: Recall
- Useful links: Blog post on towardsdatascience | Precision and Recall Wiki
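
Precision and its counterpart Recall can be computed side by side from true and predicted labels. In this sketch the function name, the label values, and the choice to return 0.0 when a denominator is zero are all assumptions.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class, from paired true/predicted labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


y_true = ["C", "C", "C", "other", "other", "other"]
y_pred = ["C", "other", "other", "other", "other", "other"]
precision, recall = precision_recall(y_true, y_pred, positive="C")
print(precision, recall)  # precision is 1.0, recall is one third
```

The toy classifier is very cautious: the single item it assigns to class `C` is correct, giving perfect precision, yet it misses two of the three actual members of `C`.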

**Principal Component Analysis**: PCA is a statistical procedure that transforms a set of observations of possibly correlated variables into a set of observations of linearly uncorrelated variables called principal components. The transformation is defined in such a way that the first principal component has the largest possible variance, and each succeeding component in turn has the largest possible variance under the constraint that it is orthogonal to the preceding components. Using only the few components that capture most of the variance in the data helps in fighting the Curse of Dimensionality.

## Q

## R

**Random Forest**: Random Forest is a supervised learning algorithm that builds an ensemble of Decision Trees, where each decision tree is allowed to use a fixed number of randomly chosen features. The decision trees are trained using the Bagging technique, and the outputs of the trees are merged to get a more accurate and stable prediction.
- Also see: Boosting
- Useful links: Blog post on towardsdatascience | Blog post on Medium

**Recall**: If we are given a set of instances, recall is the fraction of relevant instances (belonging to a certain class $C$) that have been retrieved (or correctly classified in $C$) over the total number of relevant instances. A recall of 1.0 means that every item from class $C$ was labeled as belonging to class $C$, but does not say anything about other items that were incorrectly labeled as belonging to class $C$.
- Also see: Precision
- Useful links: Blog post on towardsdatascience | Precision and Recall Wiki

**Regression**: Regression is the problem of approximating a mapping function ($f$) from input variables ($X$) to a continuous output variable ($y$), on the basis of a training set of data containing observations in the form of input-output pairs.
- Also see: Linear Regression
- Useful links: Video Explanation by Trevor Hastie

**Relative Entropy**: See Kullback–Leibler Divergence.

## S

**Sensitivity**: Same as Recall.

**Specificity**: If we are given a set of instances, specificity measures the proportion of actual negatives (instances not belonging to a particular class) that are correctly identified as such (e.g., the percentage of healthy people who are correctly identified as not having the condition).
- Useful links: Specificity Wiki

**Standard Score**: Same as Z-score.

**Supervised Learning**: Supervised learning is the task of learning a function that maps an unseen input to an output as accurately as possible, based on example input-output pairs known as training data.
- Also see: Classification | Regression
- Useful links: Coursera Video Explanation | Supervised Learning Wiki

**Support Vector Machine**: A Support Vector Machine (SVM), in simplest terms, is a classification algorithm that aims to find a decision boundary separating two classes such that the closest data points from either class are as far from the boundary as possible. Having a good margin between the two classes contributes to the robustness and generalizability of SVM.
- Also see: Hinge Loss
- Useful links: Blog post by Me | Video Lecture by Patrick Winston

## T

**T-Test**:

**True Positive Rate**: Same as Recall.

**True Negative Rate**: Same as Specificity.

## U

**Unsupervised Learning**: Unsupervised learning is the task of inferring patterns from data without having any reference to known, or labeled, outcomes. It is generally used for discovering the underlying structure of the data.
- Also see: Principal Component Analysis
- Useful links: Blog post by Hackernoon | Coursera Video Explanation

## V

## W

## X

## Y

## Z

**Z-score**: The Z-score measures how many standard deviations a raw score lies below or above the population mean, which gives us a good picture when we want to compare results from a test to a “normal” population.
- Also see: T-Test
- Useful links: Z-score Wiki | Khan Academy tutorial on Z-score
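
In symbols, $z = \frac{x - \mu}{\sigma}$, where $\mu$ and $\sigma$ are the population mean and standard deviation. The sketch below assumes the given values make up the whole population, so it uses the population standard deviation; the function name and the data are illustrative.

```python
import statistics


def z_scores(values):
    """Standardize each value: (value - mean) / population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population (not sample) standard deviation
    return [(v - mu) / sigma for v in values]


values = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population standard deviation 2
print(z_scores(values))  # first value is 1.5 standard deviations below the mean
```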