Machine Learning Glossary

Introduction

The goal of this post is to briefly explain popular (and unpopular) concepts in Machine Learning. The idea stemmed from my struggle to find good-quality explanations of various Machine Learning concepts on the web. Unlike similar posts, here you’ll also find links to good-quality resources and to related concepts for a more holistic understanding. Hopefully, this post will be helpful both to people who are just starting out in Machine Learning and to people who need a quick refresher on some concepts.

Didn’t find what you were looking for? Consider contributing by creating a pull request on this post here.

A

• AUC: AUC is the Area Under the Receiver Operating Characteristic (ROC) Curve. The ROC curve is obtained by varying the classification threshold of a binary classifier and plotting the true positive rate (TPR) against the false positive rate (FPR) at each threshold. AUC is a popular classification performance metric with several nice properties: it does not depend on any single decision threshold, and it is relatively insensitive to class imbalance in the data.
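
As a rough illustration of the AUC entry, the sketch below uses a well-known equivalent view of AUC: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are made up for this example.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correctly ranked pair
    return wins / (len(positives) * len(negatives))


# Toy data (illustrative only): 3 of the 4 positive/negative pairs are ordered correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

The pairwise loop is quadratic in the number of examples, which is fine for illustration; practical implementations usually sort the scores instead.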

B

• Bagging: Bagging (bootstrap aggregating) is a procedure that produces several different training sets of the same size by sampling from the original training data with replacement, and then trains a machine learning model on each set. Predictions are combined by majority vote in a classification task and by averaging in a regression task. Bagging helps reduce the variance of a model.
• Bias-Variance Trade-off: Bias here refers to the difference between the average prediction of a model and the target value the model is trying to predict. Variance refers to the variability in the model’s predictions for a given data point caused by its sensitivity to small fluctuations in the training set. If a model is too simple and has very few parameters, it may have high bias and low variance (underfitting). On the other hand, if a model has a large number of parameters, it may have high variance and low bias (overfitting). Thus, we need to find a good balance between bias and variance so that the model neither underfits nor overfits the data.
• Bootstrapping: Bootstrapping is the process of generating multiple new datasets from the original dataset by sampling from it with replacement. Each new dataset has the same size as the original, and these datasets are called bootstrap samples. It is used in bagging.
• Boosting: Boosting is an ensemble method for improving the predictions of a given learning algorithm. The idea is to train weak learners sequentially, each trying to correct the errors of its predecessor, and to combine them into a strong learner. A weak learner is a classifier that is only slightly correlated with the true classification (it labels examples only somewhat better than random guessing). In contrast, a strong learner is a classifier that is arbitrarily well-correlated with the true classification.
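
To make the Bootstrapping and Bagging entries more concrete, here is a minimal sketch in plain Python. The base model (`fit_threshold_classifier`, a one-dimensional threshold rule) and the toy data are assumptions chosen only to keep the example short; any learner could take its place.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return [rng.choice(data) for _ in range(len(data))]


def fit_threshold_classifier(sample):
    """Toy base model (an assumption for this sketch): split 1-D inputs at the
    midpoint between the two class means."""
    xs0 = [x for x, label in sample if label == 0]
    xs1 = [x for x, label in sample if label == 1]
    if not xs0 or not xs1:
        # The bootstrap sample happened to contain a single class.
        only_label = 1 if xs1 else 0
        return lambda x: only_label
    mean0 = sum(xs0) / len(xs0)
    mean1 = sum(xs1) / len(xs1)
    threshold = (mean0 + mean1) / 2
    upper_label = 1 if mean1 > mean0 else 0
    return lambda x: upper_label if x > threshold else 1 - upper_label


def bagging_fit(data, n_models, seed=0):
    """Train one base model per bootstrap sample."""
    rng = random.Random(seed)
    return [fit_threshold_classifier(bootstrap_sample(data, rng))
            for _ in range(n_models)]


def bagging_predict(models, x):
    """Combine the individual predictions by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Toy 1-D dataset of (input, label) pairs, illustrative only.
data = [(0.0, 0), (0.5, 0), (1.0, 0), (3.0, 1), (3.5, 1), (4.0, 1)]
models = bagging_fit(data, n_models=25)
print(bagging_predict(models, 0.2), bagging_predict(models, 3.8))
```

An odd number of models is used so that a two-class vote cannot end in a tie.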

E

• Elastic Net Regression:
• Entropy:
• Error Analysis:
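• Expectation Maximization: The Expectation-Maximization (EM) algorithm is a way to find maximum likelihood estimates of model parameters when the data is incomplete, has missing data points, or involves unobserved (latent) variables. It is iterative: the expectation (E) step estimates the latent variables given the current parameters, and the maximization (M) step updates the parameters to maximize the resulting expected log-likelihood. The two steps are repeated until convergence, typically at a local (not necessarily global) maximum of the likelihood.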
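
The Expectation Maximization entry can be illustrated with a deliberately simplified sketch: a mixture of two one-dimensional Gaussians in which only the two means are unknown. Fixing the variances at 1 and the mixing weights at 0.5 is an assumption made here to keep the code short; the function name and the toy data are likewise illustrative.

```python
import math


def em_two_gaussians(data, mu_a, mu_b, iterations=50):
    """EM for a mixture of two unit-variance, equally weighted 1-D Gaussians.

    Only the two means are estimated (a simplifying assumption).
    """
    for _ in range(iterations):
        # E-step: probability that each point was generated by component A.
        resp_a = []
        for x in data:
            like_a = math.exp(-0.5 * (x - mu_a) ** 2)
            like_b = math.exp(-0.5 * (x - mu_b) ** 2)
            resp_a.append(like_a / (like_a + like_b))
        # M-step: each mean becomes the responsibility-weighted average.
        total_a = sum(resp_a)
        total_b = len(data) - total_a
        mu_a = sum(r * x for r, x in zip(resp_a, data)) / total_a
        mu_b = sum((1 - r) * x for r, x in zip(resp_a, data)) / total_b
    return mu_a, mu_b


# Toy data: two clusters, around -2 and +2. The component labels are never observed.
data = [-2.1, -1.9, -2.0, 2.0, 1.9, 2.1]
mu_a, mu_b = em_two_gaussians(data, mu_a=-1.0, mu_b=1.0)
print(mu_a, mu_b)
```

The latent variable here is which component produced each point. The sketch does not guard against numerical underflow for points very far from both means.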

F

• False Positive Rate: The false positive rate is the ratio between the number of negative events wrongly categorized as positive (false positives, $FP$) and the total number of actual negative events, regardless of classification: $FPR = \frac{FP}{FP + TN}$, where $TN$ is the number of true negatives.
• Feature Selection:

G

• Generative Classifiers:
• Gradient Descent: Gradient Descent is an optimization technique that minimizes a loss function by computing the gradients of the loss with respect to the model’s parameters on the training data. Informally, gradient descent iteratively adjusts the parameters by taking small steps in the direction opposite to the gradient, gradually approaching a combination of parameters that minimizes the loss.
• Grid Search:
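
As a small illustration of the Gradient Descent entry, the sketch below fits a single slope parameter $w$ in the model $y = w \cdot x$ by repeatedly stepping against the gradient of the mean squared error. The function name, learning rate, step count and toy data are assumptions picked for the example.

```python
def fit_slope(xs, ys, learning_rate=0.05, steps=100):
    """Fit y = w * x by gradient descent on the mean squared error."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # Derivative of mean((w*x - y)^2) with respect to w.
        gradient = sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / n
        # Move against the gradient to reduce the loss.
        w -= learning_rate * gradient
    return w


# Toy training data generated from y = 2x, so the fitted slope should approach 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
print(fit_slope(xs, ys))
```

If the learning rate is too large the updates overshoot and diverge; if it is too small, convergence becomes slow.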

H

• Hinge Loss: Hinge loss is used in the context of classification problems and is defined as $\ell(y) = \max(0, 1 - t \cdot y)$, where $t \in \{-1, +1\}$ is the actual label and $y$ is the classifier’s score. Observing the function, we can see that the classifier is penalized unless it classifies a data point correctly with a margin of at least 1, i.e. $t \cdot y \geq 1$. This leads to “maximum-margin” classification, where training data points are pushed as far from the classifier’s decision boundary as possible.
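
A direct translation of the hinge loss formula into Python, assuming the labels $t$ are encoded as -1 or +1 (the usual convention for this loss). It shows that a correct prediction still incurs a penalty when the score falls inside the margin.

```python
def hinge_loss(t, y):
    """Hinge loss for a true label t in {-1, +1} and a raw classifier score y."""
    return max(0.0, 1.0 - t * y)


print(hinge_loss(+1, 2.0))   # 0.0: correct, outside the margin
print(hinge_loss(+1, 0.5))   # 0.5: correct, but inside the margin
print(hinge_loss(-1, 0.5))   # 1.5: wrong side of the decision boundary
```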

J

• Jaccard Similarity: Jaccard Similarity is a statistic used for comparing the similarity and diversity of finite sample sets. It is defined as the size of the intersection divided by the size of the union of the sample sets $\left(J(A, B) = \frac{|A \cap B|}{|A \cup B|}\right)$.
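
The Jaccard formula maps directly onto Python’s set operations. In the sketch below, returning 1.0 for two empty sets is a convention assumed here, since the formula itself is undefined in that case.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty sets are treated as identical
    return len(a & b) / len(union)


print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))  # 0.5 (2 shared items, 4 in total)
```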

K

• K-Nearest Neighbor: KNN is essentially a classification technique that finds the $K$ data points in the training data that are most similar to an unseen data point and takes a majority vote among their labels to classify it. KNN is a non-parametric method, which means that it does not make any assumptions about the underlying data distribution. The performance of KNN depends on the data representation and on the definition of closeness/similarity.
• Kullback–Leibler Divergence: Kullback–Leibler (KL) divergence is a measure of how one probability distribution differs from a second, reference probability distribution. A familiar use case is replacing observed data or a complex distribution with a simpler approximating distribution: KL divergence then measures how much information we lose by choosing the approximation.
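
For two discrete distributions $P$ and $Q$ over the same outcomes, KL divergence is commonly written as $D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$. The sketch below implements that formula for distributions given as lists of probabilities; the function name and example distributions are illustrative, and natural logarithms are used, so the result is in nats.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) for two discrete distributions given as probability lists."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same outcomes")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # terms with P(x) = 0 contribute nothing
        if qi == 0:
            return math.inf  # Q rules out an outcome that P allows
        total += pi * math.log(pi / qi)
    return total


p = [0.5, 0.5]   # the "true" distribution
q = [0.9, 0.1]   # a poor approximation of it
print(kl_divergence(p, q))
print(kl_divergence(q, p))  # a different value: KL divergence is not symmetric
```

Because $D_{KL}(P \| Q)$ and $D_{KL}(Q \| P)$ generally differ, KL divergence is not a distance metric in the strict sense.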

L

• Lasso Regression:
• Learning Curve:
• Linear Discriminant Analysis:
• Linear Regression: Linear regression models a linear relationship between a scalar dependent variable (usually called the target) and one or more independent variables (usually called predictors). It can be used for forecasting outcomes once the model parameters have been learned from a relevant labeled dataset. Additionally, the learned parameters can be used to explain the strength of the relationship between the target and the predictors (a procedure known as linear regression analysis). The model parameters are usually learned by minimizing the mean squared error.
• Logistic Regression: Logistic regression models the probability of a certain binary outcome given some predictor variables that influence the outcome. Like linear regression, it applies a linear function to the predictor variables, but then transforms the result into a probability using the logistic function $\left( \sigma(z) = \frac{1}{1 + e^{-z}} \right)$. The model parameters are usually learned by maximizing the likelihood of the observed data.
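
The prediction step of logistic regression (a linear score passed through $\sigma$) can be sketched as follows. The weights, bias and feature values are made up for illustration, and the training step is not shown. The two-branch form of `sigmoid` is a common trick to avoid overflow for large negative inputs.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, features):
    """Probability of the positive outcome for one example."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical parameters: the linear score is 2*1 - 1*2 + 0 = 0, so p = 0.5.
print(predict_proba([2.0, -1.0], 0.0, [1.0, 2.0]))  # 0.5
```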

M

• Maximum Likelihood Estimation: Maximum likelihood estimation is a method of estimating the parameters $\theta$ of a statistical model such that the likelihood function $L(\theta; x)$, which is a function of the model parameters given the observed data $x$, is maximized. Intuitively, this selects the parameters $\theta$ that make the observed data most probable.
• Model Selection:
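
A small worked example for the Maximum Likelihood Estimation entry: for coin flips modeled as Bernoulli trials with unknown heads probability $p$, the log-likelihood of the observed flips can be evaluated over a grid of candidate values, and the maximizer coincides with the fraction of heads. The data and the grid search are illustrative choices; in this case the estimate could also be derived in closed form.

```python
import math


def bernoulli_log_likelihood(p, flips):
    """Log-likelihood of 0/1 outcomes under a Bernoulli(p) model."""
    return sum(math.log(p) if x == 1 else math.log(1 - p) for x in flips)


flips = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]      # 7 heads out of 10 (toy data)
grid = [i / 100 for i in range(1, 100)]     # candidate values 0.01 ... 0.99
p_hat = max(grid, key=lambda p: bernoulli_log_likelihood(p, flips))
print(p_hat, sum(flips) / len(flips))
```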

O

• Ordinal Classification: Same as Ordinal Regression.
• Ordinal Regression: Ordinal Regression is used for predicting, from various predictor variables, ordinal outcomes, i.e. outcomes whose values lie on an arbitrary scale where only the relative ordering between different values is significant. For this reason, it is considered an intermediate problem between regression and classification. Usually, an ordinal regression problem is reduced to multiple binary classification problems with the help of threshold parameters, such that a classifier score falling between two consecutive thresholds corresponds to one of the ordinal outcomes.

P

• Pearson’s Correlation Coefficient: Pearson’s correlation coefficient ($\rho$) measures the strength of the linear relationship between two variables and ranges from -1 to +1. The closer $\rho$ is to +1 or -1, the more closely the two variables are linearly related; if it is close to 0, the variables have little or no linear relationship (they may still be related non-linearly). It is defined as $\rho_{X, Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_{X} \sigma_{Y}}$.
• Precision: If we are given a set of instances, precision is the fraction of relevant instances (those that truly belong to a certain class $C$) among the retrieved instances (those classified as belonging to $C$). A perfect precision score of 1.0 means that every result retrieved by a search was relevant, but says nothing about whether all relevant documents were retrieved.
• Principal Component Analysis: PCA is a statistical procedure that transforms a set of observations of possibly correlated variables into a set of observations of linearly uncorrelated variables called principal components. The transformation is defined in such a way that the first principal component has the largest possible variance, and each succeeding component has the largest possible variance under the constraint that it is orthogonal to the preceding components. Using only the few components that capture most of the variance in the data helps in fighting the Curse of Dimensionality.
• Pruning:
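
Pearson’s correlation coefficient can be computed from samples as the covariance divided by the product of the standard deviations. The sketch below does this in plain Python; the function name and the data are illustrative. The second example shows a perfectly deterministic but non-linear relationship ($y = x^2$ on symmetric inputs) whose coefficient is 0.

```python
import math


def pearson_correlation(xs, ys):
    """Sample Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    # The 1/n factors in the covariance and the variances cancel out.
    return cov / math.sqrt(var_x * var_y)


print(pearson_correlation([1, 2, 3, 4], [2, 4, 6, 8]))         # perfectly linear
print(pearson_correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])) # y = x^2
```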

R

• $R^2$:
• Random Forest: Random Forest is a supervised learning algorithm that builds an ensemble of Decision Trees, where each decision tree is allowed to use only a fixed number of randomly chosen features. The decision trees are trained using the Bagging technique, and the outputs of the trees are merged together (by voting or averaging) to get a more accurate and stable prediction.
• Recall: If we are given a set of instances, recall is the fraction of relevant instances (those belonging to a certain class $C$) that have been retrieved (i.e. correctly classified as $C$). A recall of 1.0 means that every item from class $C$ was labeled as belonging to class $C$, but does not say anything about other items that were incorrectly labeled as belonging to class $C$.
• Regression: Regression is the problem of approximating a mapping function ($f$) from input variables ($X$) to a continuous output variable ($y$), on the basis of a training set of data containing observations in the form of input-output pairs.
• Relative Entropy: See Kullback–Leibler Divergence.
• Ridge Regression:
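
Precision and Recall, as defined in this glossary, can be computed from counts of true positives (TP), false positives (FP) and false negatives (FN): precision is TP / (TP + FP) and recall is TP / (TP + FN). The sketch below uses made-up labels; returning 0.0 when a denominator is zero is a convention assumed here.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumed convention
    recall = tp / (tp + fn) if tp + fn else 0.0     # assumed convention
    return precision, recall


# A cautious classifier: its single positive prediction is right (precision 1.0),
# but it misses two of the three actual positives (recall 1/3).
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0]
print(precision_recall(y_true, y_pred))
```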

S

• Sensitivity: Same as Recall.
• Specificity: If we are given a set of instances, specificity measures the proportion of actual negatives (instances not belonging to a particular class) that are correctly identified as such (e.g., the percentage of healthy people who are correctly identified as not having the condition).
• Standard Score: Same as Z-score.
• Standard Error:
• Stratified Cross Validation:
• Supervised Learning: Supervised learning is the task of learning a function that maps an unseen input to an output as accurately as possible, based on example input-output pairs known as training data.
• Support Vector Machine: Support Vector Machine (SVM), in simplest terms, is a classification algorithm that aims to find a decision boundary separating two classes such that the closest data points from either class are as far from the boundary as possible. Having a large margin between the two classes contributes to the robustness and generalizability of SVM.
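
Specificity can be computed as TN / (TN + FP), where TN and FP are the counts of true negatives and false positives. Since the false positive rate is FP / (FP + TN), the two always sum to 1. A minimal sketch with made-up labels:

```python
def specificity(y_true, y_pred, positive=1):
    """Fraction of actual negatives that are predicted as negative."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    if tn + fp == 0:
        raise ValueError("specificity is undefined without actual negatives")
    return tn / (tn + fp)


# Four actual negatives, one of them wrongly flagged as positive.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
spec = specificity(y_true, y_pred)
print(spec, 1 - spec)  # 0.75 0.25 (specificity, false positive rate)
```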

T

• T-Test: The t-test is a type of inferential statistic used to determine whether there is a significant difference between the means of two groups. The t-test assumes that the data in the two groups are approximately normally distributed and calculates a t-value (analogous to a z-score, but based on variances estimated from the samples), which corresponds to a certain probability value (p-value). The p-value is the probability of observing a difference at least as extreme as the one measured if the two groups actually had the same mean. If it falls below a certain agreed-upon threshold, the t-test concludes that the means of the two groups are significantly different.
• True Positive Rate: Same as Recall.
• True Negative Rate: Same as Specificity.
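
To illustrate the T-Test entry, the sketch below computes the t-value for two independent samples using Welch’s variant, which does not assume equal variances in the two groups. That choice, the function name and the data are assumptions of this example. Converting the t-value into a p-value requires the t-distribution, which is not in Python’s standard library; a statistics package such as SciPy (`scipy.stats.ttest_ind` with `equal_var=False`) can do both steps.

```python
import math
import statistics


def welch_t_statistic(a, b):
    """t-value for the difference between the means of two independent samples."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    # statistics.variance is the sample variance (divides by n - 1).
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (mean_a - mean_b) / standard_error


group_a = [1, 2, 3, 4, 5]   # mean 3, sample variance 2.5
group_b = [3, 4, 5, 6, 7]   # mean 5, sample variance 2.5
print(welch_t_statistic(group_a, group_b))  # -2.0
```

The further the t-value is from 0, the smaller the corresponding p-value.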