Machine Learning Glossary
Introduction
The goal of this post is to briefly explain popular (and unpopular) concepts in Machine Learning. The idea stemmed from my travails in finding good-quality explanations of various Machine Learning concepts on the web. Unlike similar posts on the web, here you'll also find links to good-quality resources and to related concepts for a more holistic understanding. Hopefully, this post will be helpful to people who are just starting out in Machine Learning as well as to people who need a quick refresher on some concepts.
Didn't find what you were looking for? Consider contributing by creating a pull request on this post here.
A
 AUC: AUC is the Area Under the Receiver Operating Characteristic (ROC) Curve. The ROC curve is obtained by varying the classification threshold of a binary classifier and plotting the true positive rate (TPR) against the false positive rate (FPR) at each threshold. It is a popular classification performance metric and has several nice properties, such as being independent of the decision threshold and being fairly robust to class imbalance in the data.
 Useful links: Video Explanation of AUC  Probabilistic interpretation of AUC
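The probabilistic interpretation of AUC can be sketched in a few lines: AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. This is a minimal illustration, not an efficient implementation; the function name `auc_from_scores` and the 0/1 label encoding are assumptions made for this example.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


example_auc = auc_from_scores([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Here three of the four positive/negative pairs are ordered correctly, so `example_auc` is 0.75.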
B

Bagging: Bagging (bootstrap aggregating) is a procedure that draws several training sets of the same size from the original data by sampling with replacement and then trains a machine learning model on each set. The predictions are produced by taking a majority vote in a classification task and by averaging in a regression task. Bagging helps reduce the variance of the resulting model.
 Also see: Random Forest
 Useful links: Video explanation by Udacity  Blog post on Medium

Bias-Variance Tradeoff: Bias here refers to the difference between the average prediction of a model and the target value the model is trying to predict. Variance refers to the variability in the model predictions for a given data point because of its sensitivity to small fluctuations in the training set. If our model is too simple and has very few parameters, then it may have high bias and low variance. On the other hand, if our model has a large number of parameters, then it may have high variance and low bias. Thus, we need to find a good balance between bias and variance so that the model neither overfits nor underfits the data.
 Useful links: Video explanation by Trevor Hastie  Blog post on towardsdatascience

Bootstrapping: Bootstrapping is the process of drawing multiple samples from the dataset with replacement. Each sample is the same size as the dataset, and the samples are called bootstrap samples. It is used in bagging.
 Also see: Bagging
 Useful links: Bootstrapping wiki  Blog post by machinelearningmastery
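A minimal sketch of drawing bootstrap samples, using only the standard library. The helper names `bootstrap_sample` and `bootstrap_means` are hypothetical, and estimating the spread of the sample mean is just one illustrative use of the resampled data.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]


def bootstrap_means(data, n_samples=1000, seed=0):
    """Mean of each bootstrap sample; their spread approximates the
    sampling variability of the mean."""
    rng = random.Random(seed)  # seeded for reproducibility
    means = []
    for _ in range(n_samples):
        sample = bootstrap_sample(data, rng)
        means.append(sum(sample) / len(sample))
    return means
```

Because sampling is done with replacement, a bootstrap sample usually contains some points more than once and omits others.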

Boosting: Boosting is an ensemble method for improving the predictions of any given learning algorithm. The idea is to train weak learners sequentially, each trying to correct its predecessor, to build a strong learner. A weak learner is defined to be a classifier that is only slightly correlated with the true classification (it labels examples only somewhat better than random guessing). In contrast, a strong learner is a classifier that is arbitrarily well-correlated with the true classification.
 Also see: Bagging
 Useful links: Lecture by Patrick Winston  Boosting wiki
C
 Classification: Classification is the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training set of data containing observations whose category membership is known.
 Also see: Boosting  Decision Tree  K-Nearest Neighbor  Logistic Regression  Random Forest  Naive Bayes Classifier
 Useful links: Classification Wiki
 Confusion Matrix:
 Correlation: Correlation is a statistical technique that can show whether and how strongly pairs of variables are related. Pearson’s Correlation Coefficient is used to measure the strength of correlation between two variables.
 Useful links: Blog post on Correlation  Detailed Explanation of Correlation
 Useful links: Blog post by surveysystem
 Cross Validation:
 Curse of Dimensionality: As the number of features or dimensions in a model grows, the amount of data needed for the model to generalize with good performance grows exponentially, which in turn increases the storage space and processing time required by a modeling algorithm. In this sense, the value added by an additional dimension becomes much smaller than the overhead it adds to the algorithm.
 Also see: Dimensionality Reduction
 Useful links: Video explanation by Trevor Hastie  Elaborate post on Medium
D
 Decision Tree: A Decision Tree can be used to visually and explicitly represent decisions and decision making. Each non-leaf node in the tree represents a decision based on one of the features in the dataset. Leaves of the tree represent the final output after a series of decisions: for classification, the output is the class chosen by a majority vote of the node members, and for regression, the output is the average value of the node members. The feature used to make a decision at each step is typically chosen such that the information gain is maximized.
 Also see: Boosting  Random Forest
 Useful links: Video Lecture by Patrick Winston  Blog post on towardsdatascience
 Dimensionality Reduction: The goal of dimensionality reduction methods is to find a low-dimensional representation of the data that retains as much information as possible. This low-dimensional data representation in turn helps in fighting the Curse of Dimensionality.
 Also see: Principal Component Analysis
 Useful links: Video Explanation by Robert Tibshirani  Blog post on towardsdatascience
 Discriminative Classifiers:
E
 Elastic Net Regression:
 Entropy:
 Error Analysis:
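 Expectation Maximization: The Expectation-Maximization (EM) algorithm is a way to find maximum likelihood estimates for model parameters when the data is incomplete, has missing data points, or has unobserved (hidden) latent variables. It is an iterative approach that alternates between estimating the hidden variables given the current parameters (E-step) and re-estimating the parameters given those estimates (M-step), converging towards a local maximum of the likelihood function.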
 Useful links: Introductory blog post by me  Advanced blog post by me
F
 False Positive Rate: The false positive rate is calculated as the ratio between the number of negative events wrongly categorized as positive (false positives) and the total number of actual negative events (regardless of classification).
 Useful links: False Positive Rate Wiki
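The ratio above can be computed directly from true and predicted labels. A minimal sketch, assuming labels are encoded as 1 (positive) and 0 (negative); the function name is illustrative.

```python
def false_positive_rate(y_true, y_pred):
    """FP / (FP + TN): share of actual negatives that were predicted positive."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    if fp + tn == 0:
        raise ValueError("no actual negatives in y_true")
    return fp / (fp + tn)


# Four actual negatives, one of which is wrongly flagged as positive.
example_fpr = false_positive_rate([0, 0, 0, 0, 1, 1], [1, 0, 0, 0, 1, 0])
```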
 Feature Selection:
G
 Generative Classifiers:
 Gradient Descent: Gradient Descent is an optimization technique that minimizes a loss function by computing the gradients of the loss, evaluated on the training data, with respect to the model's parameters. Informally, gradient descent iteratively adjusts the parameters in the direction opposite to the gradient, gradually finding a combination that minimizes the loss.
 Useful links: Blog post on towardsdatascience  Blog post on kdnuggets
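The update rule can be sketched for a single parameter. This is a toy example under stated assumptions: a fixed learning rate, a fixed number of steps, and the hypothetical loss $f(x) = (x - 3)^2$ chosen only because its minimum is known.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient of the loss."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x


# Loss f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
```

With these settings the distance to the minimum shrinks by a factor of 0.8 per step, so `minimum` ends up very close to 3. A learning rate that is too large would instead make the iterates diverge.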
 Grid Search:
H
 Hinge Loss: Hinge loss is used in the context of classification problems and is defined as $l(y) = \max(0, 1 - t \cdot y)$, where $t \in \{-1, +1\}$ is the actual output and $y$ is the classifier's score. Observing the function, we can see that the classifier is penalized unless it classifies a data point correctly with a margin of at least 1; correct predictions with a smaller margin still incur some loss. This leads to "maximum-margin" classification, where each training data point is as far from the classifier's decision boundary as possible.
 Also see: Support Vector Machines
 Useful links: Hinge Loss Wiki
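A short sketch of the hinge loss for a single example, assuming the true label `t` is encoded as -1 or +1 and `y` is the raw classifier score (not a probability).

```python
def hinge_loss(t, y):
    """max(0, 1 - t * y) for a true label t in {-1, +1} and a raw score y."""
    return max(0.0, 1.0 - t * y)


confident_correct = hinge_loss(+1, 2.0)   # margin >= 1, so no penalty
weakly_correct = hinge_loss(+1, 0.5)      # right side, but inside the margin
wrong = hinge_loss(-1, 0.5)               # wrong side of the boundary
```

The three calls give losses of 0.0, 0.5 and 1.5 respectively, showing that the loss grows linearly once the margin falls below 1.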
I
 Information Gain: See Kullback–Leibler Divergence.
J
 Jaccard Similarity: Jaccard Similarity is a statistic used for comparing the similarity and diversity of finite sample sets. It is defined as the size of the intersection divided by the size of the union of the sample sets $\left(J(A, B) = \frac{|A \cap B|}{|A \cup B|}\right)$.
 Also see: Correlation
 Useful links: Jaccard Similarity Wiki  Explanation with examples
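With Python sets the definition translates almost literally. Returning 1.0 for two empty sets is a convention assumed here, since the ratio is otherwise undefined.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention for two empty sets
    return len(a & b) / len(a | b)


# Two shared items out of four distinct items overall.
example_similarity = jaccard_similarity({1, 2, 3}, {2, 3, 4})
```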
K
 K-Nearest Neighbor: KNN is essentially a classification technique that finds the $K$ data points in the training data which are most similar to an unseen data point, and takes a majority vote among them to make a classification. KNN is a non-parametric method, which means that it does not make any assumptions about the underlying data distribution. The performance of KNN methods depends on the data representation and the definition of closeness/similarity.
 Useful links: Video explanation by Trevor Hastie  Blog post on Medium
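A minimal brute-force sketch of KNN classification. Euclidean distance is an assumption made for this example (the definition of closeness is a modeling choice), and `knn_predict` is a hypothetical helper name.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest neighbours."""
    # Pair each training label with its Euclidean distance to the query.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
prediction = knn_predict(points, labels, query=(0.5, 0.5), k=3)
```

The query sits next to the first cluster, so `prediction` is `"a"`. Sorting all distances is fine for small data; larger datasets usually call for a spatial index.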
 Kullback–Leibler Divergence: Kullback–Leibler divergence is a measure of how one probability distribution differs from a second, reference probability distribution. A familiar use case arises when we replace observed data or a complex distribution with a simpler approximating distribution: KL Divergence measures how much information we lose by choosing that approximation.
 Useful links: Blog post on countbayesie  Blog post on towardsdatascience
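For discrete distributions the divergence is $D_{KL}(P \| Q) = \sum_i p_i \log \frac{p_i}{q_i}$. The sketch below assumes `p` and `q` are probability lists of equal length, that `q` is non-zero wherever `p` is, and uses the natural logarithm (so the result is in nats).

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) for two discrete distributions given as probability lists."""
    # Terms with p_i == 0 contribute nothing and are skipped.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.5]
q = [0.9, 0.1]
forward = kl_divergence(p, q)
reverse = kl_divergence(q, p)
```

`forward` and `reverse` differ, which illustrates that KL divergence is not symmetric and therefore not a true distance metric.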
L
 Lasso Regression:
 Learning Curve:
 Linear Discriminant Analysis:
 Linear Regression: Linear regression models a linear relationship between a scalar dependent variable (usually called the target) and one or more independent variables (usually called predictors). It can be used for forecasting outcomes once the model parameters are learned using supervision from a relevant dataset. Additionally, the learned model parameters can be used to explain the strength of the relationship between the target and the predictors (a procedure known as linear regression analysis). The model parameters are usually learned by minimizing the mean squared error.
 Useful links: Video playlist from Stanford  Blog post on towardsdatascience
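For a single predictor, the parameters that minimize mean squared error have a closed form. A minimal sketch restricted to that one-predictor case; the function name is illustrative.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares slope and intercept for y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the ratio of the x-y co-variation to the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Data generated from y = 2x + 1 without noise.
slope, intercept = fit_simple_linear_regression([0, 1, 2, 3], [1, 3, 5, 7])
```

On this noise-free data the fit recovers a slope of 2 and an intercept of 1.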
 Logistic Regression: Logistic regression models the probability of a certain binary outcome given some predictor variables which influence the outcome. It applies a linear function to the predictor variables, as linear regression does, but then transforms the result into a probability using the logistic function $\left( \sigma(z) = \frac{1}{1 + e^{-z}} \right)$. The model parameters are usually learned by maximizing the likelihood of the observed data.
 Also see: Maximum Likelihood Estimation
 Useful links: Video explanation by Trevor Hastie  Blog post on towardsdatascience
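The prediction step of logistic regression can be sketched as a linear score followed by the logistic function. The weights and bias below are made-up values for illustration, not learned parameters.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)), mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(weights, bias, features):
    """Probability of the positive outcome for one example."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical parameters: the two features cancel out, giving a score of 0.
probability = predict_probability([1.0, -1.0], 0.0, [2.0, 2.0])
```

A linear score of 0 corresponds to a probability of exactly 0.5, which is the usual decision boundary.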
M
 Maximum Likelihood Estimation: Maximum likelihood estimation is a method of estimating the parameters $\theta$ of a statistical model such that the likelihood function $L(\theta; x)$, which is a function of the model parameters given observed data $x$, is maximized. Intuitively, this selects the parameters $\theta$ that make the observed data most probable.
 Useful links: Video explanation by Trevor Hastie  Blog post on towardsdatascience
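As a small worked example, consider coin flips modeled as Bernoulli trials with unknown heads probability $\theta$. In this case the maximum likelihood estimate is known to be the fraction of heads; the sketch below compares the log-likelihood at that estimate against other candidate values. The coin-flip setup and function names are assumptions chosen for illustration.

```python
import math


def bernoulli_log_likelihood(theta, flips):
    """Log-likelihood of 0/1 outcomes under heads probability theta (0 < theta < 1)."""
    return sum(math.log(theta) if f == 1 else math.log(1 - theta) for f in flips)


def bernoulli_mle(flips):
    """Closed-form MLE for a Bernoulli parameter: the fraction of ones."""
    return sum(flips) / len(flips)


flips = [1, 1, 1, 0]
theta_hat = bernoulli_mle(flips)
```

Here `theta_hat` is 0.75, and the log-likelihood at 0.75 is higher than at nearby candidates such as 0.5 or 0.9.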
 Model Selection:
N
 Naive Bayes Classifier: Naive Bayes is a generative classification technique based on Bayes’ Theorem. It assumes that, given the class, the presence of a particular feature is unrelated to the presence of any other feature, so that all features contribute independently towards the class probability.
 Useful links: Video Explanation by Trevor Hastie  Blog post on towardsdatascience
 Neural Network:
O
 Ordinal Classification: Same as Ordinal Regression.
 Ordinal Regression: Ordinal Regression is used for predicting ordinal outcomes, i.e. outcomes whose values lie on an arbitrary scale where only the relative ordering between different values is significant, based on various predictor variables. For this reason, it is considered an intermediate problem between regression and classification. Usually an ordinal regression problem is reduced to multiple binary classification problems with the help of threshold parameters, such that a classifier score falling between two consecutive thresholds corresponds to one of the ordinal outcomes.
P
 Pearson’s Correlation Coefficient: The correlation coefficient ($\rho$) ranges from -1 to +1. The closer $\rho$ is to +1 or -1, the more closely the two variables are linearly related; if it is close to 0, the variables have no linear relationship with each other. It is defined as $\rho_{X, Y} = \frac{\operatorname{Cov}(X, Y)}{\sigma_{X} \sigma_{Y}}$.
 Useful links: Pearson Correlation Wiki
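A minimal sketch of the sample version of this coefficient, assuming two equal-length lists of numbers with non-zero variance. The normalizing factors of the covariance and standard deviations cancel, so they are omitted.

```python
import math


def pearson_correlation(xs, ys):
    """Covariance of xs and ys divided by the product of their standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


increasing = pearson_correlation([1, 2, 3], [2, 4, 6])
decreasing = pearson_correlation([1, 2, 3], [6, 4, 2])
```

A perfectly increasing linear relationship gives a value of +1 and a perfectly decreasing one gives -1, up to floating-point rounding.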
 Precision: If we are given a set of instances, precision is the fraction of relevant instances (those correctly classified into a certain class $C$) among the retrieved instances (all those classified into class $C$, correctly or not). A perfect precision score of 1.0 means that every result retrieved by a search was relevant, but says nothing about whether all relevant documents were retrieved.
 Also see: Recall
 Useful links: Blog post on towardsdatascience  Precision and Recall Wiki
 Principal Component Analysis: PCA is a statistical procedure that transforms a set of observations of possibly correlated variables into a set of observations of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance, and each succeeding component in turn has the largest possible variance under the constraint that it is orthogonal to the preceding components. Utilizing only the few components that capture most of the variance in the data helps in fighting the Curse of Dimensionality.
 Also see: Linear Discriminant Analysis
 Useful links: Video Explanation by Stanford Profs  Online Lesson by Penn State University
 Pruning:
Q
R
 $R^2$:
 Random Forest: Random Forest is a supervised learning algorithm that builds an ensemble of Decision Trees, where each decision tree is allowed to use only a fixed number of randomly chosen features (typically re-drawn at each split). The decision trees are trained using the Bagging technique, and the outputs of the trees are merged together to get a more accurate and stable prediction.
 Also see: Boosting
 Useful links: Blog post on towardsdatascience  Blog post on Medium
 Recall: If we are given a set of instances, recall is the fraction of relevant instances (belonging to a certain class $C$) that have been retrieved (or correctly classified in $C$) over the total number of relevant instances. A recall of 1.0 means that every item from class $C$ was labeled as belonging to class $C$, but does not say anything about other items that were incorrectly labeled as belonging to class $C$.
 Also see: Precision
 Useful links: Blog post on towardsdatascience  Precision and Recall Wiki
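Precision and recall for a class $C$ can be computed side by side from true and predicted labels. A minimal sketch; the `positive` argument stands for class $C$, and returning 0.0 when a denominator is zero is a convention assumed here.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0  # correct among predicted C
    recall = tp / (tp + fn) if tp + fn else 0.0     # found among actual C
    return precision, recall


precision, recall = precision_recall([1, 1, 1, 1, 0, 0], [1, 1, 0, 0, 1, 0])
```

In this example two of the three predicted positives are correct (precision of 2/3), while only two of the four actual positives are found (recall of 0.5).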
 Regression: Regression is the problem of approximating a mapping function ($f$) from input variables ($X$) to a continuous output variable ($y$), on the basis of a training set of data containing observations in the form of input-output pairs.
 Also see: Linear Regression
 Useful links: Video Explanation by Trevor Hastie
 Relative Entropy: See Kullback–Leibler Divergence.
 Ridge Regression:
S
 Sensitivity: Same as Recall.
 Specificity: If we are given a set of instances, specificity measures the proportion of actual negatives (instances not belonging to a particular class) that are correctly identified as such (e.g., the percentage of healthy people who are correctly identified as not having the condition).
 Useful links: Specificity Wiki.
 Standard Score: Same as Z-score.
 Standard Error:
 Stratified Cross Validation:
 Supervised Learning: Supervised learning is the task of learning a function that can map an unseen input to an output as accurately as possible, based on example input-output pairs known as training data.
 Also see: Classification  Regression
 Useful links: Coursera Video Explanation  Supervised Learning Wiki
 Support Vector Machine: A Support Vector Machine (SVM), in simplest terms, is a classification algorithm which aims to find a decision boundary that separates two classes such that the closest data points from either class are as far from the boundary as possible. Having a good margin between the two classes contributes to the robustness and generalizability of the SVM.
 Also see: Hinge Loss
 Useful links: Blog post by Me  Video Lecture by Patrick Winston
T
 T-Test: The t-test is a type of inferential statistic used to determine whether there is a significant difference between the means of two groups. The t-test assumes that the two groups follow a normal distribution and calculates the t-value (an extension of the z-score), which maps to a probability value (p-value). The p-value is the probability of observing a difference at least as large as the one measured if the two group means were actually equal; if it falls below a certain agreed-upon threshold, the t-test concludes that the means are significantly different.
 Useful links: Blog post by University of Connecticut  Description on investopedia
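A sketch of computing the t-value for two independent groups. It uses Welch's variant, which does not assume equal variances; that choice is an assumption of this example, since the entry above does not name a specific variant. Converting the t-value to a p-value needs the t-distribution and is left out here.

```python
import math
import statistics


def welch_t_statistic(group_a, group_b):
    """Difference of the group means divided by its estimated standard error."""
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    # statistics.variance is the sample variance (divides by n - 1).
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    standard_error = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean_a - mean_b) / standard_error


t_value = welch_t_statistic([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
```

The two groups have means 3 and 4 and a standard error of 1, so `t_value` is -1.0.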
 True Positive Rate: Same as Recall.
 True Negative Rate: Same as Specificity.
U
 Unsupervised Learning: Unsupervised learning is the task of inferring patterns from data without having any reference to known, or labeled, outcomes. It is generally used for discovering the underlying structure of the data.
 Also see: Principal Component Analysis
 Useful links: Blog post by Hackernoon  Coursera Video Explanation
V
W
X
Y
Z
 Z-score: Z-score is a measure of how many standard deviations below or above the population mean a raw score is, thus giving us a good picture when we want to compare results from a test to a "normal" population.
 Also see: T-Test
 Useful links: Z-score Wiki  Khan Academy tutorial on Z-score
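The standard score is simply $z = \frac{x - \mu}{\sigma}$, where $\mu$ and $\sigma$ are the population mean and standard deviation. A minimal sketch with made-up test-score numbers:

```python
def z_score(x, population_mean, population_std):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - population_mean) / population_std


# Hypothetical test: mean 70, standard deviation 10, raw score 85.
example_z = z_score(85, 70, 10)
```

A raw score of 85 is 1.5 standard deviations above the mean, so `example_z` is 1.5.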