Machine Learning Glossary

Introduction

The goal of this post is to briefly explain popular (and unpopular) concepts in Machine Learning. The idea stemmed from my own struggles to find good-quality explanations of various Machine Learning concepts on the web. Unlike similar posts on the web, here you’ll also find links to good-quality resources and to related concepts for a more holistic understanding. Hopefully, this post will be helpful to people who are just starting out in Machine Learning as well as to those who need a quick refresher on some concepts.

Didn’t find what you were looking for? Consider contributing by creating a pull request on this post here.

A

A/B Testing: A/B testing, also known as split testing, is a statistical method used to compare two versions of a variable (A and B) to determine which one performs better in a controlled environment. It is widely used in marketing, web design, and product development to test changes to a web page or product against the current design and determine which one produces better results. The goal is to identify the impact of a change and make data-driven decisions.
- Useful link: A/B Testing: A Step-by-Step Guide
Accuracy: Accuracy is a metric used to evaluate the performance of a classification model. It is defined as the ratio of the number of correct predictions to the total number of predictions. While accuracy is a straightforward and intuitive measure, it can be misleading in the case of imbalanced datasets, where the majority class dominates the prediction results.
- Useful link: Understanding Accuracy in Machine Learning
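
To make the caveat about imbalanced datasets concrete, here is a minimal sketch. The labels and the always-negative "model" are made up purely for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical imbalanced dataset: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5
# A degenerate "model" that always predicts the majority class.
y_pred = [0] * 100

print(accuracy(y_true, y_pred))  # 0.95, although no positive is ever detected
```
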
Activation Function: Activation functions are mathematical functions used in neural networks to introduce non-linearity into the model. This non-linearity allows the network to learn complex patterns. Common activation functions include the sigmoid function, which maps inputs to values between 0 and 1, the hyperbolic tangent (tanh), which maps inputs to values between -1 and 1, and the Rectified Linear Unit (ReLU), which replaces negative values with zero. Activation functions are crucial in deep learning, particularly in multi-layer networks, as they enable the network to stack layers and learn intricate relationships.
- Useful link: Activation Functions in Neural Networks
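
The three activation functions named above can be sketched in a few lines of plain Python. This is an illustrative scalar version only; real frameworks apply these element-wise to tensors and use numerically safer formulations.

```python
import math

def sigmoid(x):
    """Maps any real input to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Maps any real input to a value between -1 and 1."""
    return math.tanh(x)

def relu(x):
    """Replaces negative values with zero and keeps positive values unchanged."""
    return max(0.0, x)

# Moderate inputs only: this naive sigmoid overflows for very negative x.
for x in (-2.0, 0.0, 2.0):
    print(x, round(sigmoid(x), 3), round(tanh(x), 3), relu(x))
```
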
Active Learning: Active learning is a machine learning approach where the model is allowed to interactively query a user or some other information source to obtain the desired outputs at new data points. This is particularly useful when labeled data is scarce or expensive to obtain. The model identifies the most informative data points that, when labeled, would most improve its performance.
- Useful link: Active Learning Explained
AdaBoost (Adaptive Boosting): AdaBoost is an ensemble learning technique that combines multiple weak classifiers to form a strong classifier. It works by sequentially training weak learners, typically decision trees, on weighted versions of the data. Misclassified instances in each iteration are given more weight so that subsequent classifiers focus on the harder-to-classify instances. AdaBoost aims to minimize the error rate by focusing on difficult cases and is used for both classification and regression tasks. It is known for its ability to improve the performance of models with lower complexity.
- Useful link: Understanding AdaBoost
Alignment: In machine learning, alignment refers to the process of ensuring that the model’s objectives are consistent with the user’s goals. This is crucial in AI ethics and safety, especially in reinforcement learning where the agent’s goals must align with human values and safety constraints. Proper alignment ensures that the AI system behaves in a predictable and beneficial manner.
- Useful link: AI Alignment Explained
ANOVA (Analysis of Variance): ANOVA is a statistical method used to compare the means of three or more groups to determine whether at least one of the group means is statistically different from the others. It helps in determining the influence of one or more factors by comparing the means of different samples. ANOVA is used in various fields including biology, economics, and engineering to analyze experimental data.
- Useful link: Introduction to ANOVA
Artificial Neural Network (ANN): An Artificial Neural Network (ANN) is a computational model inspired by the way biological neural networks in the human brain process information. ANNs consist of interconnected units (neurons) organized in layers: input, hidden, and output layers. Each connection has a weight that adjusts as learning proceeds, based on a training algorithm like backpropagation. ANNs are used in various tasks including image and speech recognition, natural language processing, and game playing. Their ability to model complex non-linear relationships makes them powerful tools in many AI applications.
- Useful link: Introduction to Artificial Neural Networks
Attention Mechanism: The attention mechanism is a component used in neural networks, particularly in sequence models like transformers. It allows the model to focus on different parts of the input sequence, enabling it to capture dependencies and relationships more effectively. Attention mechanisms are integral to models like BERT and GPT, which have achieved state-of-the-art results in natural language processing tasks.
- Useful link: Attention Mechanisms in Deep Learning
Autoencoder: An autoencoder is a type of neural network used for unsupervised learning. It aims to learn a compressed representation (encoding) of the input data and then reconstruct the data from this encoding. An autoencoder consists of an encoder that maps the input to a latent-space representation and a decoder that maps the latent space back to the original input. Autoencoders are used for dimensionality reduction, denoising, and feature learning. Variants like variational autoencoders (VAEs) introduce probabilistic elements to model the data distribution.
- Useful link: Autoencoders and Their Applications
AUC: AUC is the Area Under the Receiver Operating Characteristic (ROC) Curve. The ROC curve is obtained by varying the classification threshold of a binary classifier and plotting the true positive rate (TPR) against the false positive rate (FPR) at each threshold. AUC is a popular classification performance metric with several useful properties: it does not depend on any single decision threshold, and it is less sensitive to class imbalance than accuracy.
- Useful links: Video Explanation of AUC | Probabilistic interpretation of AUC
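
One way to build intuition is the probabilistic interpretation of AUC: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes AUC that way by brute force over all pairs; the function name and toy data are illustrative assumptions, and it is quadratic in the number of examples.

```python
def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(y_true, scores))  # 0.75
```
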

B

Bagging: Bagging (bootstrap aggregating) is a procedure that produces several different training sets of the same size by sampling from the original data with replacement, and then trains a machine learning model on each set. Predictions are produced by taking a majority vote in a classification task and by averaging in a regression task. Bagging helps reduce the variance of models.
- Also see: Random Forest
- Useful links: Video explanation by Udacity | Blog post on Medium
Bias-Variance Trade-off: Bias here refers to the difference between the average prediction of a model and the target value the model is trying to predict. Variance refers to the variability in the model predictions for a given data point because of its sensitivity to small fluctuations in the training set. If our model is too simple and has very few parameters, then it may have high bias and low variance. On the other hand, if our model has a large number of parameters, then it may have high variance and low bias. Thus, we need to find a good balance between bias and variance so that the model neither overfits nor underfits the data.
- Useful links: Video explanation by Trevor Hastie | Blog post on towardsdatascience
Bootstrapping: Bootstrapping is the process of drawing multiple samples from a dataset by sampling with replacement. Each sample has the same size as the original dataset, and the samples are called bootstrap samples. It is used in bagging.
- Also see: Bagging
- Useful links: Bootstrapping wiki | Blog post by machinelearningmastery
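
A minimal sketch of drawing bootstrap samples with the standard library follows. The helper name `bootstrap_sample` and the toy dataset are assumptions made for illustration. Because sampling is done with replacement, some items typically appear more than once in a sample while others are left out.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

rng = random.Random(0)          # fixed seed so the run is repeatable
data = list(range(10))          # toy dataset
samples = [bootstrap_sample(data, rng) for _ in range(3)]

for s in samples:
    print(sorted(s))
```
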
Boosting: Boosting is an ensemble method for improving the model predictions of any given learning algorithm. The idea is to train weak learners sequentially, each trying to correct its predecessor, to build strong learners. A weak learner is defined to be a classifier that is only slightly correlated with the true classification (it can label examples better than random guessing). In contrast, a strong learner is a classifier that is arbitrarily well-correlated with the true classification.
- Also see: Bagging
- Useful links: Lecture by Patrick Winston | Boosting wiki

C

Classification: Classification is the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training set of data containing observations whose category membership is known.
- Also see: Boosting | Decision Trees | K-Nearest Neighbor | Logistic Regression | Random Forest | Naive Bayes Classifier
- Useful links: Classification Wiki
Clustering: Clustering is an unsupervised learning technique used to group similar data points together based on their features. The goal is to partition the data into clusters where points within each cluster are more similar to each other than to those in other clusters. Common clustering algorithms include k-means, hierarchical clustering, and DBSCAN. Clustering is widely used in various applications such as market segmentation, image segmentation, and anomaly detection.
- Useful links: Introduction to Clustering
Confusion Matrix: A confusion matrix is a table used to evaluate the performance of a classification model. It compares the actual target values with those predicted by the model. The matrix includes four key metrics: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). From these, other metrics such as accuracy, precision, recall, and F1-score can be derived. The confusion matrix provides a detailed breakdown of the model’s performance.
- Useful links: Understanding Confusion Matrix
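
The four counts and a few of the derived metrics can be sketched as follows for the binary case. The labels are made up, and the function name `confusion_counts` is an illustrative choice rather than any library's API.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for a binary classification problem."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / (tp + fp + tn + fn)
print(tp, fp, tn, fn)               # 3 1 3 1
print(precision, recall, accuracy)  # 0.75 0.75 0.75
```
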
Convolutional Neural Network (CNN): A Convolutional Neural Network (CNN) is a deep learning algorithm commonly used for image recognition and processing tasks. CNNs are designed to automatically and adaptively learn spatial hierarchies of features through convolutional layers, pooling layers, and fully connected layers. The convolutional layers apply filters to the input image, capturing local patterns such as edges, textures, and shapes. CNNs are widely used in computer vision applications, including image classification, object detection, and segmentation.
- Useful links: Understanding Convolutional Neural Networks
Correlation: Correlation is a statistical technique that can show whether and how strongly pairs of variables are related. Pearson’s Correlation Coefficient is used to measure the strength of correlation between two variables.
- Useful links: Blog post on Correlation | Detailed Explanation of Correlation | Blog post by surveysystem
Cross Validation: Cross-validation is a statistical technique used to assess the generalization performance of a machine learning model. The most common form is k-fold cross-validation, where the data is divided into k subsets (folds). The model is trained on k-1 folds and tested on the remaining fold. This process is repeated k times, with each fold used exactly once as the test set. Cross-validation helps in selecting models and tuning hyperparameters by providing a more reliable estimate of model performance.
- Useful links: Understanding Cross-Validation
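
The k-fold procedure described above can be sketched as an index split. This is a simplified illustration with no shuffling or stratification, and the helper name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds

for train, test in k_fold_indices(10, 3):
    print("train:", train, "test:", test)
```

Each index lands in exactly one test fold, so every sample is used for testing once and for training k-1 times.
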
Curse of Dimensionality: In a model, as the number of features or dimensions grows, the amount of data needed for the model to generalize well grows exponentially, which in turn increases the storage space and processing time required by the modeling algorithm. In this sense, the value added by an additional dimension becomes much smaller than the overhead it adds to the algorithm.
- Also see: Dimensionality Reduction
- Useful links: Video explanation by Trevor Hastie | Elaborate post on Medium

D

Decision Tree: A Decision Tree can be used to visually and explicitly represent decisions and decision making. Each non-leaf node in the tree represents a decision based on one of the features in the dataset. Leaves of the trees represent the final output after a series of decisions; for classification, the output is class membership based on a majority vote from node members and for regression, the output is the average value of node members. The feature used to make a decision at each step is chosen such that the information gain is maximized.
- Also see: Boosting | Random Forest
- Useful links: Video Lecture by Patrick Winston | Blog post on towardsdatascience
Dimensionality Reduction: The goal of dimensionality reduction methods is to find a low-dimensional representation of the data that retains as much information as possible. This low-dimensional data representation in turn helps in fighting the Curse of Dimensionality.
- Also see: Principal Component Analysis
- Useful links: Video Explanation by Robert Tibshirani | Blog post on towardsdatascience
Discriminative Classifiers:

E

Elastic Net Regression: Elastic Net Regression is a regularized regression technique that linearly combines the penalties of the L1 (Lasso) and L2 (Ridge) regularization methods. It is particularly useful when there are multiple correlated features, as it can select groups of correlated variables. The elastic net penalty encourages sparsity (like Lasso) and also includes a ridge regression penalty to maintain stability. This approach balances between the feature selection property of Lasso and the regularization strength of Ridge Regression.
- Useful links: Elastic Net Regression Explained
Entropy: In the context of machine learning, entropy is a measure of the uncertainty or impurity in a dataset. It quantifies the amount of disorder or randomness. In decision trees, entropy is used to decide the best split at each node by measuring the impurity before and after the split. Lower entropy indicates higher purity, meaning the data is more homogeneous. Entropy is a fundamental concept in information theory and is used to build efficient models by minimizing uncertainty.
- Useful links: Understanding Entropy in Machine Learning
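
As a rough illustration, Shannon entropy in bits is $H = \sum_i p_i \log_2 \frac{1}{p_i}$, where $p_i$ is the fraction of samples in class $i$. The sketch below computes it for a list of labels and uses it to score a candidate split; the function names and toy labels are assumptions made for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of the class distribution in labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["a", "a", "b", "b"]
print(entropy(parent))                                       # 1.0 (maximally mixed)
print(entropy(["a", "a", "a", "a"]))                         # 0.0 (pure)
print(information_gain(parent, [["a", "a"], ["b", "b"]]))    # 1.0 (perfect split)
```
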
Ensemble Learning: Ensemble learning is a technique that combines multiple individual models (often called base learners or weak learners) to create a stronger overall model. The primary goal is to improve the predictive performance by reducing variance, bias, or improving accuracy. Popular ensemble methods include bagging (e.g., Random Forest), boosting (e.g., AdaBoost, Gradient Boosting), and stacking. Ensemble learning is widely used in both classification and regression tasks due to its ability to enhance model robustness and accuracy.
- Useful links: Ensemble Learning Methods
Empirical Risk Minimization (ERM): Empirical Risk Minimization (ERM) is a principle in statistical learning theory where the goal is to minimize the empirical risk, i.e., the average loss over the training sample. The empirical risk is calculated based on a chosen loss function that quantifies the error between predicted and actual values. ERM is fundamental to the training process of machine learning models, guiding the optimization of model parameters to fit the training data.
- Useful links: [Empirical Risk Minimization Explained](https://mlweb.loria.fr/book/en/erm.html)
Epoch: An epoch in machine learning refers to one complete pass of the training dataset through the algorithm during the training process. Training a model typically involves multiple epochs, as each pass helps the model learn the underlying patterns in the data. Within each epoch, the model’s parameters are updated to reduce the loss function, typically once per mini-batch in gradient-based training. The number of epochs is a hyperparameter that needs to be tuned to ensure the model neither underfits nor overfits the training data.
- Useful links: Understanding Epochs in Machine Learning
Exploratory Data Analysis (EDA): Exploratory Data Analysis (EDA) is an approach to analyzing datasets to summarize their main characteristics, often using visual methods. EDA is a crucial step in the data preprocessing phase, helping to understand the data’s structure, detect outliers, find patterns, and test hypotheses. Techniques used in EDA include summary statistics, data visualization (e.g., histograms, box plots), and correlation analysis. EDA provides insights that guide further data cleaning and feature engineering processes.
- Useful links: What is Exploratory Data Analysis?
Exponential Smoothing: Exponential smoothing is a time series forecasting method for univariate data. It involves using weighted averages of past observations, where the weights decrease exponentially as the observations get older. The simplest form, single exponential smoothing, is used for data without a trend or seasonality. More complex forms, like double and triple exponential smoothing (Holt-Winters), can capture trends and seasonality. Exponential smoothing is widely used in forecasting applications due to its simplicity and effectiveness.
- Useful links: Introduction to Exponential Smoothing
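
Single exponential smoothing can be sketched with the recurrence $s_t = \alpha x_t + (1 - \alpha) s_{t-1}$. Initializing with the first observation is one common choice and is an assumption here, as are the function name and the toy series.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed series; alpha in (0, 1] weights recent observations."""
    smoothed = [series[0]]  # assumption: start from the first observation
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

series = [10.0, 12.0, 11.0, 15.0]
print(simple_exponential_smoothing(series, alpha=0.5))  # [10.0, 11.0, 11.0, 13.0]
```

A larger `alpha` tracks recent observations more closely, while a smaller one produces a smoother curve.
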
Early Stopping: Early stopping is a regularization technique used to prevent overfitting in machine learning models, particularly in deep learning. It involves monitoring the model’s performance on a validation set during training and stopping the training process once the performance stops improving. This helps in finding the optimal number of epochs for training, ensuring that the model generalizes well to unseen data without overfitting the training data.
- Useful links: Early Stopping in Machine Learning
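
The stopping rule can be sketched independently of any training framework. Here a hypothetical list of per-epoch validation losses stands in for a real training loop, and `patience` (the number of non-improving epochs tolerated) is a common but assumed parameterization.

```python
def early_stopping(val_losses, patience):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = None
    stop_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        stop_epoch = epoch
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss has stopped improving
    return best_epoch, stop_epoch

# Made-up validation losses: improvement stalls after epoch 2.
val_losses = [0.9, 0.7, 0.6, 0.65, 0.62, 0.7, 0.5]
print(early_stopping(val_losses, patience=2))  # (2, 4)
```

In practice, one would usually restore the model weights saved at `best_epoch`.
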
Embedding: Embedding refers to the representation of categorical data or discrete items in a continuous vector space. This technique is commonly used in natural language processing (NLP) to convert words into dense vectors of real numbers, capturing semantic meaning. Word embeddings like Word2Vec, GloVe, and FastText are examples. Embeddings are also used in recommendation systems, where items (e.g., movies, products) are represented in a latent space to capture similarities and relationships.
- Useful links: Understanding Word Embeddings
Evaluation Metrics: Evaluation metrics are measures used to assess the performance of a machine learning model. Common metrics for classification tasks include accuracy, precision, recall, F1-score, and area under the ROC curve (AUC-ROC). For regression tasks, metrics like mean squared error (MSE), mean absolute error (MAE), and R-squared are used. Choosing the right evaluation metric is crucial as it directly impacts model selection, hyperparameter tuning, and overall performance assessment.
- Useful links: Evaluation Metrics for Machine Learning
Evolutionary Algorithms: Evolutionary Algorithms (EAs) are a subset of evolutionary computation, inspired by natural selection and genetics. They are used to solve optimization problems by iteratively improving a set of candidate solutions. Key components include selection, mutation, crossover, and reproduction. Common types of EAs include Genetic Algorithms, Genetic Programming, and Evolutionary Strategies. EAs are particularly useful for solving complex problems where traditional optimization methods are not effective.
- Useful links: Introduction to Evolutionary Algorithms
Expectation-Maximization (EM) Algorithm: The Expectation-Maximization (EM) algorithm is an iterative method used for finding maximum likelihood estimates of parameters in statistical models, especially when the data involves latent variables. It consists of two steps: the Expectation step (E-step), which estimates the missing data given the current parameters, and the Maximization step (M-step), which maximizes the likelihood function based on the estimated data. EM is commonly used in clustering (e.g., Gaussian Mixture Models) and missing data imputation.
- Useful links: Introductory blog post by me | Advanced blog post by me

F

Feature Engineering: Feature engineering is the process of using domain knowledge to create new features or modify existing ones to improve the performance of a machine learning model. This process includes techniques such as normalization, encoding categorical variables, creating interaction terms, and extracting features from date-time data. Effective feature engineering can significantly enhance a model’s predictive power and is often more impactful than choosing complex algorithms.
- Useful links: Feature Engineering for Machine Learning
Feature Selection: Feature selection is the process of selecting a subset of relevant features for building a robust machine learning model. The main goal is to improve the model’s performance by reducing overfitting, enhancing generalization, and decreasing training time. Common techniques include filter methods (e.g., mutual information), wrapper methods (e.g., recursive feature elimination), and embedded methods (e.g., Lasso regression). Feature selection helps in identifying the most significant features that contribute to the predictive power of the model.
- Useful links: An Introduction to Feature Selection
Forward Propagation: Forward propagation is the process of moving input data through the layers of a neural network to generate an output. During this process, each neuron computes a weighted sum of its inputs, applies an activation function, and passes the result to the next layer. Forward propagation is a crucial step in both training and prediction phases of neural networks, as it computes the predicted output for a given input.
- Useful links: Understanding Forward Propagation
F1 Score: The F1 Score is a metric used to evaluate the performance of a classification model. It is the harmonic mean of precision and recall, providing a single measure that balances both concerns. The F1 Score is particularly useful when dealing with imbalanced datasets, where focusing solely on accuracy can be misleading. A high F1 Score indicates that the model has both high precision (few false positives) and high recall (few false negatives).
- Useful links: F1 Score Explained
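
In formula form, $F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$. The short sketch below shows how the harmonic mean punishes an imbalance between the two; returning 0 when both are 0 is a convention assumed here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both are zero
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.5, 0.5))            # 0.5
print(round(f1_score(0.9, 0.1), 2))  # 0.18, far below the arithmetic mean of 0.5
```
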
Feature Scaling: Feature scaling is a technique used to normalize the range of independent variables or features of data. In machine learning, this is important for algorithms that compute distances between data points, like k-nearest neighbors (KNN) and support vector machines (SVM). Common methods of feature scaling include Min-Max scaling and Standardization (Z-score normalization). Scaling ensures that each feature contributes equally to the model’s performance.
- Useful links: Feature Scaling Techniques
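
The two scaling methods mentioned above can be sketched for a single feature as follows. The function names are illustrative, and the sketch assumes the feature is not constant (otherwise the denominators would be zero).

```python
import statistics

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Z-score normalization: subtract the mean, divide by the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

feature = [2.0, 4.0, 6.0, 8.0]
print(min_max_scale(feature))
print(standardize(feature))
```
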
Factorization Machines: Factorization Machines are a type of model that generalizes matrix factorization and linear regression, designed to capture interactions between features in high-dimensional sparse datasets. They are particularly effective for recommendation systems and predictive modeling tasks where interactions between variables are significant. Factorization Machines model pairwise interactions among features, providing a powerful way to handle large and sparse data.
- Useful links: Factorization Machines Explained
False Positive Rate (FPR): The False Positive Rate (FPR) is a metric used to measure the proportion of negative instances that are incorrectly classified as positive by a machine learning model. It is calculated as the ratio of false positives (FP) to the sum of false positives and true negatives (TN). FPR is crucial in scenarios where the cost of false positives is high, such as fraud detection or medical diagnosis.
- Useful links: Understanding False Positive Rate
Forward Selection: Forward Selection is a feature selection technique used to iteratively add features to a model based on their statistical significance. Starting with no features, the algorithm adds one feature at a time, selecting the one that improves the model the most at each step. This process continues until adding more features does not significantly improve the model. Forward Selection helps in identifying the most important features and building a parsimonious model.
- Useful links: Forward Selection Explained
Fisher’s Linear Discriminant: Fisher’s Linear Discriminant is a linear classification technique used to find a linear combination of features that best separates two or more classes of objects. It maximizes the ratio of the variance between the classes to the variance within the classes, providing a projection that enhances class separability. Fisher’s Linear Discriminant is widely used in pattern recognition and machine learning for its simplicity and effectiveness.
- Useful links: Fisher’s Linear Discriminant Explained
Fine-Tuning: Fine-tuning is a transfer learning technique where a pre-trained model is further trained on a new, typically smaller, dataset. This process involves adjusting the weights of the pre-trained model slightly to better fit the new data while leveraging the knowledge the model has already learned. Fine-tuning is commonly used in natural language processing (NLP) and computer vision tasks to adapt pre-trained models to specific applications.
- Useful links: Fine-Tuning in Deep Learning

G

Generative Classifiers:
Gradient Descent: Gradient Descent is an optimization technique that minimizes a loss function by computing the gradients of the loss with respect to the model’s parameters on the training data and moving the parameters a small step in the opposite direction of the gradient. Informally, gradient descent iteratively adjusts the parameters, gradually finding the combination that minimizes the loss.
- Useful links: Blog post on towardsdatascience | Blog post on kdnuggets
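
As a minimal sketch, the loop below fits a line $y = wx + b$ by gradient descent on the mean squared error. The learning rate, step count, and toy data are assumptions chosen so that this small example converges; they are not general recommendations.

```python
def gradient_descent(xs, ys, lr=0.1, steps=1000):
    """Fit y = w * x + b by minimizing mean squared error with gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of (1/n) * sum((w*x + b - y)^2) with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
w, b = gradient_descent(xs, ys)
print(round(w, 4), round(b, 4))  # 2.0 1.0
```
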
Grid Search:

H

Hinge Loss: Hinge loss is used in the context of classification problems and is defined as $\ell(y) = \max(0, 1 - t \cdot y)$, where $t \in \{-1, +1\}$ is the actual label and $y$ is the classifier’s score. Observing the function, we can see that the classifier is penalized unless it classifies a data point correctly with a margin of at least 1 (that is, $t \cdot y \geq 1$). This leads to “maximum-margin” classification, where training data points are pushed as far from the classifier’s decision boundary as possible.
- Also see: Support Vector Machines
- Useful links: Hinge Loss Wiki
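
A small sketch of the formula for a single example, with the label encoded as -1 or +1. The scores are made-up values chosen to show the three regimes.

```python
def hinge_loss(t, y):
    """Hinge loss for true label t in {-1, +1} and classifier score y."""
    return max(0.0, 1.0 - t * y)

print(hinge_loss(+1, 2.0))  # 0.0: correct, and outside the margin
print(hinge_loss(+1, 0.5))  # 0.5: correct, but inside the margin
print(hinge_loss(-1, 0.5))  # 1.5: wrong side of the decision boundary
```
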

I

Information Gain: See Kullback–Leibler Divergence.

J

Jaccard Similarity: Jaccard Similarity is a statistic used for comparing the similarity and diversity of finite sample sets. It is defined as the size of the intersection divided by the size of the union of the sample sets $\left(J(A, B) = \frac{|A \cap B|}{|A \cup B|}\right)$.
- Also see: Correlation
- Useful links: Jaccard Similarity Wiki | Explanation with examples
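
The definition translates directly to Python sets. Treating two empty sets as having similarity 1.0 is a convention assumed in this sketch, since the formula itself is undefined in that case.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumed convention for two empty sets
    return len(a & b) / len(union)

print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))  # 0.5
```
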

K

K-Nearest Neighbor: KNN is essentially a classification technique that finds the ($K$) data points in the training data which are most similar to an unseen data point, and takes a majority vote to make classifications. KNN is a non-parametric method which means that it does not make any assumptions on the underlying data distribution. Performance of KNN methods depends on the data representation and the definition of closeness/similarity.
- Useful links: Video explanation by Trevor Hastie | Blog post on Medium
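
A brute-force sketch of KNN classification follows, using Euclidean distance as the notion of closeness. The distance choice, function name, and toy points are all assumptions for illustration; `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Classify query by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

points = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5), (5, 6)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (0.5, 0.5), k=3))  # a
print(knn_predict(points, labels, (5.5, 5.5), k=3))  # b
```
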
Kullback–Leibler Divergence: Kullback–Leibler divergence is a measure of how one probability distribution differs from a second, reference probability distribution. A familiar use case arises when we replace observed data or a complex distribution with a simpler approximating distribution: KL divergence measures how much information is lost by using the approximation.
- Useful links: Blog post on countbayesie | Blog post on towardsdatascience
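
For discrete distributions, the divergence is $D_{KL}(P \| Q) = \sum_i p_i \log \frac{p_i}{q_i}$. The sketch below assumes both inputs are lists of probabilities over the same outcomes and that $q_i > 0$ wherever $p_i > 0$. It also shows that the measure is not symmetric.

```python
import math

def kl_divergence(p, q):
    """KL divergence D(P || Q) in nats for two discrete distributions."""
    # Terms with p_i == 0 contribute nothing, by the usual convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, p))  # 0.0: identical distributions
print(kl_divergence(p, q), kl_divergence(q, p))  # two different positive values
```
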

L

Lasso Regression:
Learning Curve:
Linear Discriminant Analysis:
Linear Regression: Linear regression models a linear relationship between a scalar dependent variable (usually called the target) and one or more independent variables (usually called predictors). It can be used for forecasting outcomes once the model parameters are learned using supervision from a relevant dataset. Additionally, the learned model parameters can be used to explain the strength of the relationship between the target and the predictors (a procedure known as linear regression analysis). The model parameters are usually learned by minimizing the mean squared error.
- Useful links: Video playlist from Stanford | Blog post on towardsdatascience
Logistic Regression: Logistic regression models the probability of a certain binary outcome given some predictor variables which influence the outcome. It uses a linear function on predictor variables like linear regression but then transforms it into a probability using the logistic function $\left( \sigma(z) = \frac{1}{1 + e^{-z}} \right)$. The model parameters are usually learned by maximizing likelihood of observed data.
- Also see: Maximum Likelihood Estimation
- Useful links: Video explanation by Trevor Hastie | Blog post on towardsdatascience

M

Maximum Likelihood Estimation: Maximum likelihood estimation is a method of estimating the parameters $\theta$ of a statistical model such that the likelihood function $L(\theta; x)$, which is a function of the model parameters given observed data $x$, is maximized. Intuitively, this selects the parameters $\theta$ that make the observed data most probable.
- Useful links: Video explanation by Trevor Hastie | Blog post on towardsdatascience
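
As a toy illustration, consider estimating the bias $\theta$ of a coin from observed flips. The sketch below evaluates the Bernoulli log-likelihood on a grid of candidate values and picks the best one; the grid search and the data are assumptions for the example, and the result agrees with the known closed-form answer for this model (the sample mean).

```python
import math

def bernoulli_log_likelihood(theta, data):
    """Log-likelihood of i.i.d. 0/1 observations under success probability theta."""
    return sum(math.log(theta) if x == 1 else math.log(1 - theta) for x in data)

data = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]          # 7 successes out of 10
candidates = [i / 100 for i in range(1, 100)]   # grid over (0, 1)
theta_hat = max(candidates, key=lambda th: bernoulli_log_likelihood(th, data))

print(theta_hat)  # 0.7, the sample mean
```
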
Model Selection:

N

Naive Bayes Classifier: Naive Bayes is a generative classification technique based on Bayes’ Theorem. It assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature and they all independently contribute towards the class probability.
- Useful links: Video Explanation by Trevor Hastie | Blog post on towardsdatascience
Neural Network:

O

Ordinal Classification: Same as Ordinal Regression.
Ordinal Regression: Ordinal regression is used for predicting ordinal outcomes, i.e., outcomes whose values lie on an arbitrary scale where only the relative ordering between different values is significant, based on various predictor variables. For this reason, it is considered an intermediate problem between regression and classification. Usually, an ordinal regression problem is reduced to multiple binary classification problems with the help of threshold parameters, such that a classifier score falling between a certain pair of thresholds corresponds to one of the ordinal outcomes.
- Useful links: Ordinal Regression Wiki | Book Chapter | Post on applying Ordinal Regression to predict clothing fit

P

Pearson’s Correlation Coefficient: Pearson’s correlation coefficient ($\rho$) ranges from -1 to +1. The closer $\rho$ is to +1 or -1, the more closely the two variables are linearly related; if it is close to 0, the variables have little or no linear relationship. It is defined as $\rho_{X, Y} = \frac{\operatorname{Cov}(X, Y)}{\sigma_{X} \sigma_{Y}}$.
- Useful links: Pearson Correlation Wiki
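
The definition can be sketched for two equal-length samples as follows. The $1/n$ factors in the covariance and the standard deviations cancel, so the code works with plain sums. The sketch assumes neither sample is constant, and the data are made up.

```python
import math

def pearson_correlation(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [1.0, 2.0, 3.0, 4.0]
print(pearson_correlation(xs, [2.0, 4.0, 6.0, 8.0]))  # 1.0: perfectly increasing
print(pearson_correlation(xs, [8.0, 6.0, 4.0, 2.0]))  # -1.0: perfectly decreasing
```
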
Precision: If we are given a set of instances, precision is the fraction of relevant instances (those correctly classified into a certain class $C$) among the retrieved instances (those predicted to belong to class $C$). A perfect precision score of 1.0 means that every retrieved result was relevant, but says nothing about whether all relevant instances were retrieved.
- Also see: Recall
- Useful links: Blog post on towardsdatascience | Precision and Recall Wiki
Principal Component Analysis: PCA is a statistical procedure that transforms a set of observations of possibly correlated variables into a set of observations of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance, and each succeeding component has the largest possible variance under the constraint that it is orthogonal to the preceding components. Utilizing only a few components that capture most of the variance in the data helps in fighting the Curse of Dimensionality.
- Also see: Linear Discriminant Analysis
- Useful links: Video Explanation by Stanford Profs | Online Lesson by Penn State University
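To make the "largest possible variance" idea concrete, the sketch below finds only the first principal component. It uses power iteration on the covariance matrix, which is one simple way to do this and an assumption of this example, not the only or standard route (libraries typically use an SVD). The function name and data are made up.

```python
import math

def first_principal_component(rows, n_iter=200):
    """Return (unit direction, variance along it) for the top principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply by cov and renormalize.
    # Assumes the starting vector is not orthogonal to the top component.
    v = [1.0] * d
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    return v, variance

# Points lying exactly on the line y = 2x, so one direction carries all the variance.
direction, variance = first_principal_component([[1, 2], [2, 4], [3, 6], [4, 8]])
print(direction, variance)
```

Here the direction comes out proportional to $(1, 2)$, and the variance along it equals the total variance of the data, so a single component is enough to describe these points.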
Pruning:

Q

R

$R^2$:
Random Forest: Random Forest is a supervised learning algorithm that builds an ensemble of Decision Trees, where each decision tree is allowed to use only a fixed number of randomly chosen features (in common implementations, a fresh random subset is drawn at every split). The decision trees are trained using the Bagging technique, and the outputs of the trees are merged together (by majority vote for classification or averaging for regression) to get a more accurate and stable prediction.
- Also see: Boosting
- Useful links: Blog post on towardsdatascience | Blog post on Medium
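The two ingredients, bagging and random feature subsets, can be sketched with a deliberately tiny model. To keep the example short, each "tree" here is a one-split decision stump rather than a full decision tree, and the feature subset is drawn once per stump; all function names and the data are made up for illustration.

```python
import random
from collections import Counter

def fit_stump(X, y, feature_ids):
    """Pick the (feature, threshold) split with the fewest training errors."""
    best = None
    for f in feature_ids:
        for t in sorted({row[f] for row in X}):
            left = [label for row, label in zip(X, y) if row[f] <= t]
            right = [label for row, label in zip(X, y) if row[f] > t]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = (sum(l != left_label for l in left)
                      + sum(r != right_label for r in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, left_label, right_label)
    return best[1:]

def fit_forest(X, y, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]          # bagging: bootstrap sample
        feats = rng.sample(range(len(X[0])), n_features)  # random feature subset
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict(forest, row):
    """Merge the stumps' outputs by majority vote."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]

X = [[1, 10], [2, 11], [3, 12], [8, 20], [9, 21], [10, 22]]
y = [0, 0, 0, 1, 1, 1]
forest = fit_forest(X, y)
print(predict(forest, [0, 9]), predict(forest, [11, 23]))
```

An individual stump can be wrong, for instance when its bootstrap sample happens to contain a single class, but the majority vote over many stumps smooths such cases out.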
Recall: If we are given a set of instances, recall is the fraction of relevant instances (belonging to a certain class $C$) that have been retrieved (or correctly classified in $C$) over the total number of relevant instances. A recall of 1.0 means that every item from class $C$ was labeled as belonging to class $C$, but does not say anything about other items that were incorrectly labeled as belonging to class $C$.
- Also see: Precision
- Useful links: Blog post on towardsdatascience | Precision and Recall Wiki
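The caveat about a recall of 1.0 is easy to reproduce. In this sketch (hypothetical helper `recall`, made-up labels), a classifier that labels everything as class 1 reaches perfect recall for that class:

```python
def recall(y_true, y_pred, positive):
    """Fraction of truly `positive` instances that were predicted as `positive`."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not relevant:
        return 0.0  # convention chosen here when the class never occurs
    return sum(p == positive for p in relevant) / len(relevant)

y_true = [1, 0, 0, 1, 0]
label_everything = [1, 1, 1, 1, 1]  # finds every class-1 item, plus three wrong ones
cautious = [1, 0, 0, 0, 0]          # makes no wrong class-1 call, but misses one item

recall_all = recall(y_true, label_everything, positive=1)
recall_cautious = recall(y_true, cautious, positive=1)
print(recall_all, recall_cautious)
```

This is why recall is usually read together with precision.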
Regression: Regression is the problem of approximating a mapping function ($f$) from input variables ($X$) to a continuous output variable ($y$), on the basis of a training set of data containing observations in the form of input-output pairs.
- Also see: Linear Regression
- Useful links: Video Explanation by Trevor Hastie
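As a minimal sketch of approximating $f$ from input-output pairs, the snippet below fits a straight line by ordinary least squares. Choosing a linear form for $f$ is an assumption of this example, and the function name and data are made up.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept to the training pairs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Training pairs generated by y = 2x + 1.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])

# The fitted function can now be applied to an unseen input.
prediction = slope * 10 + intercept
print(slope, intercept, prediction)
```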
Relative Entropy: See Kullback–Leibler Divergence.
Ridge Regression:

S

Sensitivity: Same as Recall.
Specificity: If we are given a set of instances, specificity measures the proportion of actual negatives (instances not belonging to a particular class) that are correctly identified as such (e.g., the percentage of healthy people who are correctly identified as not having the condition).
- Useful links: Specificity Wiki.
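In terms of confusion-matrix counts, where $TN$ is the number of true negatives and $FP$ the number of false positives (symbols introduced here for illustration), specificity can be written as:

$$\text{Specificity} = \frac{TN}{TN + FP}$$

For example, if 100 people are actually healthy and a test correctly clears 90 of them while flagging the other 10, then $TN = 90$, $FP = 10$ and the specificity is $90 / (90 + 10) = 0.9$.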
Standard Score: Same as Z-score.
Standard Error:
Stratified Cross Validation:
Supervised Learning: Supervised learning is the task of learning a function that maps an unseen input to an output as accurately as possible, based on example input-output pairs known as training data.
- Also see: Classification | Regression
- Useful links: Coursera Video Explanation | Supervised Learning Wiki
Support Vector Machine: Support Vector Machine (SVM), in simplest terms, is a classification algorithm that aims to find a decision boundary separating two classes such that the closest data points from either class are as far from the boundary as possible. Having a large margin between the two classes contributes to the robustness and generalizability of the SVM.
- Also see: Hinge Loss
- Useful links: Blog post by Me | Video Lecture by Patrick Winston
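The margin can be computed directly for any candidate linear boundary $w \cdot x + b = 0$. The sketch below does not train an SVM; it only compares two hand-picked boundaries on made-up data to show which one the margin criterion would prefer. The function name and the labels in $\{-1, +1\}$ are assumptions of this example.

```python
import math

def geometric_margin(w, b, X, y):
    """Distance from the boundary w.x + b = 0 to the closest correctly classified point."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(label * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
               for x, label in zip(X, y))

X = [(2, 0), (3, 1), (-2, 0), (-3, -1)]
y = [1, 1, -1, -1]

centered = geometric_margin((1, 0), 0, X, y)   # vertical boundary at x = 0
shifted = geometric_margin((1, 0), -1, X, y)   # vertical boundary at x = 1
print(centered, shifted)
```

Both boundaries separate the classes, but the centered one keeps the closest points twice as far away, so it is the one a maximum-margin classifier would favor.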

T

T-Test: The t-test is a type of inferential statistic used to determine whether there is a significant difference between the means of two groups. The t-test assumes that the two groups follow a normal distribution and calculates the t-value (an extension of the z-score), which corresponds to a certain probability value (p-value). The p-value is the probability of observing a difference at least as large as the one measured if the two group means were actually equal; if it is below a certain agreed-upon threshold, the t-test concludes that the two means are significantly different.
- Useful links: Blog post by University of Connecticut | Description on investopedia
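The t-value itself is simple to compute. This sketch uses Welch's form of the two-sample statistic, which does not assume equal variances; that choice, the function name, and the data are assumptions of this example. Turning the t-value into a p-value requires the t-distribution, which is left out here.

```python
import math
import statistics

def welch_t(a, b):
    """Two-sample t-value: difference in means divided by its standard error."""
    standard_error = math.sqrt(statistics.variance(a) / len(a)
                               + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error

group_a = [5, 6, 7, 8, 9]
group_b = [1, 2, 3, 4, 5]

t_value = welch_t(group_a, group_b)
print(t_value)
```

The larger the absolute t-value, the smaller the p-value, and the stronger the evidence that the two means differ.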
True Positive Rate: Same as Recall.
True Negative Rate: Same as Specificity.

U

Unsupervised Learning: Unsupervised learning is the task of inferring patterns from data without having any reference to known, or labeled, outcomes. It is generally used for discovering the underlying structure of the data.
- Also see: Principal Component Analysis
- Useful links: Blog post by Hackernoon | Coursera Video Explanation

V

W

X

Y

Z

Z-score: Z-score is a measure of how many standard deviations below or above the population mean a raw score is, thus giving us a good picture when we want to compare results from a test to a “normal” population.
- Also see: T-Test
- Useful links: Z-score Wiki | Khan Academy tutorial on Z-score
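A short sketch of the computation, treating a made-up list of scores as the whole population (so the population standard deviation is used):

```python
import statistics

def z_score(x, population):
    """Number of population standard deviations that x lies above (+) or below (-) the mean."""
    return (x - statistics.mean(population)) / statistics.pstdev(population)

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population standard deviation 2

top = z_score(9, scores)
bottom = z_score(2, scores)
print(top, bottom)
```

A score of 9 sits two standard deviations above the mean, while a score of 2 sits one and a half standard deviations below it.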

Written on January 1, 2019

Rishabh Misra

ML Engineer & Researcher

Book Author 'Sculpting Data for ML'

Program Committee @ Leading AI Conferences
