Machine Learning Glossary

Introduction

The goal of this post is to briefly explain popular (and unpopular) concepts in Machine Learning, an idea that stemmed from my struggles to find good-quality explanations of various Machine Learning concepts on the web. Unlike similar posts on the web, here you’ll also find links to good-quality resources and to related concepts for a more holistic understanding. Hopefully, this post will be helpful both to people who are just starting out in Machine Learning and to those who need a quick refresher on some concepts.

Didn’t find what you were looking for? Consider contributing by creating a pull request on this post here.

Jump to

A . B . C . D . E . F . G . H . I . J . K . L . M . N . O . P . Q . R . S . T . U . V . W . X . Y . Z

A

  • A/B Testing: A/B testing, also known as split testing, is a statistical method used to compare two versions of a variable (A and B) to determine which one performs better in a controlled environment. It is widely used in marketing, web design, and product development to test changes to a web page or product against the current design and determine which one produces better results. The goal is to identify the impact of a change and make data-driven decisions.
  • Accuracy: Accuracy is a metric used to evaluate the performance of a classification model. It is defined as the ratio of the number of correct predictions to the total number of predictions. While accuracy is a straightforward and intuitive measure, it can be misleading in the case of imbalanced datasets, where the majority class dominates the prediction results.
  • Activation Function: Activation functions are mathematical functions used in neural networks to introduce non-linearity into the model. This non-linearity allows the network to learn complex patterns. Common activation functions include the sigmoid function, which maps inputs to values between 0 and 1, the hyperbolic tangent (tanh), which maps inputs to values between -1 and 1, and the Rectified Linear Unit (ReLU), which replaces negative values with zero. Activation functions are crucial in deep learning, particularly in multi-layer networks, as they enable the network to stack layers and learn intricate relationships.
  • Active Learning: Active learning is a machine learning approach where the model is allowed to interactively query a user or some other information source to obtain the desired outputs at new data points. This is particularly useful when labeled data is scarce or expensive to obtain. The model identifies the most informative data points that, when labeled, would most improve its performance.
  • AdaBoost (Adaptive Boosting): AdaBoost is an ensemble learning technique that combines multiple weak classifiers to form a strong classifier. It works by sequentially training weak learners, typically decision trees, on weighted versions of the data. Misclassified instances in each iteration are given more weight so that subsequent classifiers focus on the harder-to-classify instances. AdaBoost aims to minimize the error rate by focusing on difficult cases and is used for both classification and regression tasks. It is known for its ability to improve the performance of models with lower complexity.
  • Alignment: In machine learning, alignment refers to the process of ensuring that the model’s objectives are consistent with the user’s goals. This is crucial in AI ethics and safety, especially in reinforcement learning where the agent’s goals must align with human values and safety constraints. Proper alignment ensures that the AI system behaves in a predictable and beneficial manner.
  • ANOVA (Analysis of Variance): ANOVA is a statistical method used to compare the means of three or more groups to determine whether at least one of the group means is statistically different from the others. It helps in determining the influence of one or more factors by comparing the means of different samples. ANOVA is used in various fields including biology, economics, and engineering to analyze experimental data.
  • Artificial Neural Network (ANN): An Artificial Neural Network (ANN) is a computational model inspired by the way biological neural networks in the human brain process information. ANNs consist of interconnected units (neurons) organized in layers: input, hidden, and output layers. Each connection has a weight that adjusts as learning proceeds, based on a training algorithm like backpropagation. ANNs are used in various tasks including image and speech recognition, natural language processing, and game playing. Their ability to model complex non-linear relationships makes them powerful tools in many AI applications.
  • Attention Mechanism: The attention mechanism is a component used in neural networks, particularly in sequence models like transformers. It allows the model to focus on different parts of the input sequence, enabling it to capture dependencies and relationships more effectively. Attention mechanisms are integral to models like BERT and GPT, which have achieved state-of-the-art results in natural language processing tasks.
  • Autoencoder: An autoencoder is a type of neural network used for unsupervised learning. It aims to learn a compressed representation (encoding) of the input data and then reconstruct the data from this encoding. An autoencoder consists of an encoder that maps the input to a latent-space representation and a decoder that maps the latent space back to the original input. Autoencoders are used for dimensionality reduction, denoising, and feature learning. Variants like variational autoencoders (VAEs) introduce probabilistic elements to model the data distribution.
  • AUC: AUC is the Area Under the Receiver Operating Characteristic (ROC) Curve. The ROC curve is obtained by varying the classification threshold of a binary classifier and plotting the true positive rate (TPR) against the false positive rate (FPR) at each threshold. AUC is a popular classification performance metric with several nice properties, such as being independent of the decision threshold and being relatively robust to class imbalance in the data. A minimal sketch is shown after this list.
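
A minimal sketch of computing the ROC curve and AUC, assuming scikit-learn is available; the labels and scores below are made-up illustrative values:

```python
# A minimal sketch: compute ROC curve and AUC for a binary classifier
# using scikit-learn. The labels and scores are illustrative values.
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 1, 0]                     # ground-truth binary labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.5]   # classifier scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # TPR/FPR at each threshold
auc = roc_auc_score(y_true, y_score)               # area under the ROC curve
print(f"AUC = {auc:.3f}")
```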

Back to Top

B

  • Bagging: Bagging (bootstrap aggregating) is a procedure that produces several different training sets of the same size by sampling with replacement, and then trains a machine learning model on each set. The predictions are produced by taking a majority vote in a classification task and by averaging in a regression task. Bagging helps in reducing the variance of models. A minimal sketch is shown after this list.
  • Bias-Variance Trade-off: Bias here refers to the difference between the average prediction of a model and the target value the model is trying to predict. Variance refers to the variability in the model’s predictions for a given data point due to its sensitivity to small fluctuations in the training set. If our model is too simple and has very few parameters, it may have high bias and low variance. On the other hand, if our model has a large number of parameters, it may have high variance and low bias. Thus, we need to find a good balance between bias and variance without overfitting or underfitting the data.
  • Bootstrapping: Bootstrapping is the process of creating multiple subsets of the dataset by sampling with replacement. Each subset is the same size as the original dataset, and the samples are called bootstrap samples. It is used in bagging.
  • Boosting: Boosting is an ensemble method for improving the model predictions of any given learning algorithm. The idea is to train weak learners sequentially, each trying to correct its predecessor, to build strong learners. A weak learner is defined to be a classifier that is only slightly correlated with the true classification (it can label examples better than random guessing). In contrast, a strong learner is a classifier that is arbitrarily well-correlated with the true classification.
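
A minimal sketch of bagging, assuming scikit-learn is available; the synthetic dataset and the number of estimators are arbitrary illustrative choices:

```python
# A minimal sketch: bagging an ensemble of decision trees with scikit-learn
# on a synthetic dataset (dataset parameters are arbitrary illustrative choices).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 50 base learners (decision trees by default) is trained on a
# bootstrap sample of the training set; predictions are combined by majority vote.
bagger = BaggingClassifier(n_estimators=50, random_state=0)
bagger.fit(X_train, y_train)
print("Test accuracy:", bagger.score(X_test, y_test))
```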

Back to Top

C

  • Classification: Classification is the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training set of data containing observations whose category membership is known.
  • Clustering: Clustering is an unsupervised learning technique used to group similar data points together based on their features. The goal is to partition the data into clusters where points within each cluster are more similar to each other than to those in other clusters. Common clustering algorithms include k-means, hierarchical clustering, and DBSCAN. Clustering is widely used in various applications such as market segmentation, image segmentation, and anomaly detection.
  • Confusion Matrix: A confusion matrix is a table used to evaluate the performance of a classification model. It compares the actual target values with those predicted by the model. The matrix includes four key counts: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). From these, other metrics such as accuracy, precision, recall, and F1-score can be derived. The confusion matrix provides a detailed breakdown of the model’s performance.
  • Convolutional Neural Network (CNN): A Convolutional Neural Network (CNN) is a deep learning algorithm commonly used for image recognition and processing tasks. CNNs are designed to automatically and adaptively learn spatial hierarchies of features through convolutional layers, pooling layers, and fully connected layers. The convolutional layers apply filters to the input image, capturing local patterns such as edges, textures, and shapes. CNNs are widely used in computer vision applications, including image classification, object detection, and segmentation.
  • Correlation: Correlation is a statistical technique that can show whether and how strongly pairs of variables are related. Pearson’s Correlation Coefficient is used to measure the strength of correlation between two variables.
  • Cross Validation: Cross-validation is a statistical technique used to assess the generalization performance of a machine learning model. The most common form is k-fold cross-validation, where the data is divided into k subsets (folds). The model is trained on k-1 folds and tested on the remaining fold. This process is repeated k times, with each fold used exactly once as the test set. Cross-validation helps in selecting models and tuning hyperparameters by providing a more reliable estimate of model performance. A minimal sketch is shown after this list.
  • Curse of Dimensionality: As the number of features (dimensions) in a model grows, the amount of data needed for the model to generalize well grows exponentially, while storage space and processing time also increase. Beyond a point, the value added by an additional dimension becomes much smaller than the overhead it adds to the algorithm.
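
A minimal sketch of k-fold cross-validation, assuming scikit-learn is available; the model and synthetic dataset are arbitrary illustrative choices:

```python
# A minimal sketch: 5-fold cross-validation of a logistic regression model
# with scikit-learn on a synthetic dataset (all parameters are illustrative).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Each fold is used exactly once as the held-out test set.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())
```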

Back to Top

D

Back to Top

E

  • Elastic Net Regression: Elastic Net Regression is a regularized regression technique that linearly combines the penalties of the L1 (Lasso) and L2 (Ridge) regularization methods. It is particularly useful when there are multiple correlated features, as it can select groups of correlated variables. The elastic net penalty encourages sparsity (like Lasso) and also includes a ridge regression penalty to maintain stability. This approach balances between the feature selection property of Lasso and the regularization strength of Ridge Regression.
  • Entropy: In the context of machine learning, entropy is a measure of the uncertainty or impurity in a dataset. It quantifies the amount of disorder or randomness. In decision trees, entropy is used to decide the best split at each node by measuring the impurity before and after the split. Lower entropy indicates higher purity, meaning the data is more homogeneous. Entropy is a fundamental concept in information theory and is used to build efficient models by minimizing uncertainty. A minimal sketch of computing entropy is shown after this list.
  • Ensemble Learning: Ensemble learning is a technique that combines multiple individual models (often called base learners or weak learners) to create a stronger overall model. The primary goal is to improve the predictive performance by reducing variance, bias, or improving accuracy. Popular ensemble methods include bagging (e.g., Random Forest), boosting (e.g., AdaBoost, Gradient Boosting), and stacking. Ensemble learning is widely used in both classification and regression tasks due to its ability to enhance model robustness and accuracy.
  • Empirical Risk Minimization (ERM): Empirical Risk Minimization (ERM) is a principle in statistical learning theory where the goal is to minimize the empirical risk, i.e., the average loss over the training sample. The empirical risk is calculated based on a chosen loss function that quantifies the error between predicted and actual values. ERM is fundamental to the training process of machine learning models, guiding the optimization of model parameters to fit the training data.
    • Useful links: [Empirical Risk Minimization Explained](https://mlweb.loria.fr/book/en/erm.html#:~:text=The%20Empirical%20Risk%20Minimization%20(ERM,alternative%20name%20of%20empirical%20risk.)
  • Epoch: An epoch in machine learning refers to one complete pass of the training dataset through the algorithm during the training process. Training a model typically involves multiple epochs, as each pass helps the model learn the underlying patterns in the data. After each epoch, the model’s parameters are updated to minimize the loss function. The number of epochs is a hyperparameter that needs to be tuned to ensure the model neither underfits nor overfits the training data.
  • Exploratory Data Analysis (EDA): Exploratory Data Analysis (EDA) is an approach to analyzing datasets to summarize their main characteristics, often using visual methods. EDA is a crucial step in the data preprocessing phase, helping to understand the data’s structure, detect outliers, find patterns, and test hypotheses. Techniques used in EDA include summary statistics, data visualization (e.g., histograms, box plots), and correlation analysis. EDA provides insights that guide further data cleaning and feature engineering processes.
  • Exponential Smoothing: Exponential smoothing is a time series forecasting method for univariate data. It involves using weighted averages of past observations, where the weights decrease exponentially as the observations get older. The simplest form, single exponential smoothing, is used for data without a trend or seasonality. More complex forms, like double and triple exponential smoothing (Holt-Winters), can capture trends and seasonality. Exponential smoothing is widely used in forecasting applications due to its simplicity and effectiveness.
  • Early Stopping: Early stopping is a regularization technique used to prevent overfitting in machine learning models, particularly in deep learning. It involves monitoring the model’s performance on a validation set during training and stopping the training process once the performance stops improving. This helps in finding the optimal number of epochs for training, ensuring that the model generalizes well to unseen data without overfitting the training data.
  • Embedding: Embedding refers to the representation of categorical data or discrete items in a continuous vector space. This technique is commonly used in natural language processing (NLP) to convert words into dense vectors of real numbers, capturing semantic meaning. Word embeddings like Word2Vec, GloVe, and FastText are examples. Embeddings are also used in recommendation systems, where items (e.g., movies, products) are represented in a latent space to capture similarities and relationships.
  • Evaluation Metrics: Evaluation metrics are measures used to assess the performance of a machine learning model. Common metrics for classification tasks include accuracy, precision, recall, F1-score, and area under the ROC curve (AUC-ROC). For regression tasks, metrics like mean squared error (MSE), mean absolute error (MAE), and R-squared are used. Choosing the right evaluation metric is crucial as it directly impacts model selection, hyperparameter tuning, and overall performance assessment.
  • Evolutionary Algorithms: Evolutionary Algorithms (EAs) are a subset of evolutionary computation, inspired by natural selection and genetics. They are used to solve optimization problems by iteratively improving a set of candidate solutions. Key components include selection, mutation, crossover, and reproduction. Common types of EAs include Genetic Algorithms, Genetic Programming, and Evolutionary Strategies. EAs are particularly useful for solving complex problems where traditional optimization methods are not effective.
  • Expectation-Maximization (EM) Algorithm: The Expectation-Maximization (EM) algorithm is an iterative method used for finding maximum likelihood estimates of parameters in statistical models, especially when the data involves latent variables. It consists of two steps: the Expectation step (E-step), which estimates the missing data given the current parameters, and the Maximization step (M-step), which maximizes the likelihood function based on the estimated data. EM is commonly used in clustering (e.g., Gaussian Mixture Models) and missing data imputation.
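
A minimal sketch of computing entropy from class labels, assuming NumPy is available; the label lists are illustrative:

```python
# A minimal sketch: Shannon entropy of a label distribution, as used to
# measure impurity when splitting a decision tree node.
import numpy as np

def entropy(labels):
    """Entropy in bits: H = -sum(p * log2(p)) over class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

print(entropy([0, 0, 1, 1]))   # 1.0   -> maximally impure (50/50 split)
print(entropy([0, 0, 0, 0]))   # 0.0   -> perfectly pure
print(entropy([0, 0, 0, 1]))   # ~0.811
```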

Back to Top

F

  • Feature Engineering: Feature engineering is the process of using domain knowledge to create new features or modify existing ones to improve the performance of a machine learning model. This process includes techniques such as normalization, encoding categorical variables, creating interaction terms, and extracting features from date-time data. Effective feature engineering can significantly enhance a model’s predictive power and is often more impactful than choosing complex algorithms.
  • Feature Selection: Feature selection is the process of selecting a subset of relevant features for building a robust machine learning model. The main goal is to improve the model’s performance by reducing overfitting, enhancing generalization, and decreasing training time. Common techniques include filter methods (e.g., mutual information), wrapper methods (e.g., recursive feature elimination), and embedded methods (e.g., Lasso regression). Feature selection helps in identifying the most significant features that contribute to the predictive power of the model.
  • Forward Propagation: Forward propagation is the process of moving input data through the layers of a neural network to generate an output. During this process, each neuron computes a weighted sum of its inputs, applies an activation function, and passes the result to the next layer. Forward propagation is a crucial step in both training and prediction phases of neural networks, as it computes the predicted output for a given input.
  • F1 Score: The F1 Score is a metric used to evaluate the performance of a classification model. It is the harmonic mean of precision and recall, providing a single measure that balances both concerns. The F1 Score is particularly useful when dealing with imbalanced datasets, where focusing solely on accuracy can be misleading. A high F1 Score indicates that the model has both high precision (few false positives) and high recall (few false negatives).
  • Feature Scaling: Feature scaling is a technique used to normalize the range of independent variables or features of data. In machine learning, this is important for algorithms that compute distances between data points, like k-nearest neighbors (KNN) and support vector machines (SVM). Common methods of feature scaling include Min-Max scaling and Standardization (Z-score normalization). Scaling ensures that each feature contributes equally to the model’s performance. A minimal sketch is shown after this list.
  • Factorization Machines: Factorization Machines are a type of model that generalizes matrix factorization and linear regression, designed to capture interactions between features in high-dimensional sparse datasets. They are particularly effective for recommendation systems and predictive modeling tasks where interactions between variables are significant. Factorization Machines model pairwise interactions among features, providing a powerful way to handle large and sparse data.
  • False Positive Rate (FPR): The False Positive Rate (FPR) is a metric used to measure the proportion of negative instances that are incorrectly classified as positive by a machine learning model. It is calculated as the ratio of false positives (FP) to the sum of false positives and true negatives (TN). FPR is crucial in scenarios where the cost of false positives is high, such as fraud detection or medical diagnosis.
  • Forward Selection: Forward Selection is a feature selection technique used to iteratively add features to a model based on their statistical significance. Starting with no features, the algorithm adds one feature at a time, selecting the one that improves the model the most at each step. This process continues until adding more features does not significantly improve the model. Forward Selection helps in identifying the most important features and building a parsimonious model.
  • Fisher’s Linear Discriminant: Fisher’s Linear Discriminant is a linear classification technique used to find a linear combination of features that best separates two or more classes of objects. It maximizes the ratio of the variance between the classes to the variance within the classes, providing a projection that enhances class separability. Fisher’s Linear Discriminant is widely used in pattern recognition and machine learning for its simplicity and effectiveness.
  • Fine-Tuning: Fine-tuning is a transfer learning technique where a pre-trained model is further trained on a new, typically smaller, dataset. This process involves adjusting the weights of the pre-trained model slightly to better fit the new data while leveraging the knowledge the model has already learned. Fine-tuning is commonly used in natural language processing (NLP) and computer vision tasks to adapt pre-trained models to specific applications.
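
A minimal sketch of Min-Max scaling and standardization with NumPy; the data matrix is an arbitrary example:

```python
# A minimal sketch: Min-Max scaling and standardization (Z-score normalization)
# implemented with NumPy; the data matrix below is an arbitrary example.
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Min-Max scaling: map each feature to the [0, 1] range.
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: zero mean and unit standard deviation per feature.
X_standard = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax)
print(X_standard)
```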

Back to Top

G

  • Generative Classifiers: Generative classifiers model the joint probability distribution of the inputs and labels, typically by learning $P(X \mid Y)$ and $P(Y)$, and then apply Bayes’ rule to predict the most likely class. Examples include Naive Bayes and classifiers based on Gaussian Mixture Models. They contrast with discriminative classifiers, which model $P(Y \mid X)$ directly.
  • Gradient Descent: Gradient Descent is an optimization technique to minimize a loss function by computing the gradients of the loss function with respect to the model’s parameters, conditioned on training data. Informally, gradient descent iteratively adjusts the parameters in the direction opposite to the gradient, gradually finding a combination that minimizes the loss. A minimal sketch is shown after this list.
  • Grid Search: Grid search is a hyperparameter tuning technique that exhaustively evaluates a model over a manually specified grid of hyperparameter values, typically scoring each combination with cross-validation, and selects the combination that performs best.
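
A minimal sketch of batch gradient descent for a one-variable linear regression, minimizing mean squared error; the data, learning rate, and number of iterations are arbitrary illustrative choices:

```python
# A minimal sketch: batch gradient descent for simple linear regression
# (one weight and one bias) minimizing mean squared error.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])   # roughly y = 2x + 1 with noise

w, b = 0.0, 0.0
lr = 0.01                            # learning rate (step size)

for _ in range(2000):
    y_pred = w * X + b
    error = y_pred - y
    grad_w = 2 * np.mean(error * X)  # d(MSE)/dw
    grad_b = 2 * np.mean(error)      # d(MSE)/db
    w -= lr * grad_w                 # step opposite to the gradient
    b -= lr * grad_b

print(f"w = {w:.2f}, b = {b:.2f}")   # should recover a slope near 2 and an intercept near 1
```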

Back to Top

H

  • Hinge Loss: Hinge loss is used in the context of classification problems and is defined as $\ell(y) = \max(0, 1 - t \cdot y)$, where $t \in \{-1, +1\}$ is the true label and $y$ is the classifier’s score. Observing the function, we can see that the classifier is penalized unless it classifies a data point correctly with a margin of at least 1 (i.e., $t \cdot y \geq 1$). This leads to “maximum-margin” classification, where each training data point is as far from the classifier’s decision boundary as possible. A minimal sketch is shown after this list.
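
A minimal sketch of computing hinge loss with NumPy; the labels and scores are illustrative:

```python
# A minimal sketch: mean hinge loss for a batch of examples, given true labels
# in {-1, +1} and raw classifier scores (values below are illustrative).
import numpy as np

def hinge_loss(t, y):
    """Mean hinge loss: max(0, 1 - t * y) averaged over the batch."""
    return np.mean(np.maximum(0.0, 1.0 - t * y))

t = np.array([+1, -1, +1, -1])       # true labels
y = np.array([2.3, -0.5, 0.4, 1.2])  # classifier scores

print(hinge_loss(t, y))  # only confidently correct predictions incur zero loss
```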

Back to Top

I

Back to Top

J

  • Jaccard Similarity: Jaccard Similarity is a statistic used for comparing the similarity and diversity of finite sample sets. It is defined as the size of the intersection divided by the size of the union of the sample sets $\left(J(A, B) = \frac{|A \cap B|}{|A \cup B|}\right)$. A minimal sketch is shown after this list.
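
A minimal sketch of Jaccard similarity for two sets; the example sets are illustrative:

```python
# A minimal sketch: Jaccard similarity of two finite sets, |A ∩ B| / |A ∪ B|.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(jaccard({1, 2, 3}, {2, 3, 4}))  # 2 shared / 4 total = 0.5
```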

Back to Top

K

  • K-Nearest Neighbor: KNN is essentially a classification technique that finds the $K$ data points in the training data which are most similar to an unseen data point, and takes a majority vote of their labels to make a classification. KNN is a non-parametric method, which means that it does not make any assumptions about the underlying data distribution. The performance of KNN depends on the data representation and the definition of closeness/similarity.
  • Kullback–Leibler Divergence: Kullback–Leibler divergence is a measure of how one probability distribution differs from a second, reference probability distribution. A familiar use case: when we replace observed data or a complex distribution with a simpler approximating distribution, KL divergence measures just how much information we lose by choosing the approximation. A minimal sketch is shown after this list.
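
A minimal sketch of KL divergence between two discrete distributions, assuming NumPy and SciPy are available; the distributions are illustrative:

```python
# A minimal sketch: KL divergence D_KL(P || Q) between two discrete
# distributions, computed directly and via SciPy (distributions are illustrative).
import numpy as np
from scipy.stats import entropy

p = np.array([0.5, 0.3, 0.2])   # "true" distribution P
q = np.array([0.4, 0.4, 0.2])   # approximating distribution Q

kl_manual = np.sum(p * np.log(p / q))
kl_scipy = entropy(p, q)         # scipy's entropy(p, q) returns D_KL(P || Q)

print(kl_manual, kl_scipy)       # the two values agree
```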

Back to Top

L

  • Lasso Regression: Lasso Regression is a regularized linear regression technique that adds an L1 penalty (the sum of the absolute values of the coefficients) to the loss function. The penalty drives some coefficients exactly to zero, so Lasso performs feature selection in addition to regularization.
  • Learning Curve: A learning curve is a plot of a model’s performance (e.g., training and validation error) against the amount of training data or the number of training iterations. Learning curves help diagnose underfitting and overfitting and indicate whether collecting more data is likely to help.
  • Linear Discriminant Analysis: Linear Discriminant Analysis (LDA) is a classification and dimensionality reduction technique that projects data onto a lower-dimensional space chosen to maximize the separation between classes. See Fisher’s Linear Discriminant.
  • Linear Regression: Linear regression models the linear relationship between a scalar dependent variable (usually called the target) and several independent variables (usually called predictors). It can be used for forecasting outcomes once the model parameters are learned using supervision from a relevant dataset. Additionally, the learned model parameters can also be used to explain the strength of the relationship between the target and the predictors (a procedure known as linear regression analysis). The model parameters are usually learned by minimizing mean squared error.
  • Logistic Regression: Logistic regression models the probability of a certain binary outcome given some predictor variables which influence the outcome. It uses a linear function of the predictor variables, like linear regression, but then transforms it into a probability using the logistic function $\left( \sigma(z) = \frac{1}{1 + e^{-z}} \right)$. The model parameters are usually learned by maximizing the likelihood of the observed data. A minimal sketch is shown after this list.
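
A minimal sketch of fitting a logistic regression classifier, assuming scikit-learn is available; the synthetic dataset is an arbitrary illustrative choice:

```python
# A minimal sketch: fitting a logistic regression classifier with scikit-learn
# on a synthetic binary classification dataset (parameters are illustrative).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)

print("Test accuracy:", clf.score(X_test, y_test))
print("P(class 1) for first test point:", clf.predict_proba(X_test[:1])[0, 1])
```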

Back to Top

M

  • Maximum Likelihood Estimation: Maximum likelihood estimation is a method of estimating the parameters of a statistical model $\theta$ such that the likelihood function $L(\theta; x)$, which is a function of model parameters given observed data $x$, is maximized. Intuitively, this selects the parameters $\theta$ that make the observed data most probable.
  • Model Selection: Model selection is the process of choosing the best model (or model configuration) from a set of candidates, typically by comparing their performance on held-out data using techniques such as cross-validation, or by using criteria such as AIC and BIC that trade off fit against complexity.

Back to Top

N

Back to Top

O

  • Ordinal Classification: Same as Ordinal Regression.
  • Ordinal Regression: Ordinal Regression is used for predicting ordinal outcomes, i.e. outcomes whose values exist on an arbitrary scale where only the relative ordering between different values is significant, based on various predictor variables. For this reason, it is considered an intermediate problem between regression and classification. Usually the ordinal regression problem is reduced to multiple binary classification problems with the help of threshold parameters, such that a classifier’s score falling within a certain threshold range corresponds to one of the ordinal outcomes.

Back to Top

P

  • Pearson’s Correlation Coefficient: The correlation coefficient ($\rho$) ranges from -1 to +1. The closer $\rho$ is to +1 or -1, the more closely the two variables are related; if it is close to 0, the variables have little or no linear relationship with each other. It is defined as $\rho_{X, Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_{X}\sigma_{Y}}$.
  • Precision: If we are given a set of instances, precision is the fraction of retrieved instances (those the model assigns to a certain class $C$) that are relevant (those that truly belong to class $C$). A perfect precision score of 1.0 means that every result retrieved was relevant, but says nothing about whether all relevant instances were retrieved.
  • Principal Component Analysis: PCA is a statistical procedure that transforms a set of observations of possibly correlated variables into a set of observations with linearly uncorrelated variables called principal components. The transformation is defined in such a way that the first principal component has the largest possible variance, and each succeeding component has the largest possible variance under the constraint that it is orthogonal to the preceding components. Utilizing only a few components that capture most of the variance in the data helps in fighting the Curse of Dimensionality. A minimal sketch is shown after this list.
  • Pruning: Pruning is a technique for reducing the size of a decision tree by removing branches that provide little predictive power, which reduces overfitting and improves generalization to unseen data. In deep learning, pruning analogously refers to removing weights or neurons that contribute little to a network’s output in order to shrink the model.
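
A minimal sketch of PCA, assuming scikit-learn and NumPy are available; the synthetic dataset is an arbitrary illustrative choice:

```python
# A minimal sketch: PCA with scikit-learn, keeping the two components that
# capture the most variance in a synthetic dataset (parameters are illustrative).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 3] = X[:, 0] * 2 + 0.1 * rng.normal(size=200)  # make one feature correlated

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)   # project onto the top-2 principal components

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Reduced shape:", X_reduced.shape)  # (200, 2)
```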

Back to Top

Q

Back to Top

R

  • $R^2$: $R^2$, the coefficient of determination, measures the proportion of the variance in the target variable that is explained by a regression model. A value close to 1 indicates that the model explains most of the variance, a value near 0 indicates it explains very little, and it can be negative for models that fit worse than simply predicting the mean.
  • Random Forest: Random Forest is a supervised learning algorithm that builds an ensemble of Decision Trees, where each tree is trained on a bootstrap sample of the data (the Bagging technique) and each split considers only a random subset of the features. The outputs of the trees are merged together to get a more accurate and stable prediction.
  • Recall: If we are given a set of instances, recall is the fraction of relevant instances (those truly belonging to a certain class $C$) that have been retrieved (correctly classified as $C$). A recall of 1.0 means that every item from class $C$ was labeled as belonging to class $C$, but does not say anything about other items that were incorrectly labeled as belonging to class $C$. A minimal sketch of computing precision and recall is shown after this list.
  • Regression: Regression is the problem of approximating a mapping function ($f$) from input variables ($X$) to a continuous output variable ($y$), on the basis of a training set of data containing observations in the form of input-output pairs.
  • Relative Entropy: See Kullback–Leibler Divergence.
  • Ridge Regression: Ridge Regression is a regularized linear regression technique that adds an L2 penalty (the sum of the squared coefficients) to the loss function. The penalty shrinks the coefficients toward zero (but not exactly to zero), which reduces variance and is particularly helpful when the predictors are highly correlated.
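
A minimal sketch of computing precision and recall, both by hand and with scikit-learn; the label arrays are illustrative:

```python
# A minimal sketch: precision and recall computed from true and predicted
# labels, both by hand and with scikit-learn (labels are illustrative).
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

print("precision =", tp / (tp + fp), precision_score(y_true, y_pred))
print("recall    =", tp / (tp + fn), recall_score(y_true, y_pred))
```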

Back to Top

S

  • Sensitivity: Same as Recall.
  • Specificity: If we are given a set of instances, specificity measures the proportion of actual negatives (instances not belonging to a particular class) that are correctly identified as such (e.g., the percentage of healthy people who are correctly identified as not having the condition).
  • Standard Score: Same as Z-score.
  • Standard Error: The standard error of a statistic (most commonly the sample mean) is the standard deviation of its sampling distribution. For the mean, it is estimated as the sample standard deviation divided by the square root of the sample size, and it quantifies how much the statistic would vary across repeated samples.
  • Stratified Cross Validation: Stratified cross-validation is a variant of k-fold cross-validation in which each fold preserves (approximately) the same class proportions as the full dataset. It is especially useful for imbalanced classification problems.
  • Supervised Learning: Supervised learning is a task of learning a function that can map an unseen input to an output as accurately as possible based on the example input-output pairs known as training data.
  • Support Vector Machine: Support Vector Machine (SVM), in simplest terms, is a classification algorithm which aims to find a decision boundary that separates two classes such that the closest data points from either class are as far from the boundary as possible. Having a good margin between the two classes contributes to the robustness and generalizability of SVM. A minimal sketch is shown after this list.
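
A minimal sketch of training a linear SVM, assuming scikit-learn is available; the synthetic dataset and hyperparameters are illustrative:

```python
# A minimal sketch: training a linear support vector machine with scikit-learn
# on a synthetic dataset (dataset and hyperparameters are illustrative).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# C controls the trade-off between a wide margin and misclassified points.
svm = SVC(kernel="linear", C=1.0).fit(X_train, y_train)

print("Test accuracy:", svm.score(X_test, y_test))
```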

Back to Top

T

  • T-Test: The t-test is a type of inferential statistic used to determine whether there is a significant difference between the means of two groups. The t-test assumes that the two groups are approximately normally distributed and computes a t-value (an extension of the z-score), which maps to a probability value (p-value). The p-value denotes the probability of observing a difference at least as large as the one measured if the two group means were actually equal; if it falls below an agreed-upon threshold, the t-test concludes that the two groups are significantly different. A minimal sketch is shown after this list.
  • True Positive Rate: Same as Recall.
  • True Negative Rate: Same as Specificity.
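
A minimal sketch of an independent two-sample t-test, assuming SciPy is available; the two samples are made-up illustrative values:

```python
# A minimal sketch: an independent two-sample t-test with SciPy
# (the two samples below are made-up illustrative values).
from scipy.stats import ttest_ind

group_a = [5.1, 4.9, 5.4, 5.0, 5.2, 4.8]
group_b = [5.9, 6.1, 5.8, 6.0, 6.2, 5.7]

t_stat, p_value = ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# A small p-value (e.g., below 0.05) suggests the group means differ.
```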

Back to Top

U

Back to Top

V

Back to Top

W

Back to Top

X

Back to Top

Y

Back to Top

Z

  • Z-score: Z-score is a measure of how many standard deviations below or above the population mean a raw score is, thus giving us a good picture when we want to compare results from a test to a “normal” population.

Back to Top

Written on January 1, 2019