Introduction
Distance calculations are fundamental in many domains, such as machine learning, statistics, and data analysis. Mahalanobis distance stands out for its ability to measure the distance between a point and a distribution, accounting for correlations between variables. Unlike Euclidean distance, which assumes that all dimensions contribute equally, Mahalanobis distance scales dimensions based on the data’s covariance matrix, providing a more accurate distance metric for multivariate data.
Why is Mahalanobis Distance Important?
Mahalanobis distance is widely used in anomaly detection, clustering, and classification. In anomaly detection, it can identify outliers based on how far they deviate from the “center” of a distribution. The distance accounts not only for the magnitude of deviation but also the correlations among variables, making it an effective tool when working with multidimensional data.
In this post, we will walk through:
- What the Mahalanobis Distance Is.
- Step-by-Step Calculation.
- Installing and Setting Up the Eigen Library for C++.
- Efficient Code Implementation in C++ using Eigen.
- Examples: Singular and Non-Singular Covariance Matrices.
- Pros and Cons of Using Mahalanobis Distance.
What is the Mahalanobis Distance?
The Mahalanobis distance is a measure of the distance between a point and a distribution. It accounts for the correlations between variables in the data by considering the covariance matrix. This makes it more suitable than Euclidean distance for high-dimensional datasets where the variables are interrelated.
Mathematically, for a point [latex] x [/latex] and a distribution with mean [latex] \mu [/latex] and covariance matrix [latex] S [/latex], the Mahalanobis distance is defined as:
$$D_M(x) = \sqrt{(x - \mu)^T S^{-1} (x - \mu)}$$
Where:
- [latex] x [/latex] is the point to test.
- [latex] \mu [/latex] is the mean of the distribution.
- [latex] S^{-1} [/latex] is the inverse of the covariance matrix.
The covariance matrix [latex] S [/latex] contains the variances of the features along the diagonal and covariances between the features in the off-diagonal elements. If the covariance matrix is singular (non-invertible), we must compute a pseudo-inverse to handle the calculation.
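A useful special case for building intuition: when the features are uncorrelated, [latex] S [/latex] is diagonal with the per-feature variances [latex] \sigma_i^2 [/latex] on the diagonal, and the Mahalanobis distance reduces to a standardized Euclidean distance:
$$D_M(x) = \sqrt{\sum_i \frac{(x_i - \mu_i)^2}{\sigma_i^2}}$$
Dimensions with larger variance therefore contribute less per unit of deviation, which is precisely the scaling behavior described above.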
Step-by-Step Calculation of Mahalanobis Distance
To compute the Mahalanobis distance, follow these steps (a small worked example follows the list):
- Compute the Mean Vector: Calculate the mean of each feature in the dataset.
- Calculate the Covariance Matrix: The covariance matrix contains the variances and covariances of the features.
- Compute the Inverse of the Covariance Matrix: This is crucial for calculating the distance. If the matrix is singular, a pseudo-inverse must be used.
- Subtract the Mean from the point.
- Apply the Mahalanobis Formula to compute the distance.
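To make these steps concrete, here is a small worked example with illustrative numbers (the mean, covariance matrix, and test point below are assumed for demonstration rather than derived from a dataset):
$$\mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad S = \begin{pmatrix} 4 & 0 \\ 0 & 1 \end{pmatrix}, \quad S^{-1} = \begin{pmatrix} 1/4 & 0 \\ 0 & 1 \end{pmatrix}, \quad x = \begin{pmatrix} 2 \\ 2 \end{pmatrix}$$
Subtracting the mean gives [latex] x - \mu = (2, 2)^T [/latex], and applying the formula:
$$D_M(x) = \sqrt{\frac{2^2}{4} + \frac{2^2}{1}} = \sqrt{1 + 4} = \sqrt{5} \approx 2.236$$
Note how the first dimension, despite the same raw deviation, contributes less to the distance because its variance is larger.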
Installing and Setting Up the Eigen Library for C++
We’ll use the Eigen library for matrix operations in C++. Here’s how to install and set up Eigen on different operating systems.
macOS
Using Homebrew:
brew install eigen
Manual Installation:
- Download Eigen from Eigen’s official website.
- Extract the files and place them in /usr/local/include or within your project folder.
Linux
Using APT (Debian/Ubuntu-based systems):
sudo apt-get install libeigen3-dev
Manual Installation:
- Download and extract Eigen from the official website.
- Place the files in /usr/local/include or your project folder.
Windows
Using vcpkg (Microsoft’s package manager for C++):
vcpkg install eigen3
- Make sure to integrate vcpkg with your project:
./vcpkg integrate install
Manual Installation:
- Download Eigen, extract it, and place the folder in a directory (e.g., C:\eigen).
- Configure your project settings to include the Eigen folder in the compiler’s include path.
Configuring Eigen in CMake (For All OS)
Add the following lines to your CMakeLists.txt file to include Eigen:
cmake_minimum_required(VERSION 3.10)
project(mahalanobis_example)
set(CMAKE_CXX_STANDARD 20)
# Find Eigen3
find_package(Eigen3 REQUIRED)
include_directories(${EIGEN3_INCLUDE_DIR})
# Add the executable
add_executable(mahalanobis_example main.cpp)
Rebuild your project to ensure Eigen is properly linked and ready to use.
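With the CMakeLists.txt above and your code in main.cpp, a typical out-of-source build from the command line looks like this (assuming CMake 3.13 or newer for the -S/-B flags; on single-configuration generators such as Makefiles, the binary lands directly in build/):
cmake -S . -B build
cmake --build build
./build/mahalanobis_example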
Code Implementation in C++ Using Eigen
Now that Eigen is set up, let’s write the code to calculate Mahalanobis distance in C++. We’ll handle both singular and non-singular covariance matrices.
#include <algorithm>
#include <cmath>
#include <iostream>
#include <limits>

#include <Eigen/Dense>

// Function to calculate the mean vector of the dataset
Eigen::VectorXd calculateMean(const Eigen::MatrixXd &data) {
    return data.colwise().mean(); // Calculate the mean of each column (feature)
}

// Function to calculate the covariance matrix of the dataset
Eigen::MatrixXd calculateCovariance(const Eigen::MatrixXd &data, const Eigen::VectorXd &mean) {
    Eigen::MatrixXd centered = data.rowwise() - mean.transpose(); // Center data by subtracting mean
    return (centered.adjoint() * centered) / double(data.rows() - 1); // Sample covariance matrix
}

// Function to calculate the Moore-Penrose pseudo-inverse via SVD
Eigen::MatrixXd pseudoInverse(const Eigen::MatrixXd &matrix) {
    Eigen::JacobiSVD<Eigen::MatrixXd> svd(matrix, Eigen::ComputeThinU | Eigen::ComputeThinV);
    // Singular values below this tolerance are treated as exactly zero
    double tolerance = std::numeric_limits<double>::epsilon()
                       * std::max(matrix.cols(), matrix.rows())
                       * svd.singularValues().array().abs()(0);
    // Invert only the singular values above the tolerance; zero out the rest
    Eigen::ArrayXd invSingularValues =
        (svd.singularValues().array().abs() > tolerance)
            .select(svd.singularValues().array().inverse(), 0.0);
    return svd.matrixV() * invSingularValues.matrix().asDiagonal() * svd.matrixU().adjoint();
}

// Function to calculate the Mahalanobis distance
double mahalanobisDistance(const Eigen::VectorXd &x, const Eigen::VectorXd &mean, const Eigen::MatrixXd &covariance) {
    Eigen::MatrixXd covInverse;
    // A floating-point determinant is rarely exactly zero, so compare it
    // against a small tolerance to detect (near-)singular matrices
    if (std::abs(covariance.determinant()) < 1e-12) {
        std::cerr << "Covariance matrix is singular, using pseudo-inverse instead." << std::endl;
        covInverse = pseudoInverse(covariance);
    } else {
        covInverse = covariance.inverse(); // Use standard inverse if covariance matrix is non-singular
    }
    Eigen::VectorXd diff = x - mean; // Difference vector (x - mean)
    double quadraticForm = diff.transpose() * covInverse * diff; // (x - mean)^T S^{-1} (x - mean)
    return std::sqrt(quadraticForm); // Mahalanobis distance formula
}
Explanation of the Code Implementation
This code calculates the Mahalanobis distance between a point and a distribution using matrix operations in C++. Here’s a breakdown of each part:
- Mean Vector Calculation (calculateMean): This function computes the mean of each column (feature) in the dataset, which represents the average value for each variable.
- Covariance Matrix Calculation (calculateCovariance): The covariance matrix is computed by centering the data (subtracting the mean from each row) and calculating the covariances between the variables. The covariance matrix captures the relationships between the variables in the dataset.
- Pseudo-Inverse Calculation (pseudoInverse): If the covariance matrix is singular (non-invertible), we calculate the Moore-Penrose pseudoinverse using Singular Value Decomposition (SVD). This allows us to compute a meaningful distance even when the covariance matrix doesn’t have a standard inverse.
- Mahalanobis Distance Calculation (mahalanobisDistance): This function calculates the Mahalanobis distance using the formula [latex] D_M(x) = \sqrt{(x - \mu)^\top S^{-1} (x - \mu)} [/latex]. If the covariance matrix is numerically singular, the pseudoinverse is used; otherwise, the standard inverse is applied. A more numerically stable alternative for the non-singular case is sketched after this list.
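As a design note, forming the explicit inverse is convenient for exposition, but solving a linear system is generally more numerically stable than inverting. Here is a minimal sketch for the non-singular case using Eigen’s LDLT decomposition (the function name mahalanobisDistanceLDLT is our own, not part of Eigen):
// Sketch: Mahalanobis distance for a non-singular covariance matrix,
// solving S * y = (x - mean) instead of forming S^{-1} explicitly
double mahalanobisDistanceLDLT(const Eigen::VectorXd &x,
                               const Eigen::VectorXd &mean,
                               const Eigen::MatrixXd &covariance) {
    Eigen::VectorXd diff = x - mean;
    Eigen::VectorXd solved = covariance.ldlt().solve(diff); // y = S^{-1} (x - mean)
    return std::sqrt(diff.dot(solved));
}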
Examples: Singular and Non-Singular Covariance Matrices with SciPy Verification
We’ll now present two examples. For each, we calculate the Mahalanobis distance in C++ and verify the result with Python’s scipy.spatial.distance.mahalanobis function.
Example 1: Singular Covariance Matrix
int main() {
    // Example dataset: Each row is a point, and each column is a feature
    Eigen::MatrixXd data(5, 3);
    data << 1.0, 2.0, 3.0,
            4.0, 5.0, 6.0,
            7.0, 8.0, 9.0,
            1.5, 2.5, 3.5,
            4.5, 5.5, 6.5;

    // Point to calculate the Mahalanobis distance for
    Eigen::VectorXd point(3);
    point << 3.0, 4.0, 5.0;

    // Step 1: Calculate the mean vector
    Eigen::VectorXd mean = calculateMean(data);

    // Step 2: Calculate the covariance matrix
    Eigen::MatrixXd covariance = calculateCovariance(data, mean);

    // Step 3: Calculate the Mahalanobis distance
    double distance = mahalanobisDistance(point, mean, covariance);

    // Output the result
    std::cout << "Mahalanobis Distance (Singular Case): " << distance << std::endl;

    return 0;
}
Output:
Covariance matrix is singular, using pseudo-inverse instead.
Mahalanobis Distance (Singular Case): 0.246494
To verify this result in Python:
import numpy as np
from scipy.spatial.distance import mahalanobis
# Example dataset
data = np.array([[1.0, 2.0, 3.0],
[4.0, 5.0, 6.0],
[7.0, 8.0, 9.0],
[1.5, 2.5, 3.5],
[4.5, 5.5, 6.5]])
# Point to test
point = np.array([3.0, 4.0, 5.0])
# Step 1: Calculate the mean vector
mean = np.mean(data, axis=0)
# Step 2: Calculate the covariance matrix
covariance = np.cov(data, rowvar=False)
# Step 3: Calculate Mahalanobis distance using scipy
inv_cov = np.linalg.pinv(covariance) # Use pseudo-inverse
distance = mahalanobis(point, mean, inv_cov)
# Round the result to 6 decimal places
distance_rounded = round(distance, 6)
print("Mahalanobis Distance (rounded to 6 decimal places):", distance_rounded)
Output:
Mahalanobis Distance (rounded to 6 decimal places): 0.246494
Which is the same as the C++ result.
Example 2: Non-Singular Covariance Matrix
int main() {
    // Example dataset with a non-singular covariance matrix
    Eigen::MatrixXd data(5, 3);
    data << 1.0, 2.0, 3.0,
            2.0, 3.0, 5.0,
            3.0, 6.0, 7.0,
            4.0, 8.0, 9.0,
            5.0, 9.0, 10.0;

    // Point to calculate the Mahalanobis distance for
    Eigen::VectorXd point(3);
    point << 3.5, 5.5, 6.5;

    // Step 1: Calculate the mean vector
    Eigen::VectorXd mean = calculateMean(data);

    // Step 2: Calculate the covariance matrix
    Eigen::MatrixXd covariance = calculateCovariance(data, mean);

    // Step 3: Calculate the Mahalanobis distance
    double distance = mahalanobisDistance(point, mean, covariance);

    // Output the result
    std::cout << "Mahalanobis Distance (Non-Singular Case): " << distance << std::endl;

    return 0;
}
Output:
Mahalanobis Distance (Non-Singular Case): 3.82473
To verify this result in Python:
import numpy as np
from scipy.spatial.distance import mahalanobis
# Example dataset
data = np.array([[1.0, 2.0, 3.0],
[2.0, 3.0, 5.0],
[3.0, 6.0, 7.0],
[4.0, 8.0, 9.0],
[5.0, 9.0, 10.0]])
# Point to test
point = np.array([3.5, 5.5, 6.5])
# Step 1: Calculate the mean vector
mean = np.mean(data, axis=0)
# Step 2: Calculate the covariance matrix
covariance = np.cov(data, rowvar=False)
# Step 3: Calculate Mahalanobis distance using scipy
inv_cov = np.linalg.inv(covariance) # Use the standard inverse
distance = mahalanobis(point, mean, inv_cov)
# Round the result to 5 decimal places
distance_rounded = round(distance, 5)
print("Mahalanobis Distance (rounded to 5 decimal places):", distance_rounded)
Output:
Mahalanobis Distance (rounded to 5 decimal places): 3.82473
Pros and Cons of Using Mahalanobis Distance
Pros:
- Accounts for Correlations: Mahalanobis distance takes into account the correlation between variables, making it more robust than Euclidean distance for multivariate data.
- Effective for Outlier Detection: It’s widely used for detecting outliers in multivariate datasets.
- Invariant to Scale: It adjusts for the scale of the data automatically.
Cons:
- Sensitive to Covariance Matrix: If the covariance matrix is singular, special care (e.g., pseudo-inverse) must be taken.
- Computationally Expensive: Calculating the inverse or pseudo-inverse can be expensive for large datasets.
- Sensitive to Noise: In noisy data, the Mahalanobis distance can become unstable, especially when the covariance matrix is poorly conditioned (a simple mitigation is sketched below).
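A common mitigation for a poorly conditioned covariance estimate is to shrink it toward the identity before inverting. Here is a minimal sketch of that idea (the function name regularizeCovariance and the default lambda are illustrative choices on our part, not a general prescription):
// Sketch: stabilize a poorly conditioned covariance estimate by adding
// a small ridge term before inversion. lambda is problem-dependent;
// 1e-6 here is only a placeholder, not a recommended value.
Eigen::MatrixXd regularizeCovariance(const Eigen::MatrixXd &covariance,
                                     double lambda = 1e-6) {
    return covariance
           + lambda * Eigen::MatrixXd::Identity(covariance.rows(), covariance.cols());
}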
Conclusion
The Mahalanobis distance is a powerful tool for measuring multivariate distances, with applications ranging from outlier detection to classification. By accounting for correlations between variables, it provides a more accurate distance measure than Euclidean distance, especially for high-dimensional datasets.
Congratulations on reading to the end of this tutorial!
For further reading on distance calculations in C++, see the article How to Calculate Hamming Distance in C++ (With C++20, Bit Shifting, Strings, and More).
Have fun and happy researching!
Suf is a senior advisor in data science with deep expertise in Natural Language Processing, Complex Networks, and Anomaly Detection. Formerly a postdoctoral research fellow, he applied advanced physics techniques to tackle real-world, data-heavy industry challenges. Before that, he was a particle physicist at the ATLAS Experiment of the Large Hadron Collider. Now, he’s focused on bringing more fun and curiosity to the world of science and research online.
