Introduction
Distance calculations are fundamental in many domains, such as machine learning, statistics, and data analysis. Mahalanobis distance stands out for its ability to measure the distance between a point and a distribution, accounting for correlations between variables. Unlike Euclidean distance, which assumes that all dimensions contribute equally, Mahalanobis distance scales dimensions based on the data’s covariance matrix, providing a more accurate distance metric for multivariate data.
Why is Mahalanobis Distance Important?
Mahalanobis distance is widely used in anomaly detection, clustering, and classification. In anomaly detection, it can identify outliers based on how far they deviate from the “center” of a distribution. The distance accounts not only for the magnitude of deviation but also for the correlations among variables, making it an effective tool when working with multidimensional data.
In this post, we will walk through:
- What the Mahalanobis Distance Is.
- Step-by-Step Calculation.
- Installing and Setting Up the Eigen Library for C++.
- Efficient Code Implementation in C++ using Eigen.
- Examples: Singular and Non-Singular Covariance Matrices.
- Pros and Cons of Using Mahalanobis Distance.
What is the Mahalanobis Distance?
The Mahalanobis distance is a measure of the distance between a point and a distribution. It accounts for the correlations between variables in the data by considering the covariance matrix, which makes it better suited than Euclidean distance to datasets where the variables are interrelated. In fact, when the covariance matrix is the identity, the Mahalanobis distance reduces to the Euclidean distance.
Mathematically, for a point [latex] x [/latex] and a distribution with mean [latex] \mu [/latex] and covariance matrix [latex] S [/latex], the Mahalanobis distance is defined as:
$$D_M(x) = \sqrt{(x - \mu)^T S^{-1} (x - \mu)}$$
Where:
- [latex] x [/latex] is the point to test.
- [latex] \mu [/latex] is the mean of the distribution.
- [latex] S^{-1} [/latex] is the inverse of the covariance matrix.
The covariance matrix [latex] S [/latex] contains the variances of the features along the diagonal and covariances between the features in the off-diagonal elements. If the covariance matrix is singular (non-invertible), we must compute a pseudo-inverse to handle the calculation.
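As a quick illustration with hypothetical numbers, take a two-dimensional distribution with mean [latex] \mu = (0, 0)^T [/latex] and test point [latex] x = (1, 2)^T [/latex], with covariance matrix and inverse
$$S = \begin{pmatrix} 4 & 2 \\ 2 & 3 \end{pmatrix}, \qquad S^{-1} = \frac{1}{8}\begin{pmatrix} 3 & -2 \\ -2 & 4 \end{pmatrix}.$$
Then
$$(x - \mu)^T S^{-1} (x - \mu) = \frac{1}{8}\begin{pmatrix} 1 & 2 \end{pmatrix}\begin{pmatrix} -1 \\ 6 \end{pmatrix} = \frac{11}{8}, \qquad D_M(x) = \sqrt{11/8} \approx 1.17.$$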
Step-by-Step Calculation of Mahalanobis Distance
To compute the Mahalanobis distance, follow these steps:
- Compute the Mean Vector: Calculate the mean of each feature in the dataset.
- Calculate the Covariance Matrix: The covariance matrix contains the variances and covariances of the features (see the formulas after this list).
- Compute the Inverse of the Covariance Matrix: This is crucial for calculating the distance. If the matrix is singular, a pseudo-inverse must be used.
- Subtract the Mean from the Point: Form the difference vector [latex] x - \mu [/latex].
- Apply the Mahalanobis Formula: Compute [latex] D_M(x) = \sqrt{(x - \mu)^T S^{-1} (x - \mu)} [/latex].
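Concretely, for a dataset of [latex] n [/latex] points [latex] x_1, \dots, x_n [/latex], steps 1 and 2 use the sample mean and the unbiased sample covariance:
$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad S = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^T$$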
Installing and Setting Up the Eigen Library for C++
We’ll use the Eigen library for matrix operations in C++. Here’s how to install and set up Eigen on different operating systems.
macOS
Using Homebrew:
brew install eigen
Manual Installation:
- Download Eigen from Eigen’s official website.
- Extract the files and place them in /usr/local/include or within your project folder.
Linux
Using APT (Debian/Ubuntu-based systems):
sudo apt-get install libeigen3-dev
Manual Installation:
- Download and extract Eigen from the official website.
- Place the files in /usr/local/include or your project folder.
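If you want to test the setup without CMake, you can compile directly by pointing the compiler at the headers. A sketch, assuming the Debian package location /usr/include/eigen3 (adjust the path for a manual install):
g++ -std=c++20 -I /usr/include/eigen3 main.cpp -o mahalanobis_example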
Windows
Using vcpkg (Microsoft’s package manager for C++):
vcpkg install eigen3
- Make sure to integrate vcpkg with your project:
./vcpkg integrate install
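Note that vcpkg integrate install hooks vcpkg into Visual Studio/MSBuild projects. For CMake builds, you instead point CMake at vcpkg’s toolchain file (replace <vcpkg-root> with your vcpkg installation directory):
cmake -B build -DCMAKE_TOOLCHAIN_FILE=<vcpkg-root>/scripts/buildsystems/vcpkg.cmake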
Manual Installation:
- Download Eigen, extract it, and place the folder in a directory (e.g., C:\eigen).
- Configure your project settings to include the Eigen folder in the compiler’s include path.
Configuring Eigen in CMake (For All OS)
Add the following lines to your CMakeLists.txt file to include Eigen:
cmake_minimum_required(VERSION 3.10)
project(mahalanobis_example)

set(CMAKE_CXX_STANDARD 20)

# Find Eigen3
find_package(Eigen3 REQUIRED)
include_directories(${EIGEN3_INCLUDE_DIR})

# Add the executable
add_executable(mahalanobis_example main.cpp)
Rebuild your project to ensure Eigen is properly linked and ready to use.
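To confirm Eigen is found before moving on, you can build a minimal sanity-check program (a throwaway main.cpp, not part of the tutorial code):
#include <iostream>
#include <Eigen/Dense>

int main() {
    // Build a 2x2 matrix and print it; if this compiles and runs, Eigen is set up correctly
    Eigen::Matrix2d m;
    m << 1.0, 2.0,
         3.0, 4.0;
    std::cout << "Eigen is working:\n" << m << std::endl;
    return 0;
}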
Code Implementation in C++ Using Eigen
Now that Eigen is set up, let’s write the code to calculate Mahalanobis distance in C++. We’ll handle both singular and non-singular covariance matrices.
#include <algorithm>
#include <cmath>
#include <iostream>
#include <limits>

#include <Eigen/Dense>

// Function to calculate the mean vector of the dataset
Eigen::VectorXd calculateMean(const Eigen::MatrixXd &data) {
    return data.colwise().mean(); // Calculate the mean of each column (feature)
}

// Function to calculate the covariance matrix of the dataset
Eigen::MatrixXd calculateCovariance(const Eigen::MatrixXd &data, const Eigen::VectorXd &mean) {
    Eigen::MatrixXd centered = data.rowwise() - mean.transpose(); // Center data by subtracting mean
    return (centered.adjoint() * centered) / double(data.rows() - 1); // Covariance matrix
}

// Function to calculate the Moore-Penrose pseudo-inverse of a matrix via SVD
Eigen::MatrixXd pseudoInverse(const Eigen::MatrixXd &matrix) {
    Eigen::JacobiSVD<Eigen::MatrixXd> svd(matrix, Eigen::ComputeThinU | Eigen::ComputeThinV);
    // Singular values below this tolerance are treated as zero
    double tolerance = std::numeric_limits<double>::epsilon() *
                       std::max(matrix.cols(), matrix.rows()) *
                       svd.singularValues().array().abs()(0);
    // Invert the singular values above the tolerance, zero the rest,
    // and recompose: S^+ = V * Sigma^+ * U^T
    return svd.matrixV() *
           (svd.singularValues().array().abs() > tolerance)
               .select(svd.singularValues().array().inverse(), 0)
               .matrix()
               .asDiagonal() *
           svd.matrixU().adjoint();
}

// Function to calculate the Mahalanobis distance
double mahalanobisDistance(const Eigen::VectorXd &x, const Eigen::VectorXd &mean,
                           const Eigen::MatrixXd &covariance) {
    Eigen::MatrixXd covInverse;
    // Compare the determinant against a small tolerance rather than exactly zero,
    // since floating-point round-off rarely produces an exact zero
    if (std::abs(covariance.determinant()) < 1e-12) {
        std::cerr << "Covariance matrix is singular, using pseudo-inverse instead." << std::endl;
        covInverse = pseudoInverse(covariance);
    } else {
        covInverse = covariance.inverse(); // Use standard inverse if covariance matrix is non-singular
    }
    Eigen::VectorXd diff = x - mean; // Difference vector (x - mean)
    return std::sqrt(diff.transpose() * covInverse * diff); // Mahalanobis distance formula
}
Explanation of the Code Implementation
This code calculates the Mahalanobis distance between a point and a distribution using matrix operations in C++. Here’s a breakdown of each part:
- Mean Vector Calculation (calculateMean): This function computes the mean of each column (feature) in the dataset, which represents the average value for each variable.
- Covariance Matrix Calculation (calculateCovariance): The covariance matrix is computed by centering the data (subtracting the mean from each row) and calculating the covariance between the variables. The covariance matrix captures the relationships between the variables in the dataset.
- Pseudo-Inverse Calculation (pseudoInverse): If the covariance matrix is singular (non-invertible), we calculate the Moore-Penrose pseudoinverse using Singular Value Decomposition (SVD), as shown in the formula after this list. This allows us to compute a meaningful distance even when the covariance matrix doesn’t have a standard inverse.
- Mahalanobis Distance Calculation (mahalanobisDistance): This function calculates the Mahalanobis distance using the formula [latex] D_M(x) = \sqrt{(x - \mu)^\top S^{-1} (x - \mu)} [/latex]. If the covariance matrix is singular, the pseudoinverse is used. Otherwise, the standard inverse is applied.
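For reference, the pseudoinverse computed by pseudoInverse follows the standard SVD construction: if [latex] S = U \Sigma V^\top [/latex], then
$$S^{+} = V \Sigma^{+} U^\top,$$
where [latex] \Sigma^{+} [/latex] is obtained by inverting every singular value above the tolerance and zeroing the rest.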
Examples: Singular and Non-Singular Covariance Matrices with SciPy Verification
We’ll now present two examples. For each, we will calculate the Mahalanobis distance in C++ and verify the result with Python’s scipy.spatial.distance.mahalanobis function.
Example 1: Singular Covariance Matrix
In this dataset the features are perfectly correlated (each column equals the first plus a constant), so the centered data has rank 1 and the covariance matrix is singular.
int main() {
    // Example dataset: Each row is a point, and each column is a feature
    Eigen::MatrixXd data(5, 3);
    data << 1.0, 2.0, 3.0,
            4.0, 5.0, 6.0,
            7.0, 8.0, 9.0,
            1.5, 2.5, 3.5,
            4.5, 5.5, 6.5;

    // Point to calculate the Mahalanobis distance for
    Eigen::VectorXd point(3);
    point << 3.0, 4.0, 5.0;

    // Step 1: Calculate the mean vector
    Eigen::VectorXd mean = calculateMean(data);

    // Step 2: Calculate the covariance matrix
    Eigen::MatrixXd covariance = calculateCovariance(data, mean);

    // Step 3: Calculate the Mahalanobis distance
    double distance = mahalanobisDistance(point, mean, covariance);

    // Output the result
    std::cout << "Mahalanobis Distance (Singular Case): " << distance << std::endl;

    return 0;
}
Output:
Covariance matrix is singular, using pseudo-inverse instead.
Mahalanobis Distance (Singular Case): 0.246494
To verify this result in Python:
import numpy as np
from scipy.spatial.distance import mahalanobis

# Example dataset
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0],
                 [7.0, 8.0, 9.0],
                 [1.5, 2.5, 3.5],
                 [4.5, 5.5, 6.5]])

# Point to test
point = np.array([3.0, 4.0, 5.0])

# Step 1: Calculate the mean vector
mean = np.mean(data, axis=0)

# Step 2: Calculate the covariance matrix
covariance = np.cov(data, rowvar=False)

# Step 3: Calculate Mahalanobis distance using scipy
inv_cov = np.linalg.pinv(covariance)  # Use pseudo-inverse
distance = mahalanobis(point, mean, inv_cov)

# Round the result to 6 decimal places
distance_rounded = round(distance, 6)

print("Mahalanobis Distance (rounded to 6 decimal places):", distance_rounded)
Output:
Mahalanobis Distance (rounded to 6 decimal places): 0.246494
This matches the C++ result.
Example 2: Non-Singular Covariance Matrix
int main() {
    // Example dataset whose features are not perfectly correlated,
    // so the covariance matrix is invertible
    Eigen::MatrixXd data(5, 3);
    data << 1.0, 2.0, 3.0,
            2.0, 3.0, 5.0,
            3.0, 6.0, 7.0,
            4.0, 8.0, 9.0,
            5.0, 9.0, 10.0;

    // Point to calculate the Mahalanobis distance for
    Eigen::VectorXd point(3);
    point << 3.5, 5.5, 6.5;

    // Step 1: Calculate the mean vector
    Eigen::VectorXd mean = calculateMean(data);

    // Step 2: Calculate the covariance matrix
    Eigen::MatrixXd covariance = calculateCovariance(data, mean);

    // Step 3: Calculate the Mahalanobis distance
    double distance = mahalanobisDistance(point, mean, covariance);

    // Output the result
    std::cout << "Mahalanobis Distance (Non-Singular Case): " << distance << std::endl;

    return 0;
}
Output:
Mahalanobis Distance (Non-Singular Case): 3.82473
To verify this result in Python:
import numpy as np
from scipy.spatial.distance import mahalanobis

# Example dataset
data = np.array([[1.0, 2.0, 3.0],
                 [2.0, 3.0, 5.0],
                 [3.0, 6.0, 7.0],
                 [4.0, 8.0, 9.0],
                 [5.0, 9.0, 10.0]])

# Point to test
point = np.array([3.5, 5.5, 6.5])

# Step 1: Calculate the mean vector
mean = np.mean(data, axis=0)

# Step 2: Calculate the covariance matrix
covariance = np.cov(data, rowvar=False)

# Step 3: Calculate Mahalanobis distance using scipy
inv_cov = np.linalg.inv(covariance)  # Use the standard inverse
distance = mahalanobis(point, mean, inv_cov)

# Round the result to 5 decimal places
distance_rounded = round(distance, 5)

print("Mahalanobis Distance (rounded to 5 decimal places):", distance_rounded)
Output:
Mahalanobis Distance (rounded to 5 decimal places): 3.82473
Pros and Cons of Using Mahalanobis Distance
Pros:
- Accounts for Correlations: Mahalanobis distance takes into account the correlation between variables, making it more robust than Euclidean distance for multivariate data.
- Effective for Outlier Detection: It’s widely used for detecting outliers in multivariate datasets.
- Invariant to Scale: It adjusts for the scale of the data automatically.
Cons:
- Sensitive to Covariance Matrix: If the covariance matrix is singular, special care (e.g., pseudo-inverse) must be taken.
- Computationally Expensive: Calculating the inverse or pseudo-inverse can be expensive for large datasets (a Cholesky-based solve, sketched after this list, avoids forming the inverse explicitly).
- Sensitive to Noise: In noisy data, the Mahalanobis distance can become unstable, especially when the covariance matrix is poorly conditioned.
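As one way to mitigate the cost, here is a minimal sketch of a Cholesky-based variant of mahalanobisDistance. It assumes the covariance matrix is symmetric positive definite and solves a linear system instead of forming the inverse:
// A sketch: solve S * y = (x - mean) via Cholesky (LLT) instead of inverting S.
// Only valid when the covariance matrix is symmetric positive definite.
double mahalanobisDistanceLLT(const Eigen::VectorXd &x, const Eigen::VectorXd &mean,
                              const Eigen::MatrixXd &covariance) {
    Eigen::LLT<Eigen::MatrixXd> llt(covariance);
    if (llt.info() != Eigen::Success) {
        // Decomposition failed: the matrix is not positive definite,
        // so fall back to the pseudo-inverse path from earlier
        return mahalanobisDistance(x, mean, covariance);
    }
    Eigen::VectorXd diff = x - mean;
    Eigen::VectorXd y = llt.solve(diff); // y = S^{-1} (x - mean) without forming S^{-1}
    return std::sqrt(diff.dot(y));
}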
Conclusion
The Mahalanobis distance is a powerful tool for measuring multivariate distances, with applications ranging from outlier detection to classification. By accounting for correlations between variables, it provides a more accurate distance measure than Euclidean distance, especially for high-dimensional datasets.
Congratulations on reading to the end of this tutorial!
For further reading on distance calculations in C++ go to the article How to Calculate Hamming Distance in C++ (With C++20, Bit Shifting, Strings, and More).
Have fun and happy researching!