Introduction
Distance calculations are fundamental in many domains, such as machine learning, statistics, and data analysis. Mahalanobis distance stands out for its ability to measure the distance between a point and a distribution, accounting for correlations between variables. Unlike Euclidean distance, which assumes that all dimensions contribute equally, Mahalanobis distance scales dimensions based on the data’s covariance matrix, providing a more accurate distance metric for multivariate data.
Why is Mahalanobis Distance Important?
Mahalanobis distance is widely used in anomaly detection, clustering, and classification. In anomaly detection, it can identify outliers based on how far they deviate from the “center” of a distribution. The distance accounts not only for the magnitude of deviation but also for the correlations among variables, making it an effective tool when working with multidimensional data.
In this post, we will walk through:
- What the Mahalanobis Distance Is.
- Step-by-Step Calculation.
- Installing and Setting Up the Eigen Library for C++.
- Efficient Code Implementation in C++ using Eigen.
- Examples: Singular and Non-Singular Covariance Matrices.
- Pros and Cons of Using Mahalanobis Distance.
What is the Mahalanobis Distance?
The Mahalanobis distance is a measure of the distance between a point and a distribution. It accounts for the correlations between variables in the data by considering the covariance matrix, which makes it better suited than Euclidean distance to datasets where the variables are interrelated. In fact, when the covariance matrix is the identity, the Mahalanobis distance reduces to the Euclidean distance.
Mathematically, for a point [latex] x [/latex] and a distribution with mean [latex] \mu [/latex] and covariance matrix [latex] S [/latex], the Mahalanobis distance is defined as:
$$D_M(x) = \sqrt{(x - \mu)^T S^{-1} (x - \mu)}$$
Where:
- [latex] x [/latex] is the point to test.
- [latex] \mu [/latex] is the mean of the distribution.
- [latex] S^{-1} [/latex] is the inverse of the covariance matrix.
The covariance matrix [latex] S [/latex] contains the variances of the features along the diagonal and covariances between the features in the off-diagonal elements. If the covariance matrix is singular (non-invertible), we must compute a pseudo-inverse to handle the calculation.
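As a quick illustration with hypothetical numbers, take a two-dimensional distribution with mean [latex] \mu = (0, 0)^T [/latex] and test point [latex] x = (1, 2)^T [/latex], with covariance matrix and inverse
$$S = \begin{pmatrix} 4 & 2 \\ 2 & 3 \end{pmatrix}, \qquad S^{-1} = \frac{1}{8}\begin{pmatrix} 3 & -2 \\ -2 & 4 \end{pmatrix}.$$
Then
$$(x - \mu)^T S^{-1} (x - \mu) = \frac{1}{8}\begin{pmatrix} 1 & 2 \end{pmatrix}\begin{pmatrix} -1 \\ 6 \end{pmatrix} = \frac{11}{8}, \qquad D_M(x) = \sqrt{11/8} \approx 1.17.$$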
Step-by-Step Calculation of Mahalanobis Distance
To compute the Mahalanobis distance, follow these steps:
- Compute the Mean Vector: Calculate the mean of each feature in the dataset.
- Calculate the Covariance Matrix: The covariance matrix contains the variances and covariances of the features (see the formulas after this list).
- Compute the Inverse of the Covariance Matrix: This is crucial for calculating the distance. If the matrix is singular, a pseudo-inverse must be used.
- Subtract the Mean from the Point: Form the difference vector [latex] x - \mu [/latex].
- Apply the Mahalanobis Formula: Compute [latex] D_M(x) = \sqrt{(x - \mu)^T S^{-1} (x - \mu)} [/latex].
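Concretely, for a dataset of [latex] n [/latex] points [latex] x_1, \dots, x_n [/latex], steps 1 and 2 use the sample mean and the unbiased sample covariance:
$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad S = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^T$$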
Installing and Setting Up the Eigen Library for C++
We’ll use the Eigen library for matrix operations in C++. Here’s how to install and set up Eigen on different operating systems.
macOS
Using Homebrew:
brew install eigen
Manual Installation:
- Download Eigen from Eigen’s official website.
- Extract the files and place them in /usr/local/include or within your project folder.
Linux
Using APT (Debian/Ubuntu-based systems):
sudo apt-get install libeigen3-dev
Manual Installation:
- Download and extract Eigen from the official website.
- Place the files in /usr/local/include or your project folder.
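If you want to test the setup without CMake, you can compile directly by pointing the compiler at the headers. A sketch, assuming the Debian package location /usr/include/eigen3 (adjust the path for a manual install):
g++ -std=c++20 -I /usr/include/eigen3 main.cpp -o mahalanobis_example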
Windows
Using vcpkg (Microsoft’s package manager for C++):
vcpkg install eigen3
- Make sure to integrate vcpkg with your project:
./vcpkg integrate install
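Note that vcpkg integrate install hooks vcpkg into Visual Studio/MSBuild projects. For CMake builds, you instead point CMake at vcpkg’s toolchain file (replace <vcpkg-root> with your vcpkg installation directory):
cmake -B build -DCMAKE_TOOLCHAIN_FILE=<vcpkg-root>/scripts/buildsystems/vcpkg.cmake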
Manual Installation:
- Download Eigen, extract it, and place the folder in a directory (e.g., C:\eigen).
- Configure your project settings to include the Eigen folder in the compiler’s include path.
Configuring Eigen in CMake (For All OS)
Add the following lines to your CMakeLists.txt file to include Eigen:
cmake_minimum_required(VERSION 3.10)
project(mahalanobis_example)

set(CMAKE_CXX_STANDARD 20)

# Find Eigen3
find_package(Eigen3 REQUIRED)
include_directories(${EIGEN3_INCLUDE_DIR})

# Add the executable
add_executable(mahalanobis_example main.cpp)
Rebuild your project to ensure Eigen is properly linked and ready to use.
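To confirm Eigen is found before moving on, you can build a minimal sanity-check program (a throwaway main.cpp, not part of the tutorial code):
#include <iostream>
#include <Eigen/Dense>

int main() {
    // Build a 2x2 matrix and print it; if this compiles and runs, Eigen is set up correctly
    Eigen::Matrix2d m;
    m << 1.0, 2.0,
         3.0, 4.0;
    std::cout << "Eigen is working:\n" << m << std::endl;
    return 0;
}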
Code Implementation in C++ Using Eigen
Now that Eigen is set up, let’s write the code to calculate Mahalanobis distance in C++. We’ll handle both singular and non-singular covariance matrices.
#include <algorithm>
#include <cmath>
#include <iostream>
#include <limits>

#include <Eigen/Dense>

// Function to calculate the mean vector of the dataset
Eigen::VectorXd calculateMean(const Eigen::MatrixXd &data) {
    return data.colwise().mean(); // Calculate the mean of each column (feature)
}

// Function to calculate the covariance matrix of the dataset
Eigen::MatrixXd calculateCovariance(const Eigen::MatrixXd &data, const Eigen::VectorXd &mean) {
    Eigen::MatrixXd centered = data.rowwise() - mean.transpose(); // Center data by subtracting mean
    return (centered.adjoint() * centered) / double(data.rows() - 1); // Covariance matrix
}

// Function to calculate the Moore-Penrose pseudo-inverse of a matrix via SVD
Eigen::MatrixXd pseudoInverse(const Eigen::MatrixXd &matrix) {
    Eigen::JacobiSVD<Eigen::MatrixXd> svd(matrix, Eigen::ComputeThinU | Eigen::ComputeThinV);
    // Singular values below this tolerance are treated as zero
    double tolerance = std::numeric_limits<double>::epsilon() *
                       std::max(matrix.cols(), matrix.rows()) *
                       svd.singularValues().array().abs()(0);
    // Invert the singular values above the tolerance, zero the rest,
    // and recompose: S^+ = V * Sigma^+ * U^T
    return svd.matrixV() *
           (svd.singularValues().array().abs() > tolerance)
               .select(svd.singularValues().array().inverse(), 0)
               .matrix()
               .asDiagonal() *
           svd.matrixU().adjoint();
}

// Function to calculate the Mahalanobis distance
double mahalanobisDistance(const Eigen::VectorXd &x, const Eigen::VectorXd &mean,
                           const Eigen::MatrixXd &covariance) {
    Eigen::MatrixXd covInverse;
    // Compare the determinant against a small tolerance rather than exactly zero,
    // since floating-point round-off rarely produces an exact zero
    if (std::abs(covariance.determinant()) < 1e-12) {
        std::cerr << "Covariance matrix is singular, using pseudo-inverse instead." << std::endl;
        covInverse = pseudoInverse(covariance);
    } else {
        covInverse = covariance.inverse(); // Use standard inverse if covariance matrix is non-singular
    }
    Eigen::VectorXd diff = x - mean; // Difference vector (x - mean)
    return std::sqrt(diff.transpose() * covInverse * diff); // Mahalanobis distance formula
}
Explanation of the Code Implementation
This code calculates the Mahalanobis distance between a point and a distribution using matrix operations in C++. Here’s a breakdown of each part:
- Mean Vector Calculation (calculateMean): This function computes the mean of each column (feature) in the dataset, which represents the average value for each variable.
- Covariance Matrix Calculation (calculateCovariance): The covariance matrix is computed by centering the data (subtracting the mean from each row) and calculating the covariance between the variables. The covariance matrix captures the relationships between the variables in the dataset.
- Pseudo-Inverse Calculation (pseudoInverse): If the covariance matrix is singular (non-invertible), we calculate the Moore-Penrose pseudoinverse using Singular Value Decomposition (SVD), as shown in the formula after this list. This allows us to compute a meaningful distance even when the covariance matrix doesn’t have a standard inverse.
- Mahalanobis Distance Calculation (mahalanobisDistance): This function calculates the Mahalanobis distance using the formula [latex] D_M(x) = \sqrt{(x - \mu)^\top S^{-1} (x - \mu)} [/latex]. If the covariance matrix is singular, the pseudoinverse is used. Otherwise, the standard inverse is applied.
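For reference, the pseudoinverse computed by pseudoInverse follows the standard SVD construction: if [latex] S = U \Sigma V^\top [/latex], then
$$S^{+} = V \Sigma^{+} U^\top,$$
where [latex] \Sigma^{+} [/latex] is obtained by inverting every singular value above the tolerance and zeroing the rest.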
Examples: Singular and Non-Singular Covariance Matrices with SciPy Verification
We’ll now present two examples. For each, we will calculate the Mahalanobis distance in C++ and verify the result with Python’s scipy.spatial.distance.mahalanobis function.
Example 1: Singular Covariance Matrix
In this dataset the features are perfectly correlated (each column equals the first plus a constant), so the centered data has rank 1 and the covariance matrix is singular.
int main() {
    // Example dataset: Each row is a point, and each column is a feature
    Eigen::MatrixXd data(5, 3);
    data << 1.0, 2.0, 3.0,
            4.0, 5.0, 6.0,
            7.0, 8.0, 9.0,
            1.5, 2.5, 3.5,
            4.5, 5.5, 6.5;

    // Point to calculate the Mahalanobis distance for
    Eigen::VectorXd point(3);
    point << 3.0, 4.0, 5.0;

    // Step 1: Calculate the mean vector
    Eigen::VectorXd mean = calculateMean(data);

    // Step 2: Calculate the covariance matrix
    Eigen::MatrixXd covariance = calculateCovariance(data, mean);

    // Step 3: Calculate the Mahalanobis distance
    double distance = mahalanobisDistance(point, mean, covariance);

    // Output the result
    std::cout << "Mahalanobis Distance (Singular Case): " << distance << std::endl;

    return 0;
}
Output:
Covariance matrix is singular, using pseudo-inverse instead.
Mahalanobis Distance (Singular Case): 0.246494
To verify this result in Python:
import numpy as np
from scipy.spatial.distance import mahalanobis

# Example dataset
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0],
                 [7.0, 8.0, 9.0],
                 [1.5, 2.5, 3.5],
                 [4.5, 5.5, 6.5]])

# Point to test
point = np.array([3.0, 4.0, 5.0])

# Step 1: Calculate the mean vector
mean = np.mean(data, axis=0)

# Step 2: Calculate the covariance matrix
covariance = np.cov(data, rowvar=False)

# Step 3: Calculate Mahalanobis distance using scipy
inv_cov = np.linalg.pinv(covariance)  # Use pseudo-inverse
distance = mahalanobis(point, mean, inv_cov)

# Round the result to 6 decimal places
distance_rounded = round(distance, 6)

print("Mahalanobis Distance (rounded to 6 decimal places):", distance_rounded)
Output:
Mahalanobis Distance (rounded to 6 decimal places): 0.246494
This matches the C++ result.
Example 2: Non-Singular Covariance Matrix
int main() {
    // Example dataset whose features are not perfectly correlated,
    // so the covariance matrix is invertible
    Eigen::MatrixXd data(5, 3);
    data << 1.0, 2.0, 3.0,
            2.0, 3.0, 5.0,
            3.0, 6.0, 7.0,
            4.0, 8.0, 9.0,
            5.0, 9.0, 10.0;

    // Point to calculate the Mahalanobis distance for
    Eigen::VectorXd point(3);
    point << 3.5, 5.5, 6.5;

    // Step 1: Calculate the mean vector
    Eigen::VectorXd mean = calculateMean(data);

    // Step 2: Calculate the covariance matrix
    Eigen::MatrixXd covariance = calculateCovariance(data, mean);

    // Step 3: Calculate the Mahalanobis distance
    double distance = mahalanobisDistance(point, mean, covariance);

    // Output the result
    std::cout << "Mahalanobis Distance (Non-Singular Case): " << distance << std::endl;

    return 0;
}
Output:
Mahalanobis Distance (Non-Singular Case): 3.82473
To verify this result in Python:
import numpy as np
from scipy.spatial.distance import mahalanobis

# Example dataset
data = np.array([[1.0, 2.0, 3.0],
                 [2.0, 3.0, 5.0],
                 [3.0, 6.0, 7.0],
                 [4.0, 8.0, 9.0],
                 [5.0, 9.0, 10.0]])

# Point to test
point = np.array([3.5, 5.5, 6.5])

# Step 1: Calculate the mean vector
mean = np.mean(data, axis=0)

# Step 2: Calculate the covariance matrix
covariance = np.cov(data, rowvar=False)

# Step 3: Calculate Mahalanobis distance using scipy
inv_cov = np.linalg.inv(covariance)  # Use the standard inverse
distance = mahalanobis(point, mean, inv_cov)

# Round the result to 5 decimal places
distance_rounded = round(distance, 5)

print("Mahalanobis Distance (rounded to 5 decimal places):", distance_rounded)
Output:
Mahalanobis Distance (rounded to 5 decimal places): 3.82473
Pros and Cons of Using Mahalanobis Distance
Pros:
- Accounts for Correlations: Mahalanobis distance takes into account the correlation between variables, making it more robust than Euclidean distance for multivariate data.
- Effective for Outlier Detection: It’s widely used for detecting outliers in multivariate datasets.
- Invariant to Scale: It adjusts for the scale of the data automatically.
Cons:
- Sensitive to Covariance Matrix: If the covariance matrix is singular, special care (e.g., pseudo-inverse) must be taken.
- Computationally Expensive: Calculating the inverse or pseudo-inverse can be expensive for large datasets (a Cholesky-based solve, sketched after this list, avoids forming the inverse explicitly).
- Sensitive to Noise: In noisy data, the Mahalanobis distance can become unstable, especially when the covariance matrix is poorly conditioned.
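As one way to mitigate the cost, here is a minimal sketch of a Cholesky-based variant of mahalanobisDistance. It assumes the covariance matrix is symmetric positive definite and solves a linear system instead of forming the inverse:
// A sketch: solve S * y = (x - mean) via Cholesky (LLT) instead of inverting S.
// Only valid when the covariance matrix is symmetric positive definite.
double mahalanobisDistanceLLT(const Eigen::VectorXd &x, const Eigen::VectorXd &mean,
                              const Eigen::MatrixXd &covariance) {
    Eigen::LLT<Eigen::MatrixXd> llt(covariance);
    if (llt.info() != Eigen::Success) {
        // Decomposition failed: the matrix is not positive definite,
        // so fall back to the pseudo-inverse path from earlier
        return mahalanobisDistance(x, mean, covariance);
    }
    Eigen::VectorXd diff = x - mean;
    Eigen::VectorXd y = llt.solve(diff); // y = S^{-1} (x - mean) without forming S^{-1}
    return std::sqrt(diff.dot(y));
}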
Conclusion
The Mahalanobis distance is a powerful tool for measuring multivariate distances, with applications ranging from outlier detection to classification. By accounting for correlations between variables, it provides a more accurate distance measure than Euclidean distance, especially for high-dimensional datasets.
Congratulations on reading to the end of this tutorial!
For further reading on distance calculations in C++ go to the article How to Calculate Hamming Distance in C++ (With C++20, Bit Shifting, Strings, and More).
Have fun and happy researching!