How To Do Matrix Multiplication in C++


In this guide, we’ll explore efficient methods for implementing matrix multiplication in C++. We’ll cover various approaches, from basic implementations to cache-optimized and parallel algorithms.

Introduction

Matrix multiplication is a fundamental operation in linear algebra with applications spanning from computer graphics to machine learning. While the basic algorithm is straightforward, implementing it efficiently requires careful consideration of memory access patterns and CPU architecture.

📚 Quick Reference

  • Matrix Multiplication: An operation that produces a matrix by taking the dot product of rows and columns from two matrices.
  • Transpose: A matrix obtained by swapping rows with columns in the original matrix.
  • Dot Product: A scalar resulting from multiplying corresponding elements of two vectors and summing the results.
  • Cache Optimization: A technique to enhance memory access speed by processing data in smaller blocks that fit into the CPU cache.
  • Parallel Computing: A method of splitting a computation across multiple threads or processors to reduce execution time.
  • OpenMP: An API that supports multi-platform shared-memory parallelism in C++, making multi-threading easier to implement.
  • Eigen Library: A C++ library for linear algebra, including efficient implementations of matrix operations and solvers.
  • Time Complexity: A measure of the computational efficiency of an algorithm in terms of the number of operations as input size grows.
  • Vectorization: A technique that uses CPU instructions to perform operations on multiple data points simultaneously for faster computation.
  • Benchmarking: The process of measuring the performance of code, often by comparing execution times of different implementations.

Mathematical Background

Matrix multiplication is a cornerstone operation in linear algebra. To multiply two matrices, the number of columns in the first matrix must match the number of rows in the second matrix.

Given two matrices:

  • A: A matrix of size \(m \times n\), where \(m\) is the number of rows and \(n\) is the number of columns.
  • B: A matrix of size \(n \times p\), where \(n\) is the number of rows and \(p\) is the number of columns.

The resulting matrix C will have dimensions \(m \times p\), and each element \(C_{ij}\) in this matrix is computed as:

\[ C_{ij} = \sum_{k=1}^n A_{ik} \cdot B_{kj} \]

This formula means that each element of the resulting matrix is the dot product of the corresponding row in matrix A and the corresponding column in matrix B.

Example: Multiplying Two Matrices

Let’s visualize this with a concrete example:

\[ A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}, \quad B = \begin{bmatrix} 7 & 8 \\ 9 & 10 \\ 11 & 12 \end{bmatrix} \]

To compute the resulting matrix \(C\) of size \(2 \times 2\):

\[ C = A \times B = \begin{bmatrix} 1 \cdot 7 + 2 \cdot 9 + 3 \cdot 11 & 1 \cdot 8 + 2 \cdot 10 + 3 \cdot 12 \\ 4 \cdot 7 + 5 \cdot 9 + 6 \cdot 11 & 4 \cdot 8 + 5 \cdot 10 + 6 \cdot 12 \end{bmatrix} = \begin{bmatrix} 58 & 64 \\ 139 & 154 \end{bmatrix} \]

Here, each element of matrix \(C\) is calculated using the formula. For example:

  • \(C_{11} = (1 \cdot 7) + (2 \cdot 9) + (3 \cdot 11) = 58\)
  • \(C_{12} = (1 \cdot 8) + (2 \cdot 10) + (3 \cdot 12) = 64\)
  • \(C_{21} = (4 \cdot 7) + (5 \cdot 9) + (6 \cdot 11) = 139\)
  • \(C_{22} = (4 \cdot 8) + (5 \cdot 10) + (6 \cdot 12) = 154\)

Tip: Always verify that the number of columns in the first matrix matches the number of rows in the second matrix before performing multiplication.

Understanding this mathematical foundation is crucial for implementing matrix multiplication in code. In the next section, we’ll start with a basic implementation in C++ to bring this concept to life.

Basic Implementation

The basic implementation of matrix multiplication in C++ involves three nested loops:

  • The outer loop iterates over the rows of the first matrix (\(A\)).
  • The middle loop iterates over the columns of the second matrix (\(B\)).
  • The innermost loop computes the dot product for each element in the resulting matrix (\(C\)).

While this implementation is straightforward and intuitive, it can be computationally expensive for large matrices due to its \(O(m \cdot n \cdot p)\) time complexity.
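
To put this in perspective, multiplying two square \(1000 \times 1000\) matrices (the size used in the benchmarks later in this post) requires

\[ 1000 \cdot 1000 \cdot 1000 = 10^9 \]

multiply-add operations, which is why the optimizations in the following sections matter for large inputs.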

Basic Matrix Multiplication

#include <iostream>
#include <vector>
#include <stdexcept> // For exception handling

// Matrix class to represent a 2D matrix
class Matrix {
    std::vector<std::vector<double>> data; // 2D vector to store matrix elements
    size_t rows, cols; // Number of rows and columns

public:
    // Constructor to initialize a matrix of size rows x cols with zeros
    Matrix(size_t r, size_t c) : rows(r), cols(c) {
        data.resize(r, std::vector<double>(c, 0.0)); // Fill with 0.0
    }

    // Static function to multiply two matrices A and B
    static Matrix multiply(const Matrix& A, const Matrix& B) {
        // Check if the dimensions are compatible for multiplication
        if (A.cols != B.rows) {
            throw std::invalid_argument("Matrix dimensions don't match");
        }

        // Initialize the result matrix C with dimensions A.rows x B.cols
        Matrix C(A.rows, B.cols);

        // Perform the matrix multiplication
        for (size_t i = 0; i < A.rows; i++) { // Loop over rows of A
            for (size_t j = 0; j < B.cols; j++) { // Loop over columns of B
                for (size_t k = 0; k < A.cols; k++) { // Loop over columns of A/rows of B
                    C.data[i][j] += A.data[i][k] * B.data[k][j]; // Accumulate the dot product
                }
            }
        }
        return C; // Return the resulting matrix
    }

    // Utility function to print the matrix (optional)
    void print() const {
        for (const auto& row : data) {
            for (double val : row) {
                std::cout << val << " ";
            }
            std::cout << "\n";
        }
    }

    // Friend function to allow direct access to the data for this example
    friend void populateMatrix(Matrix& matrix, const std::vector<std::vector<double>>& data);
};

// Function to populate a matrix with example values
void populateMatrix(Matrix& matrix, const std::vector<std::vector<double>>& data) {
    matrix.data = data;
}
        

Here’s a breakdown of what each part of the code does:

  • Constructor: Initializes a matrix with given dimensions and fills it with zeros. This is useful for creating the result matrix (\(C\)).
  • Dimension Check: Ensures that the number of columns in \(A\) matches the number of rows in \(B\), as required for matrix multiplication.
  • Triple Nested Loops: Implements the mathematical formula for matrix multiplication: \[ C_{ij} = \sum_{k=1}^n A_{ik} \cdot B_{kj} \]
  • Matrix Storage: Uses a 2D vector to store the elements of the matrix. This makes it easy to access rows and columns during computation.
  • Error Handling: Throws an exception if the matrices are not compatible for multiplication.

Example Usage

Let’s demonstrate how to use this class to multiply two matrices:

Matrix Multiplication Example

int main() {
    // Define two matrices A and B
    Matrix A(2, 3); // 2x3 matrix
    Matrix B(3, 2); // 3x2 matrix

    // Fill matrices with example values
    populateMatrix(A, {{1, 2, 3}, {4, 5, 6}});
    populateMatrix(B, {{7, 8}, {9, 10}, {11, 12}});

    // Multiply A and B
    Matrix C = Matrix::multiply(A, B);

    // Print the result
    std::cout << "Resultant Matrix C:\n";
    C.print();

    return 0;
}
        

Resultant Matrix C:
58 64
139 154
      

Warning: This basic implementation is not optimized for large matrices. For performance-critical applications, consider using the optimized or parallelized versions described later.

This basic implementation is a great starting point for understanding matrix multiplication, but it can be improved in terms of performance and memory usage. In the next section, we’ll explore cache-optimized matrix multiplication techniques.

Cache-Optimized Implementation

The basic implementation of matrix multiplication is straightforward but not efficient for large matrices. Modern CPUs are designed with cache memory to improve performance, and optimizing for cache can drastically reduce computation time.

Cache optimization involves dividing the matrices into smaller blocks (also known as tiling or blocking) that fit into the CPU cache. This reduces the number of cache misses and ensures faster memory access.

Cache-Friendly Matrix Multiplication

#include <iostream>
#include <vector>
#include <stdexcept> // For exception handling
#include <algorithm> // For std::min

// Matrix class to represent a 2D matrix
class Matrix {
private:
    std::vector<std::vector<double>> data; // 2D vector to store matrix elements
    size_t rows, cols; // Number of rows and columns

public:
    // Constructor to initialize a matrix of size rows x cols with zeros
    Matrix(size_t r, size_t c) : rows(r), cols(c) {
        data.resize(r, std::vector<double>(c, 0.0)); // Fill with 0.0
    }

    // Static function to multiply two matrices A and B
    static Matrix multiply(const Matrix& A, const Matrix& B) {
        // Check if the dimensions are compatible for multiplication
        if (A.cols != B.rows) {
            throw std::invalid_argument("Matrix dimensions don't match");
        }

        // Initialize the result matrix C with dimensions A.rows x B.cols
        Matrix C(A.rows, B.cols);

        // Perform the matrix multiplication
        for (size_t i = 0; i < A.rows; i++) { // Loop over rows of A
            for (size_t j = 0; j < B.cols; j++) { // Loop over columns of B
                for (size_t k = 0; k < A.cols; k++) { // Loop over columns of A/rows of B
                    C.data[i][j] += A.data[i][k] * B.data[k][j]; // Accumulate the dot product
                }
            }
        }
        return C; // Return the resulting matrix
    }

    // Cache-optimized multiplication using blocking
    static Matrix multiplyOptimized(const Matrix& A, const Matrix& B) {
        constexpr size_t BLOCK_SIZE = 32; // Tuning parameter based on CPU cache size
        if (A.cols != B.rows) {
            throw std::invalid_argument("Matrix dimensions don't match");
        }

        Matrix C(A.rows, B.cols);

        for (size_t i0 = 0; i0 < A.rows; i0 += BLOCK_SIZE) {
            for (size_t j0 = 0; j0 < B.cols; j0 += BLOCK_SIZE) {
                for (size_t k0 = 0; k0 < A.cols; k0 += BLOCK_SIZE) {
                    for (size_t i = i0; i < std::min(i0 + BLOCK_SIZE, A.rows); i++) {
                        for (size_t j = j0; j < std::min(j0 + BLOCK_SIZE, B.cols); j++) {
                            double sum = C.data[i][j];
                            for (size_t k = k0; k < std::min(k0 + BLOCK_SIZE, A.cols); k++) {
                                sum += A.data[i][k] * B.data[k][j];
                            }
                            C.data[i][j] = sum;
                        }
                    }
                }
            }
        }
        return C;
    }

    // Utility function to print the matrix
    void print() const {
        for (const auto& row : data) {
            for (double val : row) {
                std::cout << val << " ";
            }
            std::cout << "\n";
        }
    }

    // Friend function to allow direct access to the data for this example
    friend void populateMatrix(Matrix& matrix, const std::vector<std::vector<double>>& data);
};

// Function to populate a matrix with example values
void populateMatrix(Matrix& matrix, const std::vector<std::vector<double>>& data) {
    matrix.data = data;
}

int main() {
    // Define two matrices A and B
    Matrix A(4, 4); // 4x4 matrix
    Matrix B(4, 4); // 4x4 matrix

    // Fill matrices with example values
    populateMatrix(A, {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}, {13, 14, 15, 16}});
    populateMatrix(B, {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}, {13, 14, 15, 16}});

    // Perform cache-optimized multiplication
    Matrix C = Matrix::multiplyOptimized(A, B);

    // Print the result
    std::cout << "Resultant Matrix C:\n";
    C.print();

    return 0;
}
        

Resultant Matrix C:
90 100 110 120
202 228 254 280
314 356 398 440
426 484 542 600
        

The key difference between the basic and cache-optimized implementation is the use of blocking, where smaller submatrices (blocks) are processed at a time to make better use of the CPU cache.

In this example, the block size is set to 32, which is a typical size that aligns well with the L1 cache of most CPUs. However, the ideal block size can vary depending on the hardware.
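
To see why a block size of 32 is a reasonable starting point: a \(32 \times 32\) block of doubles occupies \(32 \cdot 32 \cdot 8 = 8192\) bytes (8 KB), so the three blocks of \(A\), \(B\), and \(C\) that are active at any moment take roughly 24 KB, which fits comfortably in a typical 32 KB L1 data cache.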

Tip: Experiment with the block size to find the optimal value for your specific CPU. Common sizes are 32, 64, or 128.
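
If you want to measure the effect of the block size on your own machine, the following is a minimal, self-contained sketch. It uses flat, row-major std::vector<double> storage and a runtime block parameter instead of the constexpr BLOCK_SIZE above; the function name multiplyBlocked, the 512×512 matrix size, and the candidate block sizes are illustrative choices, not part of the implementation shown earlier.

Block Size Benchmark (Sketch)

#include <algorithm> // For std::min
#include <chrono>    // For timing
#include <iostream>
#include <vector>

// Multiply two square n x n matrices stored row-major in flat vectors,
// using a block size chosen at runtime.
std::vector<double> multiplyBlocked(const std::vector<double>& A,
                                    const std::vector<double>& B,
                                    size_t n, size_t block) {
    std::vector<double> C(n * n, 0.0);
    for (size_t i0 = 0; i0 < n; i0 += block)
        for (size_t k0 = 0; k0 < n; k0 += block)
            for (size_t j0 = 0; j0 < n; j0 += block)
                for (size_t i = i0; i < std::min(i0 + block, n); ++i)
                    for (size_t k = k0; k < std::min(k0 + block, n); ++k) {
                        double a = A[i * n + k]; // reused across the j loop
                        for (size_t j = j0; j < std::min(j0 + block, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
    return C;
}

int main() {
    const size_t n = 512; // arbitrary test size; increase for clearer differences
    std::vector<double> A(n * n, 1.0), B(n * n, 2.0);

    const size_t blocks[] = {16, 32, 64, 128};
    for (size_t block : blocks) {
        auto start = std::chrono::high_resolution_clock::now();
        auto C = multiplyBlocked(A, B, n, block);
        auto end = std::chrono::high_resolution_clock::now();
        std::cout << "Block size " << block << ": "
                  << std::chrono::duration<double>(end - start).count()
                  << " seconds (C[0] = " << C[0] << ")\n";
    }
    return 0;
}

Compile with optimizations enabled (for example -O3) so the timings reflect what the compiler can actually do with each block size.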

In the next section, we will explore parallelizing matrix multiplication to take full advantage of modern multicore processors.

Parallel Implementation

With modern multicore processors, parallelization can significantly improve the performance of matrix multiplication. By dividing the computation across multiple threads, we can utilize all available CPU cores to perform operations simultaneously.

In this implementation, each thread handles a subset of rows from the resulting matrix \(C\). Using the standard library’s thread support, we can efficiently parallelize the nested loop structure.

Parallel Matrix Multiplication

#include <iostream>
#include <vector>
#include <stdexcept> // For exception handling
#include <thread> // For multithreading
#include <algorithm> // For std::min

// Matrix class to represent a 2D matrix
class Matrix {
private:
    std::vector<std::vector<double>> data; // 2D vector to store matrix elements
    size_t rows, cols; // Number of rows and columns

public:
    // Constructor to initialize a matrix of size rows x cols with zeros
    Matrix(size_t r, size_t c) : rows(r), cols(c) {
        data.resize(r, std::vector<double>(c, 0.0)); // Fill with 0.0
    }

    // Static function to multiply two matrices A and B in parallel
    static Matrix multiplyParallel(const Matrix& A, const Matrix& B) {
        if (A.cols != B.rows) {
            throw std::invalid_argument("Matrix dimensions don't match");
        }

        Matrix C(A.rows, B.cols);
        size_t num_threads = std::thread::hardware_concurrency(); // Get number of threads supported
        if (num_threads == 0) num_threads = 1; // hardware_concurrency() may return 0 if it cannot be determined
        std::vector<std::thread> threads;

        // Worker function for each thread
        auto worker = [&](size_t start_row, size_t end_row) {
            for (size_t i = start_row; i < end_row; i++) {
                for (size_t j = 0; j < B.cols; j++) {
                    double sum = 0.0;
                    for (size_t k = 0; k < A.cols; k++) {
                        sum += A.data[i][k] * B.data[k][j];
                    }
                    C.data[i][j] = sum;
                }
            }
        };

        // Divide rows among threads
        size_t rows_per_thread = A.rows / num_threads;
        for (size_t i = 0; i < num_threads; i++) {
            size_t start_row = i * rows_per_thread;
            size_t end_row = (i == num_threads - 1) ? A.rows : (i + 1) * rows_per_thread;
            threads.emplace_back(worker, start_row, end_row);
        }

        // Join threads
        for (auto& thread : threads) {
            thread.join();
        }

        return C;
    }

    // Utility function to print the matrix
    void print() const {
        for (const auto& row : data) {
            for (double val : row) {
                std::cout << val << " ";
            }
            std::cout << "\n";
        }
    }

    // Friend function to allow direct access to the data for this example
    friend void populateMatrix(Matrix& matrix, const std::vector<std::vector<double>>& data);
};

// Function to populate a matrix with example values
void populateMatrix(Matrix& matrix, const std::vector<std::vector<double>>& data) {
    matrix.data = data;
}

int main() {
    // Define two matrices A and B
    Matrix A(4, 4); // 4x4 matrix
    Matrix B(4, 4); // 4x4 matrix

    // Fill matrices with example values
    populateMatrix(A, {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}, {13, 14, 15, 16}});
    populateMatrix(B, {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}, {13, 14, 15, 16}});

    // Perform parallel matrix multiplication
    Matrix C = Matrix::multiplyParallel(A, B);

    // Print the result
    std::cout << "Resultant Matrix C:\n";
    C.print();

    return 0;
}
        

Resultant Matrix C:
90 100 110 120
202 228 254 280
314 356 398 440
426 484 542 600
         

The parallel implementation works by dividing the rows of matrix \(C\) evenly among all available threads. Each thread is responsible for computing the dot products for a specific range of rows.

Tip: Ensure the number of threads is appropriate for your hardware. Using more threads than available CPU cores can lead to performance degradation due to context switching.

Warning: For small matrices, the overhead of thread creation and management may outweigh the performance gains from parallelization.
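
One way to address both points is to clamp the thread count before dividing the work. The helper below is a hedged sketch rather than part of the implementation above; the function name chooseThreadCount and the threshold of 64 rows per thread are illustrative choices you would tune for your workload.

#include <algorithm> // For std::min and std::max
#include <thread>    // For std::thread::hardware_concurrency

// Pick a thread count that never exceeds the amount of work available and
// never drops to zero (hardware_concurrency() may return 0 if it is unknown).
size_t chooseThreadCount(size_t rows, size_t min_rows_per_thread = 64) {
    size_t hw = std::thread::hardware_concurrency();
    if (hw == 0) hw = 1;                                      // unknown -> assume one core
    size_t by_work = std::max<size_t>(1, rows / min_rows_per_thread);
    return std::min(hw, by_work);                             // avoid spawning idle threads
}

For example, calling chooseThreadCount(A.rows) inside multiplyParallel instead of using hardware_concurrency() directly would keep a 4×4 multiplication single-threaded while still using all cores for large matrices.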

This parallel implementation demonstrates how multithreading can leverage the full power of modern CPUs. In the following sections, we'll look at the Eigen library and then compare all the implementations and discuss their trade-offs.

Matrix Multiplication with Eigen

Eigen is a powerful C++ library for linear algebra, matrix manipulation, and numerical computations. It provides a simple and efficient way to perform matrix operations, including multiplication. In this section, we’ll demonstrate how to multiply two 4×4 matrices using Eigen with the given data:

Matrix Multiplication with Eigen

#include <iostream>
#include <Eigen/Dense> // Include Eigen library

int main() {
    // Define two 4x4 matrices
    Eigen::Matrix4d A, B;

    // Initialize matrices with the given values
    A << 1, 2, 3, 4,
         5, 6, 7, 8,
         9, 10, 11, 12,
         13, 14, 15, 16;

    B << 1, 2, 3, 4,
         5, 6, 7, 8,
         9, 10, 11, 12,
         13, 14, 15, 16;

    // Perform matrix multiplication
    Eigen::Matrix4d C = A * B;

    // Output the result
    std::cout << "Matrix A:\n" << A << "\n\n";
    std::cout << "Matrix B:\n" << B << "\n\n";
    std::cout << "Matrix C (A * B):\n" << C << "\n";

    return 0;
}
        

Output

When you run the above code, you will see the following output:

Matrix A:
1  2  3  4
5  6  7  8
9 10 11 12
13 14 15 16

Matrix B:
1  2  3  4
5  6  7  8
9 10 11 12
13 14 15 16

Matrix C (A * B):
90   100  110  120
202  228  254  280
314  356  398  440
426  484  542  600
        

Explanation

In this example:

  • Matrix Initialization: The << operator is used to assign values row-by-row to Eigen matrices.
  • Matrix Multiplication: Eigen provides an overloaded * operator for matrix multiplication, simplifying the syntax.
  • Matrix Printing: Eigen supports direct printing of matrices to the standard output using std::cout.

Advantages of Using Eigen

Eigen is designed for both ease of use and high performance. Some of the key advantages include:

  • Optimized Performance: Eigen internally optimizes operations for speed and efficiency, utilizing techniques like lazy evaluation and vectorization.
  • Readability: The syntax for matrix operations is clean and intuitive, making it easier to write and debug code.
  • Flexibility: Eigen supports a wide range of matrix sizes, types (e.g., dense, sparse), and arithmetic precision (e.g., float, double).

Tip: To use Eigen, download the library from the Eigen website and include its directory in your project. Eigen is header-only, so no linking is required.
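
As a quick illustration of that flexibility, here is a hedged sketch mixing dynamically sized double-precision matrices with a fixed-size single-precision matrix; the sizes and values are arbitrary and chosen only for the example.

#include <iostream>
#include <Eigen/Dense> // Include Eigen library

int main() {
    // Dynamically sized double-precision matrices (dimensions set at runtime)
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(3, 4);
    Eigen::MatrixXd B = Eigen::MatrixXd::Random(4, 2);
    Eigen::MatrixXd C = A * B; // 3x2 result

    // Fixed-size single-precision matrix and vector
    Eigen::Matrix3f F = Eigen::Matrix3f::Identity();
    Eigen::Vector3f v(1.0f, 2.0f, 3.0f);
    Eigen::Vector3f w = F * v; // matrix-vector product

    std::cout << "C (3x2):\n" << C << "\n\nw: " << w.transpose() << "\n";
    return 0;
}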

Performance Comparison

To evaluate the performance of different implementations for matrix multiplication, we benchmarked the following methods:

  • Basic Multiplication: A straightforward nested-loop implementation.
  • Cache-Optimized Multiplication: A block-based approach to improve cache utilization.
  • Parallel Multiplication: A multi-threaded implementation using standard threads.
  • Eigen (Single-threaded): A matrix multiplication using the Eigen library with threading disabled.
  • Eigen (Multi-threaded with OpenMP): Eigen's built-in multi-threading powered by OpenMP.

Performance Comparison Code

#include <iostream>
#include <vector>
#include <chrono> // For timing
#include <stdexcept>
#include <thread>
#include <algorithm> // For std::min
#include <Eigen/Dense> // For Eigen
#include <omp.h> // For OpenMP

// Matrix class for basic, cache-optimized, and parallel multiplication
class Matrix {
private:
    std::vector<std::vector<double>> data;
    size_t rows, cols;

public:
    Matrix(size_t r, size_t c) : rows(r), cols(c) {
        data.resize(r, std::vector<double>(c, 0.0));
    }

    static Matrix multiplyBasic(const Matrix& A, const Matrix& B) {
        if (A.cols != B.rows) throw std::invalid_argument("Matrix dimensions don't match");
        Matrix C(A.rows, B.cols);
        for (size_t i = 0; i < A.rows; ++i)
            for (size_t j = 0; j < B.cols; ++j)
                for (size_t k = 0; k < A.cols; ++k)
                    C.data[i][j] += A.data[i][k] * B.data[k][j];
        return C;
    }

    static Matrix multiplyCacheOptimized(const Matrix& A, const Matrix& B) {
        constexpr size_t BLOCK_SIZE = 32;
        if (A.cols != B.rows) throw std::invalid_argument("Matrix dimensions don't match");
        Matrix C(A.rows, B.cols);
        for (size_t i0 = 0; i0 < A.rows; i0 += BLOCK_SIZE)
            for (size_t j0 = 0; j0 < B.cols; j0 += BLOCK_SIZE)
                for (size_t k0 = 0; k0 < A.cols; k0 += BLOCK_SIZE)
                    for (size_t i = i0; i < std::min(i0 + BLOCK_SIZE, A.rows); ++i)
                        for (size_t j = j0; j < std::min(j0 + BLOCK_SIZE, B.cols); ++j)
                            for (size_t k = k0; k < std::min(k0 + BLOCK_SIZE, A.cols); ++k)
                                C.data[i][j] += A.data[i][k] * B.data[k][j];
        return C;
    }

    static Matrix multiplyParallel(const Matrix& A, const Matrix& B) {
        if (A.cols != B.rows) throw std::invalid_argument("Matrix dimensions don't match");
        Matrix C(A.rows, B.cols);
        size_t num_threads = std::thread::hardware_concurrency();
        if (num_threads == 0) num_threads = 1; // Guard against hardware_concurrency() returning 0
        std::vector<std::thread> threads;
        auto worker = [&](size_t start_row, size_t end_row) {
            for (size_t i = start_row; i < end_row; ++i)
                for (size_t j = 0; j < B.cols; ++j)
                    for (size_t k = 0; k < A.cols; ++k)
                        C.data[i][j] += A.data[i][k] * B.data[k][j];
        };
        size_t rows_per_thread = A.rows / num_threads;
        for (size_t i = 0; i < num_threads; ++i) {
            size_t start_row = i * rows_per_thread;
            size_t end_row = (i == num_threads - 1) ? A.rows : (i + 1) * rows_per_thread;
            threads.emplace_back(worker, start_row, end_row);
        }
        for (auto& thread : threads) thread.join();
        return C;
    }

    friend void populateMatrix(Matrix& matrix, const std::vector<std::vector<double>>& data);
};

void populateMatrix(Matrix& matrix, const std::vector<std::vector<double>>& data) {
    matrix.data = data;
}

int main() {
    const size_t rows = 1000, cols = 1000;

    // Initialize matrices for custom implementations
    Matrix A(rows, cols), B(cols, rows);
    populateMatrix(A, std::vector<std::vector<double>>(rows, std::vector<double>(cols, 1.0)));
    populateMatrix(B, std::vector<std::vector<double>>(cols, std::vector<double>(rows, 2.0)));

    // Measure Basic Multiplication
    auto start = std::chrono::high_resolution_clock::now();
    Matrix::multiplyBasic(A, B);
    auto end = std::chrono::high_resolution_clock::now();
    std::cout << "Basic Multiplication: "
              << std::chrono::duration<double>(end - start).count() << " seconds\n";

    // Measure Cache-Optimized Multiplication
    start = std::chrono::high_resolution_clock::now();
    Matrix::multiplyCacheOptimized(A, B);
    end = std::chrono::high_resolution_clock::now();
    std::cout << "Cache-Optimized Multiplication: "
              << std::chrono::duration<double>(end - start).count() << " seconds\n";

    // Measure Parallel Multiplication
    start = std::chrono::high_resolution_clock::now();
    Matrix::multiplyParallel(A, B);
    end = std::chrono::high_resolution_clock::now();
    std::cout << "Parallel Multiplication: "
              << std::chrono::duration<double>(end - start).count() << " seconds\n";

    // Initialize Eigen matrices
    Eigen::MatrixXd eigenA = Eigen::MatrixXd::Constant(rows, cols, 1.0);
    Eigen::MatrixXd eigenB = Eigen::MatrixXd::Constant(cols, rows, 2.0);

    // Measure Eigen Multiplication (Single-threaded)
    omp_set_num_threads(1); // Disable multi-threading
    start = std::chrono::high_resolution_clock::now();
    Eigen::MatrixXd eigenC1 = eigenA * eigenB;
    end = std::chrono::high_resolution_clock::now();
    std::cout << "Eigen (Single-threaded): "
              << std::chrono::duration<double>(end - start).count() << " seconds\n";

    // Measure Eigen Multiplication (Multi-threaded with OpenMP)
    omp_set_num_threads(4); // Enable multi-threading with OpenMP
    start = std::chrono::high_resolution_clock::now();
    Eigen::MatrixXd eigenC2 = eigenA * eigenB;
    end = std::chrono::high_resolution_clock::now();
    std::cout << "Eigen (Multi-threaded): "
              << std::chrono::duration<double>(end - start).count() << " seconds\n";

    return 0;
}
    

Performance Results

Performance Results:
Matrix Size: 1000x1000
Basic Multiplication: 27.6698 seconds
Cache-Optimized Multiplication: 20.6074 seconds
Parallel Multiplication: 3.48806 seconds
Eigen (Single-threaded): 12.3204 seconds
Eigen (Multi-threaded): 2.85874 seconds
        

Analysis

The performance results highlight the significant benefits of optimization and parallelization:

  • Basic Multiplication: The simplest implementation, which takes approximately 27.67 seconds. It demonstrates poor performance due to inefficient cache utilization and lack of parallelization.
  • Cache-Optimized Multiplication: Improves execution time to 20.61 seconds by utilizing a blocking strategy to optimize cache usage. This provides a notable improvement over the basic approach.
  • Parallel Multiplication: Leverages multiple threads, reducing execution time to 3.49 seconds. This demonstrates how distributing workload across cores can dramatically improve performance.
  • Eigen (Single-threaded): Eigen’s optimized matrix multiplication performs well, completing in 12.32 seconds, showcasing the library’s efficient algorithms even without multi-threading.
  • Eigen (Multi-threaded with OpenMP): Achieves the best performance at 2.86 seconds. OpenMP enables Eigen to fully utilize multiple cores, delivering exceptional speed for large-scale computations.

Using OpenMP with Eigen

Eigen's multi-threading leverages OpenMP for performance gains. To enable multi-threading with Eigen:
  1. Install OpenMP (e.g., via Homebrew on macOS: brew install libomp).
  2. Compile with OpenMP flags:
    • For GCC/Clang: -fopenmp
    • For MSVC: Enable OpenMP in the project settings.
  3. Use Eigen::setNbThreads(n) or omp_set_num_threads(n) to control the number of threads (a short sketch follows below).
You can find more details about Eigen's multi-threading capabilities in their official documentation.
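
A minimal sketch of step 3, assuming your program is compiled with OpenMP enabled (for example with -fopenmp); the 1024×1024 matrices are an arbitrary size chosen to make multi-threading worthwhile.

#include <iostream>
#include <Eigen/Dense> // Include Eigen library

int main() {
    // Only has an effect when the program is compiled with OpenMP support.
    Eigen::setNbThreads(4);
    std::cout << "Eigen is using " << Eigen::nbThreads() << " thread(s)\n";

    Eigen::MatrixXd A = Eigen::MatrixXd::Random(1024, 1024);
    Eigen::MatrixXd B = Eigen::MatrixXd::Random(1024, 1024);
    Eigen::MatrixXd C = A * B; // large product, a candidate for multi-threading

    std::cout << "C(0, 0) = " << C(0, 0) << "\n";
    return 0;
}

A typical GCC/Clang invocation would look something like g++ -O3 -fopenmp -I /path/to/eigen program.cpp, where the include path points at wherever you unpacked the Eigen headers.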

Tip: Limit the number of threads to the number of physical CPU cores to avoid performance degradation due to hyper-threading. For best results, benchmark with different thread counts to find the optimal configuration for your hardware.

Warning: The performance values provided in this comparison are specific to the system configuration used for testing, including hardware specifications, compiler optimizations, and the number of threads. Your Mileage May Vary (YMMV) based on factors such as:

  • CPU architecture and the number of physical cores versus logical threads (hyper-threading).
  • Cache size, memory bandwidth, and overall system load during execution.
  • Compiler optimizations (e.g., -O2, -O3) and OpenMP settings.
  • Matrix size and sparsity, which can affect cache utilization and parallelization efficiency.

Always benchmark on your specific hardware and with your own data to get accurate performance metrics for your use case.

Performance Optimization Techniques

Efficient matrix multiplication is critical for applications in data science, machine learning, and high-performance computing. Below are optimization techniques tailored to improve the performance of the different methods covered in this post:

  • Basic Multiplication:
    • Reduce Overhead: Minimize unnecessary calculations and loop overhead by unrolling loops manually or relying on compiler optimizations like -O3 in GCC/Clang.
    • Rearrange Loops: Adjust the order of loops to prioritize memory access that aligns with the matrix's data layout (row-major or column-major); see the sketch after this list.
  • Cache-Optimized Multiplication:
    • Tune Block Sizes: Experiment with block sizes to match the specific CPU cache (L1, L2, or L3). For example, start with a block size of 32 or 64 and adjust based on performance profiling.
    • Threaded Blocking: Combine the cache-optimized approach with parallel processing by assigning blocks to different threads, ensuring efficient utilization of multi-core CPUs.
    • Use SIMD: Leverage SIMD (Single Instruction, Multiple Data) instructions to process multiple matrix elements simultaneously within a block, reducing computational cycles.
    • Matrix Transposition: Pre-transpose one of the matrices to align memory access patterns and reduce cache misses during computation.
    • Prefetching: Explicitly load data into the CPU cache ahead of usage by using compiler hints or loop-based prefetching.
  • Parallel Multiplication:
    • Dynamic Thread Allocation: Allocate threads dynamically based on the workload and matrix size to avoid overhead for smaller matrices.
    • Affinity Settings: Bind threads to specific CPU cores to minimize context switching and maximize cache reuse.
    • Divide Work Strategically: Split the rows or blocks among threads evenly, considering the total number of rows and the number of available threads.
  • Eigen (Single-threaded):
    • Optimize Data Alignment: Ensure matrix data is properly aligned in memory to take full advantage of Eigen's internal optimizations and vectorized computations.
    • Use Fixed-Size Matrices: For small matrices, specify fixed sizes at compile time to allow Eigen to apply additional optimizations.
  • Eigen (Multi-threaded with OpenMP):
    • Control Thread Count: Use omp_set_num_threads(n) or Eigen’s setNbThreads(n) to set the optimal number of threads, typically equal to the number of physical CPU cores.
    • Profile Thread Overheads: Benchmark with different thread counts and workloads to ensure the threading overhead does not outweigh the performance gains.
    • Enable Compiler Optimizations: Compile with -O3 or similar optimization flags to enable advanced vectorization and parallelization features.

These optimization techniques can dramatically improve the efficiency of matrix multiplication, making it suitable for both small-scale applications and large-scale computations. The best approach depends on the size of the matrices, hardware specifications, and the computational context.
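
To make the loop-reordering idea from the list above concrete, here is a hedged sketch of the i-k-j ordering on flat, row-major storage; the function name multiplyIKJ and the use of flat vectors (rather than the Matrix class used earlier) are illustrative choices.

#include <vector>

// i-k-j loop order: the classic i-j-k order walks B column-by-column and
// causes frequent cache misses, while i-k-j streams through B and C row-wise.
// A is m x n, B is n x p, C is m x p and must be zero-initialized.
void multiplyIKJ(const std::vector<double>& A,
                 const std::vector<double>& B,
                 std::vector<double>& C,
                 size_t m, size_t n, size_t p) {
    for (size_t i = 0; i < m; ++i)
        for (size_t k = 0; k < n; ++k) {
            double a = A[i * n + k]; // reused across the whole j loop
            for (size_t j = 0; j < p; ++j)
                C[i * p + j] += a * B[k * p + j]; // contiguous access to B and C
        }
}

On row-major data this simple reordering alone often gives a noticeable speedup before any blocking or threading is applied.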

Use Cases

Matrix multiplication is a fundamental operation in various domains of science, engineering, and technology. Its versatility stems from its ability to represent complex systems and transformations in a concise mathematical form. Here are some of the most common applications of matrix multiplication:

  • Computer Graphics: Matrix multiplication is essential in transforming 3D models for rendering. Common operations include scaling, rotation, and translation, which are represented as matrix transformations. For example, in video games, multiplying a vector by a transformation matrix updates the position of objects in 3D space.
    Example: \[ \text{Transformed Position} = \text{Transformation Matrix} \times \text{Original Position Vector} \]
  • Machine Learning: Neural networks heavily rely on matrix multiplication for forward propagation and backpropagation. Matrix operations are used to compute activations, gradients, and updates for weights in training large models like GPT.
    Example: \[ \text{Output Layer} = \text{Weights Matrix} \times \text{Input Vector} + \text{Bias Vector} \]
  • Scientific Computing: Solving systems of linear equations, eigenvalue problems, and simulations often require matrix multiplication. For example, in physics, it is used in finite element analysis (FEA) to model complex systems like fluid flow or stress analysis in materials.
    Example: \[ A\mathbf{x} = \mathbf{b} \implies \mathbf{x} = A^{-1}\mathbf{b} \]
  • Image Processing: Matrices represent images, where each pixel corresponds to an element in the matrix. Convolution operations, a type of matrix multiplication, are widely used in filters, edge detection, and feature extraction in computer vision.
    Example: \[ \text{Filtered Image} = \text{Kernel Matrix} \ast \text{Input Image Matrix} \]
  • Quantum Computing: Quantum states and operations are described using matrices. Matrix multiplication is used to compute the evolution of quantum states under unitary transformations and for simulating quantum gates.
    Example: \[ \text{Final State} = \text{Quantum Gate Matrix} \times \text{Initial State Vector} \]
  • Robotics: Matrices represent kinematic chains in robotic systems. Matrix multiplication is used to compute the positions and orientations of robotic arms and legs in 3D space.
    Example: \[ \text{End Effector Position} = \text{Transformation Matrices Chain} \times \text{Base Position} \]
  • Cryptography: Some cryptographic algorithms use matrix multiplication to encode and decode messages. Matrix-based methods like Hill Cipher leverage the properties of modular arithmetic with matrices.
    Example: \[ \text{Encrypted Message} = \text{Key Matrix} \times \text{Plaintext Vector} \mod \text{n} \]
  • Natural Language Processing: Attention Mechanism: Matrix multiplication is the backbone of the attention mechanism, a revolutionary concept in deep learning that underpins modern large language models (LLMs) like GPT, BERT, and other transformer-based architectures. The attention mechanism enables these models to dynamically focus on relevant parts of input sequences, capturing complex dependencies and relationships.
    Example: \[ \text{Attention(Q, K, V)} = \text{softmax}\left(\frac{Q \cdot K^\top}{\sqrt{d_k}}\right) \cdot V \] Here:
    • \(Q\): Query matrix representing the input tokens.
    • \(K\): Key matrix representing the input context.
    • \(V\): Value matrix representing the context’s values.
    • \(d_k\): Dimensionality of the keys, used to scale the dot product.

    This mechanism allows LLMs to weigh the importance of different tokens in the input sequence dynamically, providing the flexibility to understand and generate coherent and contextually accurate outputs. For example, in a sentence, attention helps the model determine which words to focus on to capture nuanced meanings, such as understanding pronouns or idiomatic expressions.

    Without matrix multiplication at its core, attention mechanisms—and by extension, transformer models—would not be computationally feasible. The scalability and efficiency of these operations are critical for training and deploying LLMs that power applications like chatbots, search engines, and automated content generation.

These examples illustrate the broad applicability of matrix multiplication across diverse fields. Its efficient implementation is crucial for advancing technologies in artificial intelligence, data processing, and scientific discovery.

Tip: Understanding the specific requirements of your application (e.g., real-time processing, precision, or scalability) will help you choose the most appropriate matrix multiplication implementation.

Best Practices

Implementing matrix multiplication efficiently requires adhering to certain best practices to ensure optimal performance, maintainability, and accuracy. Here are some key recommendations:

  • Choose the Right Implementation: Use the basic method for small matrices, cache-optimized for moderately sized ones, and parallel implementations for large-scale computations.
  • Optimize Memory Access: Structure your loops to prioritize row-major or column-major access based on the data layout of your matrix. This reduces cache misses.
  • Test with Realistic Data: Benchmark your implementation with real-world matrix sizes and values to ensure it performs well in your specific use case.
  • Leverage Existing Libraries: For production-grade applications, consider using optimized libraries like Eigen, Armadillo, or Intel MKL, which implement highly optimized matrix operations.
  • Parallelism: Use multithreading or GPU acceleration where appropriate, but ensure the overhead of thread or task management does not outweigh the benefits for smaller matrices.

Common Pitfalls

  • Mismatched Dimensions: Ensure the number of columns in the first matrix matches the number of rows in the second matrix. Mismatched dimensions will lead to runtime errors or incorrect results.
  • Thread Overhead: For small matrices, the overhead of creating and managing threads can result in slower performance than a single-threaded implementation.
  • Floating-Point Precision: Large matrices can accumulate rounding errors due to the limited precision of floating-point numbers. Use higher precision (e.g., `double` or `long double`) if accuracy is critical.
  • Poor Cache Utilization: Accessing data out of order can lead to frequent cache misses. Arrange your loops and data layout to align with the memory hierarchy.
  • Ignoring Compiler Optimizations: Always enable compiler optimizations (e.g., `-O2` or `-O3` in GCC/Clang) to take advantage of automatic loop unrolling and vectorization.

Tip: Always validate your implementation with small matrices before scaling up to larger datasets. This helps catch bugs early and ensures correctness.
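
Following that tip, here is a hedged sketch of a quick correctness check that reuses the 2×3 by 3×2 example from the Mathematical Background section; the flat-vector multiplyRef helper is illustrative and stands in for whichever implementation you want to validate.

#include <cassert>
#include <vector>

// Naive reference multiply on flat, row-major storage (for validation only).
// A is m x n, B is n x p, the result is m x p.
std::vector<double> multiplyRef(const std::vector<double>& A,
                                const std::vector<double>& B,
                                size_t m, size_t n, size_t p) {
    std::vector<double> C(m * p, 0.0);
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < p; ++j)
            for (size_t k = 0; k < n; ++k)
                C[i * p + j] += A[i * n + k] * B[k * p + j];
    return C;
}

int main() {
    // The worked example from earlier: C should be {58, 64, 139, 154}.
    std::vector<double> A = {1, 2, 3, 4, 5, 6};        // 2x3
    std::vector<double> B = {7, 8, 9, 10, 11, 12};     // 3x2
    std::vector<double> expected = {58, 64, 139, 154}; // 2x2

    // Exact comparison is fine here because all values are small integers.
    assert(multiplyRef(A, B, 2, 3, 2) == expected);
    return 0;
}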

Conclusion

Matrix multiplication is a cornerstone operation in linear algebra and a fundamental building block in various fields, from computer graphics and machine learning to robotics and quantum computing. Its ability to model transformations, solve systems of equations, and process high-dimensional data makes it indispensable in both theoretical and practical applications.

However, implementing matrix multiplication efficiently requires careful attention to computational complexity, memory access patterns, and hardware capabilities. Challenges such as high execution time for large matrices, cache inefficiencies, and the overhead of parallelization can hinder performance if not addressed properly. Leveraging techniques like cache-optimized algorithms, parallel processing, and numerical libraries like Eigen or BLAS can help overcome these challenges while achieving scalability and accuracy.

In this guide, we explored several approaches to matrix multiplication in C++: the basic implementation for educational purposes, a cache-optimized version to reduce memory access overhead, a parallelized implementation to utilize multi-core processors effectively, and the Eigen library for production-grade performance. Along the way, we demonstrated practical examples, analyzed performance metrics, and shared best practices to help you design robust solutions tailored to your needs.

Congratulations on reading to the end of this tutorial. If you found it useful, please share the blog post or cite it using the attribution details below.

Feel free to explore the Further Reading section for additional resources to deepen your understanding of C++ and its Standard Template Library (STL).

Further Reading

Mastering matrix multiplication and related numerical methods in C++ requires a strong grasp of both programming and mathematical concepts. Below is a curated list of resources to deepen your knowledge of C++ and its applications in numerical computing. These resources include official documentation, tutorials, and advanced tools to help you implement and optimize matrix operations effectively.

  • How to do Matrix Multiplication in R: A practical guide to implementing matrix multiplication in R, with code examples and performance optimization tips.
  • How to Multiply Two Matrices in Python: Explore matrix operations in Python using popular libraries like NumPy, with detailed examples and comparisons.
  • C++ Vector Documentation: A detailed guide to std::vector, explaining its capabilities, performance considerations, and best practices for dynamic array management.
  • Eigen Documentation: Learn about Eigen, a powerful library for linear algebra, matrix manipulation, and numerical solvers. It offers highly optimized implementations and is widely used in the industry.
  • C++ Core Guidelines: A comprehensive set of best practices for writing safe, efficient, and maintainable C++ code, endorsed by the ISO C++ standards committee.
  • C++ Solutions: Explore our blog for detailed C++ tutorials, practical examples, and advanced guides, including matrix operations, performance optimization, and numerical computing.
  • Try Our Online C++ Compiler: Test your C++ matrix multiplication implementations directly in your browser using our professional-grade online compiler, designed for efficient debugging and experimentation.
  • How To Find Maximum/Minimum Values in a Matrix Using C++: Discover techniques to efficiently find the maximum and minimum values in a matrix using C++, with step-by-step guidance and performance-optimized code examples.
  • Learn C++: A beginner-to-advanced resource for learning C++, covering everything from basic syntax to templates, memory management, and performance optimization.
  • arXiv: Linear Algebra and Numerical Analysis: Access academic papers and research articles on cutting-edge techniques in linear algebra and numerical computing for advanced problem-solving.
  • Intel oneMKL: Discover Intel’s Math Kernel Library, which provides highly optimized routines for matrix multiplication, vector operations, and linear algebra tasks.
  • STL Source Code (Microsoft): Examine the implementation of the C++ Standard Template Library to better understand the design and performance characteristics of its components.
  • NumPy Documentation: While not specific to C++, NumPy is an excellent reference for understanding efficient numerical computations and can inspire similar techniques in C++ implementations.

Attribution and Citation

If you found this guide and tools helpful, feel free to link back to this page or cite it in your work!
