How to Calculate the Hamming Distance in Python

by | Data Science, Programming, Python, Tips

Hamming distance is a type of string metric for finding how similar two binary data strings are. If the strings are equal in length, Hamming distance determines the number of bit positions different between them. We can also describe Hamming distance as the minimum number of substitutions required to change one string into another or the minimum number of errors that transform one string to another. In this article, we will go through examples of Hamming distance using Python and applications of Hamming Distance across multiple disciplines.


Visual Description of Hamming Distance

Let’s look at an example of calculating the Hamming distance between two DNA sequences:

There are edits at two locations between the two strings, and therefore the Hamming distance is 2.

The Hamming distance applies to any string, not just DNA sequences. Calculating the Hamming distance by hand can be time-consuming once strings become hundreds or thousands of characters long.

For ease and speed, we can calculate the Hamming distance programmatically.

Calculating the Hamming Distance Between Two Bit Strings

In data science, you may encounter bitstrings when dealing with one-hot encoded categorical columns of data. To calculate the Hamming distance between bitstrings, we sum the differences between the strings, which will always be 0 or 1, then normalize the value by dividing by the length of the bitstrings. Let’s look at making a Hamming distance function and apply it to two bitstrings.

# Manually calculating the Hamming distance between two bit strings

# Custom function

def hamming_distance(a, b):

# Determine if strings are equal in length

    if len(a) != len(b):

        print('String are not equal in length')

    else:

        return sum(abs(e1 - e2) for e1, e2 in zip(a, b))/ len(a)

# Input data

bit_1 = [0, 0, 1, 0, 1, 0]

bit_2 = [0, 0, 0, 1, 0, 0]

# Calculate distance

dist = hamming_distance(bit_1, bit_2)

# Print result

print(dist)
0.5

We can see by eye that there are three differences between the bitstrings, or three out of the six-bit positions are different. When averaged over the length of the bitstrings (3 / 6 ), the Hamming distance is 0.5.

Python has many scientific computing libraries that provide distance metric functions. Let’s look at an example of using SciPy to calculate the Hamming distance between the same bitstrings in the manual example.

# Calculating Hamming distance between bit strings using SciPy

# Get Hamming function

from scipy.spatial.distance import hamming

# Calculate Hamming distance

dist = hamming(bit_1, bit_2)

# Print result

print(dist)
0.5

Running the code, we can see we obtain the same result as the manual implementation of 0.5.

The History of the Hamming Distance

Richard Hamming invited the Hamming distance, an American mathematician, who programmed IBM calculating machines for the Manhattan project. He moved to Bell Laboratories from the Manhattan Project and developed the Hamming distance, which has profound implications for the fields of computer science and telecommunications.

In 1950, Hamming published a paper that postulated it could detect and correct errors in bit strings by calculating the number of disparate bits between valid and erroneous code, which became known as the Hamming distance.

The Hamming Weight

The Hamming weight is the Hamming distance from the zero string, a string consisting of all zeros of the same length as the selected string. In other words, it is the number of elements in the string that are not zero. This value would be the number of 1’s present in the string for a binary string. For example, the Hamming weight of 101011 is 4. The hamming weight is also called the population count, popcount, sideways sum, or bit summation.

Metric Space using Hamming Distance

The metric space of length-n binary strings, using the Hamming distance, is known as the Hamming cube and is equivalent as a metric space to the set of distances between vertices in a hypercube graph.

Hamming distance cube for 4-bit binary numbers
Hamming distance cube for 4-bit binary numbers. Source

We can view a binary string of length n as a vector in \mathbb{R}^{n}, where each character or symbol in the string is a coordinate. The strings form the vertices of an n-dimensional hypercube, and the Hamming distance of the strings is equivalent to the Manhattan distance between the vertices.

Limitations of Hamming Distance

For comparing strings that are different lengths, or strings where not just substitutions but also insertions or deletions can occur, we can use the Levenshtein distance.

Applications of Hamming Distance

Hamming distance has several applications, including:

  • Block code in coding theory, where block code may refer to any error-correcting code that acts on a block of k bits of input data to produce n bits of output data.
  • Automatic spelling correction can determine candidate corrections for a misspelt word by selecting words from a dictionary that have a low Hamming distance from the misspelt word.
  • In bioinformatics, we can use the Hamming distance to quantify the similarity between DNA sequences, which we can represent as strings of the characters A, C, G, and T.
  • In telecommunication, Hamming distance is used to count the number of flipped bits in a fixed-length binary word as an estimation of error. This is called the signal distance.
  • Machine learning like semi-supervised clustering, nearest neighbor classification, and kernel-based methods require the existence of a pairwise similarity measure on the input space. The Hamming distance provides a natural similarity measure between binary codes and can be computed with a few machine instructions per comparison. The exact nearest neighbor search in Hamming space is significantly faster than linear search, with sublinear run-times.
  • Employing Hamming distance-based test statistics for studies concerning population heterogeneity.

Summary

Congratulations on reading to the end of this article!

You know the Hamming distance, its appropriate use, and how to calculate it in Python both manually and through importing SciPy.

The Hamming distance between two strings of equal length is the minimum number of edits required to turn one string into another.

The Hamming distance has limitations; it does not apply to bitstrings of different lengths, in which case we can use the Levenshtein distance.

Hamming distance has many applications in multiple disciplines and is suitable for vast datasets with interpretable results.

For how to calculate the Hamming distance in R, go to the article: How to Calculate the Hamming Distance in R.

You can go to the article on Cosine Similarity and Jaccard Similarity for more on similarity measures.

Have fun and happy researching!

Research Scientist at Moogsoft | + posts

Suf is a research scientist at Moogsoft, specializing in Natural Language Processing and Complex Networks. Previously he was a Postdoctoral Research Fellow in Data Science working on adaptations of cutting-edge physics analysis techniques to data-intensive problems in industry. In another life, he was an experimental particle physicist working on the ATLAS Experiment of the Large Hadron Collider. His passion is to share his experience as an academic moving into industry while continuing to pursue research. Find out more about the creator of the Research Scientist Pod here and sign up to the mailing list here!