How to Get a Substring From a String in Python

What is a String?

A string is a sequence of Unicode characters, which can include letters, digits, and special characters. Unicode is a standard designed to represent the characters of all written languages. It assigns each character a unique number called a code point; how many bytes that number occupies depends on the encoding (UTF-8, for example, uses one to four bytes per character). A string is one of Python's built-in data types and is a fundamental building block for data manipulation and analysis. Many Python libraries use strings for Natural Language Processing. To learn more about these libraries, you can visit the article titled “Top 12 Python Libraries for Data Science and Machine Learning”.

Python has a built-in string class called str. Python strings are immutable, which means they cannot be changed once we create them. Because of this, every manipulation produces a new string rather than modifying the original. Strings can only be concatenated with other strings. If you try to concatenate a string with a value of another type, such as an integer, Python raises TypeError: can only concatenate str (not “int”) to str.
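As a minimal sketch of immutability and the concatenation error described above (the variable names here are illustrative and not from the original article):

```python
# Strings are immutable: item assignment fails, so we build a new string instead.
language = "python"
try:
    language[0] = "P"
except TypeError as err:
    immutability_error = str(err)
    print(immutability_error)

# Build a new string from a literal and a slice; `language` itself is unchanged.
capitalized = "P" + language[1:]
print(capitalized)

# Concatenating a str with an int raises TypeError.
try:
    label = "version " + 3
except TypeError as err:
    concat_error = str(err)
    print(concat_error)

# Converting the integer with str() first makes the concatenation legal.
label = "version " + str(3)
print(label)
```

Converting the other value with str() before concatenating is one common way around the error; f-strings are another option.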

To represent a string, you need to wrap it within quotes; these can be single, double or triple quotes. Triple-quoted strings let you write strings that span multiple lines, preserving the line breaks and white space inside them.
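A short sketch of the three quoting styles, using made-up example values:

```python
single = 'research'
double = "research"

# A triple-quoted string keeps the line break and the indentation on the second line.
multi_line = """research
    scientist"""

# The quote style does not change the value of the string.
print(single == double)

# repr() shows the newline and the leading spaces that are stored in the string.
print(repr(multi_line))
```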

We will explore the concept of the substring and how to extract substrings in Python.

What is a Substring in Python?

A substring is a part of a string. Because strings are sequences, we can slice a string using the index operators “[” and “]”. Slicing is only valid for subscriptable objects, that is, objects such as strings, lists, and tuples whose items can be accessed by index. If we try to slice a non-subscriptable object like an integer, Python raises TypeError: ‘int’ object is not subscriptable.
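The difference between a subscriptable and a non-subscriptable object can be sketched as follows (the values are illustrative):

```python
word = "research"
first_character = word[0]    # indexing returns a single character
first_three = word[0:3]      # slicing returns a substring
print(first_character, first_three)

number = 12345
try:
    number[0]                # integers do not support indexing or slicing
except TypeError as err:
    subscript_error = str(err)
    print(subscript_error)

# Converting the integer to a string first makes slicing possible.
leading_digits = str(number)[0:2]
print(leading_digits)
```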

The syntax of slicing is:

string[start:end:step]

We use “start” to define the starting index and “end” to mark the endpoint, which is not included in the result. We set “step” to move n characters at a time. Let’s take a look at an example of slicing:

string = "plrfegsmeqaorycahi"
print(string[2:len(string):2])
research

We are telling Python to start at index 2; bear in mind that indices always begin at 0, so this is the third character. Then we tell Python to stop at the end of the string, which we specify using the length of the string. The slice stops one character before the end index, so the last character it can include is at index len(string) - 1. We set step to 2 to take every second character.

Let’s look at a visual example of the string “research scientist” with the indices of each character, including the whitespace between “research” and “scientist”.

Example of a string with the character index highlighted. Source: Me
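If the figure is not available, the same index-to-character mapping can be printed with a small sketch like this one, which uses the built-in enumerate():

```python
string = "research scientist"

# Pair every character with its index, as in the figure.
indexed = list(enumerate(string))

for index, character in indexed:
    # repr() makes the white space between the two words visible as ' '.
    print(index, repr(character))
```

The white space sits at index 8, so “research” occupies indices 0 to 7 and “scientist” starts at index 9.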

We can slice this string to get a substring, using the index as indicated, for example:

string = 'research scientist'
print(string[0:8])
research

The substring we end up with starts at index 0 and ends at the index that comes before the endpoint. In our example, the endpoint is 8, so the substring will end at index 7. Using the string above, let’s look at the different ways we can extract substrings.

Using split()

String objects have a split() method, which divides a string into a list of strings using an optional delimiter argument; with no argument, it splits on white space. Let’s look at an example of using split() on each sentence in a list.

# Define sentence list
sentences = ["Learning new things is fun", "I agree"]

# Iterate over items in list
for sentence in sentences:
    # Split sentence using white space
    words = sentence.split()
    print(words)

# Note: calling sentences.split() on the list itself would raise an AttributeError
['Learning', 'new', 'things', 'is', 'fun']
['I', 'agree']

The split() method belongs to strings, not lists. If you try to split a list directly, you will raise the error “AttributeError: ‘list’ object has no attribute ‘split’”.
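The delimiter argument mentioned above can be sketched with a few hypothetical strings. The second call uses rsplit() with a maxsplit of 1 to split only at the last delimiter:

```python
# Split on an explicit delimiter instead of white space.
record = "name,role,team"
fields = record.split(",")
print(fields)

# rsplit() with maxsplit=1 splits once, starting from the right.
path = "home/user/data/file.txt"
folder, filename = path.rsplit("/", 1)
print(folder)
print(filename)

# Indexing the resulting list picks out a single substring.
sentence = "Learning new things is fun"
first_word = sentence.split()[0]
print(first_word)
```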

Using [start:]

We can slice a string by just using the start point. This slicing method will return a substring that begins at the start index and includes the rest of the string. Let’s look at an example of a start value of 9:

string = 'research scientist'
print(string[9:])
scientist

Our output shows the substring starts at index 9 of “research scientist”, which is “s”, and the slice returns the rest of the string, giving us “scientist”.
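Two further behaviours of [start:] are worth a quick sketch: the start index may be negative, in which case it counts from the end of the string, and a start index beyond the end of the string gives an empty string rather than an error.

```python
string = 'research scientist'

# A negative start counts back from the end: the last nine characters.
last_nine = string[-9:]
print(last_nine)

# A start index past the end of the string returns an empty string, not an error.
past_the_end = string[50:]
print(repr(past_the_end))
```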

Using [:end]

We can use [:end] to specify the endpoint of the substring. This slicing method returns a substring containing every character that comes before the end index. Let’s look at an example with the end value of 8:

string = 'research scientist'

print(string[:8])
research

The end index is 8, so the substring will include everything up to and including the character at index 7. This behaviour means that the end index is non-inclusive.

There are instances where we want to remove certain characters at the end of a string. Examples include filenames and websites. In those cases, we can use negative indices to index from the end of the string instead of the start. Let’s look at an example of removing a file type from a string:

string = 'fascinating_data.txt'

print(string[:-4])
fascinating_data

For this file type, the last four characters will always be “.txt”, so we can slice off the same number of characters from the end of the string each time.
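When the extension length is not fixed, a fixed-width slice no longer works. One alternative, sketched here with hypothetical filenames, is os.path.splitext from the standard library, which splits at the last dot:

```python
import os.path

# Hypothetical filenames with extensions of different lengths.
filenames = ["fascinating_data.txt", "results.json", "archive.tar.gz"]

stems = []
for filename in filenames:
    # splitext() returns (root, extension); only the final extension is separated.
    stem, extension = os.path.splitext(filename)
    stems.append(stem)
    print(stem, extension)
```

Note that only the final extension is removed, so “archive.tar.gz” becomes “archive.tar”.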

Using [start:end:step]

This slicing method is the most complex, adding “step” to a slice operator to skip certain characters. Let’s look at our example with a step size of 2:

string = 'research scientist'

print(string[0:15:2])
rsac cet

The step size of 2 means the substring contains every second character, starting at index 0 and stopping before index 15, so the last character taken is at index 14.
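To make the role of each argument concrete, the sketch below rebuilds the same slice from range(), which takes the same start, end, and step values, and shows that start and end can be omitted:

```python
string = 'research scientist'

# The slice visits the same indices that range(0, 15, 2) produces.
stepped = string[0:15:2]
rebuilt = "".join(string[i] for i in range(0, 15, 2))
print(stepped == rebuilt)

# Omitting start and end applies the step to the whole string.
every_second = string[::2]
print(every_second)

# A step of -1 walks the string backwards, which reverses it.
backwards = string[::-1]
print(backwards)
```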

Using List Comprehension

We can use a nifty combination of slicing and list comprehension to get all substrings from a string. Let’s look at an example with the string “PYTHON”. We have to specify two “for” clauses: one to iterate over the start indices and one to iterate over the end indices for each start.

string = 'PYTHON'

# Use a name other than "str" so the built-in str class is not shadowed
substrings = [string[i:j]
    for i in range(len(string))
    for j in range(i + 1, len(string) + 1)]

print(substrings)
['P', 'PY', 'PYT', 'PYTH', 'PYTHO', 'PYTHON', 'Y', 'YT', 'YTH', 'YTHO', 'YTHON', 'T', 'TH', 'THO', 'THON', 'H', 'HO', 'HON', 'O', 'ON', 'N']

The output is a list of all possible contiguous substrings of the string “PYTHON”.
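The same comprehension can be wrapped in a function for reuse. This is a sketch, and the function name all_substrings is our own, not from a library. A string of length n has n(n + 1)/2 non-empty contiguous substrings, which gives a quick sanity check:

```python
def all_substrings(text):
    """Return every non-empty contiguous substring of text, in order of start index."""
    return [text[i:j]
            for i in range(len(text))
            for j in range(i + 1, len(text) + 1)]

substrings = all_substrings("PYTHON")
n = len("PYTHON")

# A string of length n has n * (n + 1) / 2 contiguous substrings.
print(len(substrings), n * (n + 1) // 2)

# With repeated characters the list contains duplicates; a set removes them.
unique_substrings = set(all_substrings("aa"))
print(unique_substrings)
```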

Check if a Substring Exists in a String

In the article titled Python: Check if String Contains a Substring, I explore the various ways to check if a substring exists.
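As a brief sketch of two common approaches, the in operator returns a Boolean, while str.find() returns the index of the first match, or -1 when there is none. That index can then be used as the start of a slice:

```python
string = 'research scientist'

# Membership test with the in operator.
contains_search = "search" in string
print(contains_search)

# find() returns the index of the first match, or -1 if there is no match.
position = string.find("scientist")
missing = string.find("engineer")
print(position, missing)

# Use the found index as the start of a slice.
if position != -1:
    from_match = string[position:]
    print(from_match)
```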

Similarities Between Strings

Strings can represent text documents of any size. We can explore similarities between documents by using similarity measures or distances such as Jaccard similarity or cosine similarity.
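As an illustration, here is a minimal sketch of Jaccard similarity computed on the sets of words in two strings. The function name and the choice to split on white space and lowercase the text are assumptions made for this example; real applications usually need more careful tokenization.

```python
def jaccard_similarity(text_a, text_b):
    """Jaccard similarity of the word sets of two strings: |A & B| / |A | B|."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    # Treat two empty documents as identical to avoid dividing by zero.
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)

# One shared word ("scientist") out of three distinct words in total.
score = jaccard_similarity("research scientist", "data scientist")
print(score)
```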

Python String Comparison

For further reading on how to compare strings using relational and identity operators, go to the article titled “How to Compare Strings In Python”.

Summary

Congratulations on making it to the end of this tutorial! We have gone through the definition of the string data type in Python and how to slice strings to get substrings.

Go to the online courses page on Python to learn more about Python for data science and machine learning.

Have fun and happy researching!