In Python 2.x, the ‘u’ in front of string values indicates that the string is a Unicode string. In Python 3, all strings are Unicode by default, and therefore you will not see the ‘u’ in front of a Unicode string.

This tutorial will go through the use of Unicode strings in Python and the differences in how strings are defined in Python 2 and Python 3.


What is a String in Python?

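A string is a sequence of characters, where each character is a symbol. Computers store only binary data, so characters must be converted to bytes. The conversion of characters to bytes is called encoding, and the reverse is decoding. Python 2's str type holds raw bytes, whereas Python 3's str type holds Unicode characters. ASCII and Unicode are the most commonly used character standards, and UTF-8 is the most widely used encoding of Unicode.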
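As a small illustration of encoding and decoding, the following Python 3 sketch converts a string to bytes and back. The sample word and the choice of UTF-8 are assumptions made for the example.

```python
# Python 3 sketch: encoding turns text into bytes, decoding reverses it.
text = "café"

# Encode the characters to bytes (UTF-8 is an assumed choice of encoding).
encoded = text.encode("utf-8")
print(encoded)          # b'caf\xc3\xa9'

# Decode the bytes back into the original characters.
decoded = encoded.decode("utf-8")
print(decoded)          # café
print(decoded == text)  # True
```

Note that the single character é becomes two bytes, which is why the length of a string and the length of its encoded form can differ.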

What is ASCII?

ASCII stands for American Standard Code for Information Interchange and is a character encoding standard for electronic communication. We use ASCII codes to represent text in computers. ASCII defines 128 characters (English letters, digits, punctuation, and control codes), with each character assigned a specific number between 0 and 127.
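The mapping between characters and ASCII numbers can be inspected with the built-in ord() and chr() functions. This is a minimal Python 3 sketch; the sample characters are chosen only for illustration.

```python
# Python 3 sketch: ASCII maps each character to a number from 0 to 127.
print(ord("A"))   # 65
print(chr(97))    # a

# Pure-ASCII text encodes to one byte per character.
ascii_bytes = "Hi".encode("ascii")
print(list(ascii_bytes))  # [72, 105]

# A character outside the ASCII range cannot be encoded as ASCII.
try:
    "é".encode("ascii")
    outside_ascii = False
except UnicodeEncodeError:
    outside_ascii = True
print(outside_ascii)  # True
```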

What is Unicode?

Unicode, formally the Unicode Standard, aims to represent every character in every written language in the world by assigning each character a unique number, called a code point. The Unicode Consortium maintains the Unicode Standard, which holds more than 140,000 characters, including historic scripts, symbols, and emojis. Unicode represents many more characters than ASCII, and its first 128 code points match ASCII. In Python 3, strings are Unicode by default, but in Python 2 the u prefix distinguishes Unicode strings.
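To make the idea of a unique number per character concrete, here is a short Python 3 sketch that looks up the number and official name of a few characters using the standard unicodedata module. The sample characters are arbitrary choices for illustration.

```python
import unicodedata

# Python 3 sketch: every character has a unique Unicode number and name.
samples = ["A", "é", "學"]
for ch in samples:
    print(ch, hex(ord(ch)), unicodedata.name(ch))

# The same character can take a different number of bytes once encoded.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
print(utf8_lengths)  # {'A': 1, 'é': 2, '學': 3}
```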

What is the ‘u’ Before a String in Python?

In Python 2, we can create a Unicode string by putting a u in front of the string literal or by calling the built-in unicode() function. The unicode() function exists in Python 2 only. Let’s look at an example:

import sys

print sys.version

string = u'test'

print type(string) 

string2 = unicode('test')

print type(string2)
2.7.16 |Anaconda, Inc.| (default, Sep 24 2019, 16:55:38) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
<type 'unicode'>
<type 'unicode'>

By default, a Python 2 string literal without a prefix has the type str, which is simply a sequence of bytes, and the default encoding is ASCII.

string = 'test'

print type(string)
<type 'str'>

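We can use Unicode for non-English characters, for example, “Learning is fun!” in Cantonese. Note that a Python 2 source file containing non-ASCII characters needs an encoding declaration such as # -*- coding: utf-8 -*- on its first or second line: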

string = u'學習好有趣!'

print string 

print type(string)
學習好有趣!
<type 'unicode'>
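For comparison, the sketch below shows how the same literals behave in Python 3 (version 3.3 or later, where the u prefix is accepted again for compatibility). It is an illustrative example rather than part of the Python 2 session above.

```python
# Python 3 sketch (3.3 or later): the u prefix is accepted but changes nothing.
plain = 'test'
prefixed = u'test'

print(type(plain))        # <class 'str'>
print(type(prefixed))     # <class 'str'>
print(plain == prefixed)  # True

# Non-English text needs no prefix in Python 3.
greeting = '學習好有趣!'
print(type(greeting))     # <class 'str'>
print(len(greeting))      # 6
```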

What is the r Prefix Before a String in Python?

The r prefix tells the Python interpreter to treat the string as a raw string literal, in which backslashes are kept as literal characters. For example, you can use r to tell the Python interpreter to interpret a backslash as “just a backslash” instead of the start of an escape sequence such as \n for a newline or \t for a tab.

string = 'test\"'

print(string)

string2 = r'test\"'

print(string2)
test"
test\"

The r prefix is helpful for writing regular expressions because the syntax of regular expression patterns uses backslashes often.
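To illustrate the point about regular expressions, here is a minimal sketch using the standard re module. The pattern and the sample text are made up for the example.

```python
import re

# Without r, '\n' is a single newline character; with r it is two characters.
print(len('\n'))    # 1
print(len(r'\n'))   # 2

# A raw string keeps the regex backslash intact: \d+ matches runs of digits.
pattern = r'\d+'
numbers = re.findall(pattern, 'order 66 shipped 3 items')
print(numbers)      # ['66', '3']
```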

What is the b Prefix Before a String in Python?

The b prefix specifies a bytes literal in Python. A bytes object is an immutable sequence of integers, each with a value between 0 and 255. In Python 3, we can write ASCII text directly as bytes by putting b in front of the string literal; text containing other characters has to be converted with the encode() method instead. Let’s look at an example:

import sys

print(sys.version)

string = b'this is a string'

print(string)

print(type(string))
3.8.8 (default, Apr 13 2021, 12:59:45) 
[Clang 10.0.0 ]
b'this is a string'
<class 'bytes'>
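The following Python 3 sketch, offered as an illustration rather than a complete treatment, shows that the elements of a bytes object are integers and how to convert between bytes and str explicitly. The choice of the ASCII and UTF-8 codecs is an assumption for the example.

```python
# Python 3 sketch: bytes and str are distinct types, converted explicitly.
data = b'this is a string'

# Indexing a bytes object returns an integer between 0 and 255.
print(data[0])      # 116 (the ASCII code for 't')

# decode() turns bytes into str; encode() goes the other way.
text = data.decode('ascii')
print(text)         # this is a string
print(text.encode('ascii') == data)   # True

# Non-ASCII characters cannot appear in a b'' literal, so encode them instead.
encoded = '學'.encode('utf-8')
print(len(encoded))  # 3
```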

In Python 2 (version 2.6 and later), the interpreter accepts the b prefix but ignores it, because bytes is simply an alias for str. We can verify that the two types are the same with the following code:

import sys

print sys.version

string = 'test'

print type(string) == bytes
2.7.16 |Anaconda, Inc.| (default, Sep 24 2019, 16:55:38) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
True
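Running a similar check under Python 3 gives the opposite result, because str and bytes are separate types there. The sketch below is an illustrative Python 3 counterpart to the Python 2 session above.

```python
# Python 3 sketch: str and bytes are different types.
string = 'test'
is_bytes = type(string) == bytes
print(is_bytes)                  # False
print(type(b'test') == bytes)    # True

# A str never compares equal to bytes, even with the same ASCII content.
same_content = ('test' == b'test')
print(same_content)              # False
```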

Summary

Congratulations on reading to the end of this tutorial. Prefixing a string with u in Python 2 creates a Unicode string. By default, a Python 2 string has the type str, which is raw bytes. By default, string literals in Python 3 are Unicode, so there is no need to use a u before a string in Python 3, although Python 3.3 and later still accept the prefix for compatibility with Python 2 code.

For further reading on strings and decoding, go to the article: How to Solve Python AttributeError: ‘str’ object has no attribute ‘decode’.

Go to the online courses page on Python to learn more about coding in Python for data science and machine learning.

Have fun and happy researching!