Introduction
Python strings are sequences of Unicode characters, and this chapter explains what that means. We start with a brief introduction to what Python strings are, which object type they belong to, which expressions can be applied to them, how they are used in programs, and which methods they provide for common tasks. The section on string formatting shows how to display and format strings. The last section tells the Unicode story, which is closely tied to Python strings. Let's begin!
String basics
A Python string is a sequence of Unicode characters. Unicode itself is not an encoding mechanism but a big map of the world's characters, each bound to an integer (its code point). Python has three string-like types: str (regular text you can see on the screen), bytes (the binary representation of string data in memory), and bytearray, which holds the same kind of data as bytes but is mutable. Python has no separate character type; a single character is simply a string of length one.
Strings belong to a larger family of object types called sequences, which also includes lists and tuples. What does that mean? With sequence types you can access members by position (indexing), slice them, and more. Python strings are also immutable, which means they cannot be changed in place; you can only create a modified copy. We will see how this works later in the chapter. Besides expressions, strings have methods that implement common operations.
Strings can be written in single, double, or triple quotes:
>>> 'This is string', "This is string", """This is string"""
('This is string', 'This is string', 'This is string')
Adjacent string literals without a comma between them are concatenated automatically:
>>> 'this' "is" 'a' 'string'
'thisisastring'
String literals separated by commas produce a tuple, which is another sequence type.
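A small sketch of the difference between the two forms. The variable names (`joined`, `separate`, `long_text`) are illustrative and not part of the text above. Implicit concatenation is also handy for splitting a long literal across lines inside parentheses.

```python
# Adjacent literals are merged into a single str when the code is compiled.
joined = 'this' 'is' 'a' 'string'

# A comma between the literals builds a tuple of separate strings instead.
separate = 'this', 'is', 'a', 'string'

# Inside parentheses the adjacent literals may sit on different lines.
long_text = ('first part, '
             'second part')

print(type(joined).__name__, joined)
print(type(separate).__name__, len(separate))
print(long_text)
```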
You can mix quote styles, as long as the quotes inside the string differ from the ones that delimit it:
>>> a = "This is a 'longer' string"
>>> a
"This is a 'longer' string"
This produces an error because the inner double quotes end the string early:
>>> a = "This is a "longer" string"
  File "<stdin>", line 1
    a = "This is a "longer" string"
                         ^
SyntaxError: invalid syntax
Triple-quoted strings can span multiple lines:
>>> a = """this
... is
... a
... string"""
>>> a
'this\nis\na\nstring'
The echoed value is shown on one line, with the \n escape sequence (covered later in this chapter) marking each line break.
If we call the print() function, the string is displayed with real line breaks:
>>> print(a)
this
is
a
string
Python does not convert between strings and numbers implicitly. If you try to add a string and an integer, you get an error. The + operator means two different things: addition for integers and concatenation for strings. With mixed operands, Python cannot determine which operation you intend:
>>> "string" + 1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can only concatenate str (not "int") to str
Using built-in functions such as int() and str(), you can convert values to the type you need:
>>> "string" + str(1)
'string1'
>>> a = "1"
>>> b = 2
>>> a + b
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can only concatenate str (not "int") to str
>>> int(a) + b
3
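The conversions can be wrapped in small helpers that make the intended operation explicit. The sketch below uses two hypothetical functions, `add_as_numbers` and `add_as_text`, which are not part of Python itself. Note that int() raises ValueError when the text is not a valid integer.

```python
def add_as_numbers(a, b):
    """Convert both values to int and add them."""
    return int(a) + int(b)


def add_as_text(a, b):
    """Convert both values to str and concatenate them."""
    return str(a) + str(b)


print(add_as_numbers("1", 2))  # 3
print(add_as_text("1", 2))     # 12

# int() refuses text that does not look like an integer.
try:
    add_as_numbers("one", 2)
except ValueError as err:
    print("cannot convert:", err)
```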
If you want to know which Unicode code point represents a character, use the ord() function:
>>> ord('a')
97
The chr() function does the opposite:
>>> chr(97)
'a'
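As a short illustration, ord() and chr() can be combined to take a string apart into code points and put it back together. The names `word`, `code_points`, and `restored` are chosen only for this sketch.

```python
word = 'abc'

# Map every character to its code point.
code_points = [ord(ch) for ch in word]
print(code_points)  # [97, 98, 99]

# Map the code points back to characters and rebuild the string.
restored = ''.join(chr(n) for n in code_points)
print(restored == word)  # True
```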
Escape sequences
Escape sequences represent characters that are hard or impossible to type directly. Every escape sequence starts with a backslash. For example, \n creates a new line and \t creates a tab. Each sequence is stored as a single character, even though it takes two or more characters to write it.
The newline escape sequence is the basic example:
>>> print('a\nb')
a
b
Echoing the value at the prompt shows the escape sequence as written, while print() interprets it:
>>> a = 'a\nb'
>>> a  # just echo
'a\nb'
>>> print(a)  # print() interprets the escape sequence
a
b
The tab escape sequence:
>>> print('a\tb')
a       b
Each escape sequence counts as one character, which we can confirm with the len() function. It returns five, even though the literal is written with seven characters:
>>> a = 'a\nb\tc'
>>> len(a)
5
If you check the table, you can see that \0 represents the null character, whose value is binary zero. Unlike in C, it does not terminate the string; it is an ordinary character that happens to have the value zero, often used as a placeholder in binary data. You can read more about binary data in this chapter. For example, when you create a bytes object of a given size without supplying its contents, every byte is populated with 0.
>>> s = 'k\0i\0j'
>>> print(s)
k i j
>>> len(s)
5
So we have the character k, followed by a null character, the character i, another null character, and the character j, which gives a total length of five. How print() displays the null characters depends on the terminal.
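The following sketch makes the one-character-per-escape rule visible by listing the code points of each string. It also shows a zero-filled bytes object; the variable names are illustrative.

```python
s = 'a\nb\tc'
print(len(s))   # 5
print(repr(s))  # 'a\nb\tc'

nul = 'k\0i\0j'
print(len(nul))                 # 5
print([ord(ch) for ch in nul])  # [107, 0, 105, 0, 106]

# A bytes object created from a size is filled with zero bytes.
print(bytes(3))  # b'\x00\x00\x00'
```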
A backslash is not always followed by a single letter. For example, \xhh represents the character whose hexadecimal value is hh:
>>> print('\x41')
A
In the ASCII table, hexadecimal 41 is the character A.
The same is possible in octal notation, which uses a backslash followed by up to three octal digits:
>>> print('\101')
A
Non-printable ASCII characters have no visible glyph, so the terminal shows a placeholder symbol or nothing at all:
>>> print('\01')
Octal 01 is the control character SOH (start of heading), which has no printable representation.
Here is a mix of escape sequences and a null character:
>>> a = '1\t2\n\x003'
>>> a
'1\t2\n\x003'
>>> len(a)
6
The string holds the digit 1, the escape sequence \t, the digit 2, the escape sequence \n, the hexadecimal form of the null character (\x00), and the digit 3. The total length of the string is six.
Notice the last entry in the table, \somechar. When a backslash is followed by a character that does not form a valid escape sequence, Python keeps both the backslash and the character, and the echoed value shows the backslash doubled. Recent Python versions may also emit a warning for such unrecognized sequences. This can be confusing when writing absolute paths:
>>> c = 'd:\sometext.txt'
>>> c
'd:\\sometext.txt'
To avoid surprises, use a raw string or a double backslash.
Consider the case where you want to pass an OS path to a function:
path = 'c:\newfolder\text.txt'
At first glance this looks fine, but it contains the two escape sequences \n and \t, so it will not produce the desired result. There are two ways to avoid this: a raw string or a double backslash (\\).
A raw string, marked with the r prefix, suppresses the interpretation of escape sequences and treats the string as is:
>>> c = r'd:\sometext.txt'
Double backslash:
>>> c = 'd:\\sometext.txt'
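To see the effect on the path from the example, the sketch below compares the plain literal with its raw and double-backslash forms. The names `broken`, `raw`, and `doubled` are illustrative.

```python
broken = 'c:\newfolder\text.txt'     # \n and \t are interpreted as newline and tab
raw = r'c:\newfolder\text.txt'       # raw string: backslashes are kept as written
doubled = 'c:\\newfolder\\text.txt'  # each \\ stands for one literal backslash

print(raw == doubled)                # True
print(len(broken), len(raw))         # 19 21
print('\n' in broken, '\n' in raw)   # True False
```

The raw and doubled forms describe the same string, so the choice between them is a matter of readability.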
String operations
Basic string operations include concatenation and multiplication (repetition). They use the + sign for concatenation and the * sign for multiplication. The operators do not perform arithmetic here; Python uses operator overloading to give them string-specific behavior:
>>> a = 'abc'
>>> b = 'cde'
>>> c = a + b
>>> print(c)
abccde
The + operator merges two strings and creates a new one.
>>> d = 'abc'
>>> id(d)
22926048
>>> d = 'abc' * 4
>>> id(d)
23211208
>>> print(d)
abcabcabcabc
Multiplying a string by x repeats it x times. It is important to know that a new string is created and the existing one is left intact. This follows from string immutability: the original string cannot be modified.
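A minimal sketch of the same point without relying on id() values, which differ from run to run: a second name keeps a reference to the old string, so we can check that it survives the concatenation. The names `original` and `separator` are illustrative.

```python
a = 'abc'
original = a          # second name for the same string object

a += 'def'            # builds a new string and rebinds the name a
print(a)              # abcdef
print(original)       # abc
print(a is original)  # False

# Repetition is handy for building separators or padding.
separator = '-' * 10
print(separator)      # ----------
```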
Indexing
Strings are sequences, which means their elements can be accessed by position. Indexes start at 0. To count from the end, use negative indexes, starting at -1:
>>> a = 'example'
>>> a[0]  # first element
'e'
>>> a[-1]  # last element
'e'
>>> a[len(a) - 3]  # element at index len(a) - 3, the same as a[-3]
'p'
Slicing
Slicing extracts part of a string. You can set a lower and an upper bound. The lower bound is always inclusive and the upper bound is not. The result is always a new string, because strings are immutable.
>>> a
'example'
>>> a[1:3]  # from index 1 (inclusive) to index 3 (not included)
'xa'
>>> a[1:]  # from index 1 to the end
'xample'
>>> a[:]  # all elements
'example'
>>> a[:-1]  # all elements except the last one
'exampl'
There is a third slicing parameter called step:
>>> a[1:-1:2]  # every second element from index 1, excluding the last one
'xml'
>>> a[::2]  # every second element from beginning to end
'eape'
>>> a[::-1]  # the complete string in reverse order
'elpmaxe'
In the last example we reversed the string. With a negative step the bounds run from right to left: a[6:1:-1] returns the elements from index 6 (inclusive) down to index 1 (not inclusive):
>>> a[6:1:-1]
'elpma'
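A few extra slicing properties, sketched on the same 'example' string: the two halves of a split always rebuild the original, and slices accept out-of-range bounds while plain indexing does not.

```python
a = 'example'

print(a[6:1:-1])      # elpma
print(a[:3] + a[3:])  # example

# Slices tolerate bounds that lie outside the string.
print(a[2:100])       # ample
print(repr(a[100:]))  # ''

# Indexing with an out-of-range position raises IndexError.
try:
    a[100]
except IndexError as err:
    print('IndexError:', err)
```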
Strings are immutable
Immutability means that you cannot change the original string; you can only build a completely new one.
>>> a = 'string'
>>> a[0] = 1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
If you want to change the content of a string, you have to build a new one with indexing and slicing:
>>> b = '1' + a[1:]
>>> b
'1tring'
In the example above, we used concatenation to change the first character of the string.
Looking at the memory locations, the two strings do not share the same one, which means they are completely different objects:
>>> id(a)
59931424
>>> id(b)
61166048
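The slicing approach can be generalized to any position. The helper below, `replace_at`, is a hypothetical function written for this sketch; it assumes a non-negative index that lies inside the string.

```python
def replace_at(text, index, new_char):
    """Return a copy of text with the character at index replaced.

    Assumes 0 <= index < len(text); the original string is not modified.
    """
    return text[:index] + new_char + text[index + 1:]


a = 'string'
b = replace_at(a, 0, '1')
print(b)  # 1tring
print(a)  # string
```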
Another way to get changed content is the replace() method. It does not modify the string either; it returns a new string with the replacement applied, and the original value remains:
>>> a = 'string'
>>> id(a)
59931424
>>> a.replace('st', 'ts')  # returns a new, changed string
'tsring'
>>> id(a)  # the id of a stays the same
59931424
>>> a  # the original value is intact
'string'
Python string methods
Expressions and built-in functions are generic, while string methods work only on string objects.
Functions are not tied to an object, whereas a method must be called on an object. A method is a fine-grained function that operates on its object. Methods of mutable objects may alter the object's state, but string methods always return a new value.
The call syntax is object.method(arguments), evaluated left to right. Python first fetches the method from the object and then calls it, passing in both the object and the arguments. A method cannot be run without its subject.
The built-in functions dir() and help() list the methods that belong to the str object:
>>> dir(str)
['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'capitalize', 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isascii', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']
One good example is the replace() method, which returns a copy of the string with the replacement applied:
>>> s = 'this is a string'
>>> s.replace('string', 'text')
'this is a text'
The result is only a temporary value; to keep it, assign it to a variable that stores the changed copy:
>>> a = s.replace('string', 'text')
>>> a
'this is a text'
If you need to make many changes to a string, it is better to convert it to an object that supports in-place changes, such as a list:
>>> a = 'string'
>>> b = list(a)
>>> b
['s', 't', 'r', 'i', 'n', 'g']
>>> b[0] = 'a'
>>> b
['a', 't', 'r', 'i', 'n', 'g']
>>> a = ''.join(b)
>>> a
'atring'
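A short sketch of the list round trip with several edits in a row. The variable names `chars` and `result` are illustrative; the list is changed in place and only the final join() creates a new string.

```python
chars = list('string')  # ['s', 't', 'r', 'i', 'n', 'g']

chars[0] = 'S'          # replace an element in place
chars.append('s')       # add an element at the end
del chars[2]            # remove the element at index 2 ('r')

result = ''.join(chars)
print(result)  # Stings
```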
One interesting method is split(), which splits a string into a list of substrings:
>>> a = '1,2,3'
>>> b = a.split(',')
>>> b
['1', '2', '3']
Membership test
A simple test of whether a substring is part of a string uses the in operator:
>>> a = "string"
>>> 'a' in a
False
>>> 'i' in a
True
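The in operator also works with substrings longer than one character and is case-sensitive. When the position matters and not just the presence, the find() method returns an index, or -1 if nothing is found. A brief sketch:

```python
a = 'string'

print('tri' in a)    # True
print('a' not in a)  # True
print('S' in a)      # False, the comparison is case-sensitive

print(a.find('i'))   # 3
print(a.find('a'))   # -1
```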
Upper and lower
The string methods isupper() and islower() return True if all cased characters in the string are uppercase or lowercase, respectively:
>>> a = "string"
>>> a.isupper()
False
>>> a.islower()
True
>>> a = "String"
>>> a.islower()
False
>>> a.isupper()
False
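Two edge cases are worth a quick sketch: characters without case (such as digits) are ignored by both tests, and a string with no cased characters at all fails both. The related upper() and lower() methods return converted copies.

```python
print('ABC123'.isupper())  # True, digits are ignored
print('123'.isupper())     # False, there are no cased characters
print('123'.islower())     # False

print('String'.upper())    # STRING
print('String'.lower())    # string
```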
Start and end of string
Two useful string methods check whether a string starts or ends with a certain substring.
>>> a = "this is a string"
>>> a.startswith('this')
True
>>> a.startswith('is')
False
>>> a.endswith('is')
False
>>> a.endswith('string')
True
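Both methods also accept a tuple of candidates and an optional start position, which is convenient for checks such as file extensions. The `filename` value below is made up for this sketch.

```python
filename = 'report.csv'

# A tuple matches if any of its items matches.
print(filename.endswith(('.csv', '.txt')))    # True
print(filename.startswith(('data', 'log')))   # False

# The optional second argument moves the starting position.
print('this is a string'.startswith('is', 2))  # True
```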
Split and join
The join() method accepts a list of strings and returns a single string, with the string it is called on placed between the items:
>>> ','.join(['a', 'b', 'c'])
'a,b,c'
>>> 'SPLIT'.join(['a', 'b', 'c'])
'aSPLITbSPLITc'
The split() method works the opposite way. It accepts a string and returns a list of strings:
>>> 'This is a string'.split(' ')
['This', 'is', 'a', 'string']
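The sketch below highlights two details that often surprise: split(' ') keeps empty strings between repeated separators while split() without arguments collapses any run of whitespace, and join() only accepts strings, so other values must be converted first. The sample values are illustrative.

```python
# Build a line with two and three spaces between some of the words.
line = 'This' + ' ' * 2 + 'is' + ' ' * 3 + 'a string'

print(line.split(' '))        # ['This', '', 'is', '', '', 'a', 'string']
print(line.split())           # ['This', 'is', 'a', 'string']
print('a,b,c'.split(',', 1))  # ['a', 'b,c'], at most one split

# join() needs strings, so numbers are converted with str() first.
numbers = [1, 2, 3]
joined = ','.join(str(n) for n in numbers)
print(joined)                 # 1,2,3

# split() and join() with the same separator undo each other.
print(','.join('a,b,c'.split(',')) == 'a,b,c')  # True
```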
Stripping
The methods strip(), rstrip(), and lstrip() strip whitespace from strings:
>>> a = "This is a string "
>>> a.rstrip()
'This is a string'
>>> a = " This is a string"
>>> a.lstrip()
'This is a string'
>>> a = " this is a string "
>>> a.strip()
'this is a string'
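Whitespace here includes tabs and newlines, not only spaces. The methods also accept an optional argument naming the set of characters to remove instead of whitespace. A short sketch with an illustrative `padded` value:

```python
padded = '\t this is a string \n'

print(repr(padded.strip()))   # 'this is a string'
print(repr(padded.lstrip()))  # 'this is a string \n'
print(repr(padded.rstrip()))  # '\t this is a string'

# With an argument, any of the listed characters is stripped from both ends.
print('www.example.com'.strip('cmowz.'))  # example
print('xxhixx'.strip('x'))                # hi
```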
Python print format
String formatting performs multiple type-specific substitutions on a string in a single step. It is never strictly required, but it can be convenient, especially when formatting text to be displayed to a program's users. Format strings contain curly braces {} as placeholders, or replacement fields, which get replaced with values.
String formatting method calls have the form '...{}...'.format(values):
>>> order = "{0},{1}".format('one', 'two')
>>> print(order)
one,two
>>> order = "{1},{0}".format('one', 'two')
>>> print(order)
two,one
Also called "formatted string literals", f-strings are string literals that have an f at the beginning and curly braces containing expressions that will be replaced with their values.
>>> number = 1
>>> print(f"the number is {number}")
the number is 1
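Replacement fields can also carry a format specification after a colon, which controls precision, padding, and alignment. The sketch below assumes Python 3.6 or later for the f-strings; the values `name`, `price`, and `count` are made up for illustration.

```python
name, price, count = 'tea', 3.14159, 42

# Positional and keyword fields with str.format().
print('{} costs {:.2f}'.format(name, price))  # tea costs 3.14
print('{n} x {c}'.format(n=name, c=count))    # tea x 42

# The same format specifications work inside f-strings.
print(f'{count:05d}')   # 00042, zero-padded to width 5
print(f'[{name:>6}]')   # right-aligned in a field of width 6
print(f'{count * 2}')   # 84, any expression is allowed
```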
Strings and Unicode
To explain what Unicode is, we first have to explain what an encoding is. Computers cannot store characters in memory, only binary data, so some kind of translation is needed: the computer has to know how to store, for example, the letter a. The first widely adopted encoding standard was ASCII, which defines 128 characters (code points 0 to 127). Today there are several popular encodings, such as UTF-8, UTF-16, and UTF-32. ASCII takes 1 byte per character, UTF-8 takes 1 to 4 bytes, UTF-16 takes 2 or 4 bytes, and UTF-32 always takes 4 bytes. ASCII is a subset of UTF-8: the first 128 characters are encoded identically.
ASCII was not enough because there are so many characters; Unicode has room for 1,114,112 code points. Unicode is not an encoding. It maps each character to an integer code point, and you have to choose one of the encoding standards to turn code points into bytes. A character can take 1, 2, 3, or 4 bytes, depending on the encoding selected.
Python strings are sequences of Unicode code points.
Let's check this with some examples.
If you want to check which number represents a character, use the ord() function:
>>> ord('b')
98
The other way around, if you want to know which character the number 98 represents, use the chr() function:
>>> chr(98)
'b'
This example shows how different encodings (ASCII, UTF-16, UTF-32) store the same string. ASCII stores each character in one byte, UTF-16 takes two bytes per character here, and UTF-32 takes four bytes. The UTF-16 and UTF-32 results also start with a byte order mark (b'\xff\xfe' and b'\xff\xfe\x00\x00' on a little-endian machine):
>>> d = 'abc'
>>> d.encode('ascii'), d.encode('utf16'), d.encode('utf32')
(b'abc', b'\xff\xfea\x00b\x00c\x00', b'\xff\xfe\x00\x00a\x00\x00\x00b\x00\x00\x00c\x00\x00\x00')
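To see how the byte count depends on both the character and the encoding, the sketch below encodes four sample characters. The `samples` dictionary is made up for this illustration, and the little-endian codec names ('utf-16-le', 'utf-32-le') are used so that no byte order mark is added to the counts.

```python
samples = {
    'a': 'a',                # U+0061, part of ASCII
    'e-acute': '\u00e9',     # U+00E9
    'euro': '\u20ac',        # U+20AC
    'emoji': '\U0001F600',   # U+1F600, outside the 16-bit range
}

for label, ch in samples.items():
    sizes = {enc: len(ch.encode(enc))
             for enc in ('utf-8', 'utf-16-le', 'utf-32-le')}
    print(label, hex(ord(ch)), sizes)

# ASCII cannot represent characters outside its 128 code points.
try:
    '\u00e9'.encode('ascii')
except UnicodeEncodeError as err:
    print('not ASCII:', err.reason)

# decode() reverses encode() when the same encoding is used.
print('abc'.encode('utf-8').decode('utf-8'))  # abc
```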