Week 3: Notes

ASCII and Unicode

Computers store text using a coded character set, which assigns a unique number called a code point to each character. Two character sets are used in virtually all software systems today.

First, the ASCII character set includes only 128 characters; its code points range from 0 to 127. For example, in ASCII the character 'A' has the number 65, and 'B' has the number 66. 'a' has the number 97. (Code points are often written in hexadecimal; then e.g. the code for 'A' is 65₁₀ = 41₁₆, and the code for 'a' is 97₁₀ = 61₁₆.)

ASCII includes all the characters you see on a standard English-language keyboard: the uppercase and lowercase letters A-Z/a-z of the Latin alphabet, the digits 0-9, and various punctuation marks such as $, % and &. ASCII does not include accented characters such as č or ř.

You can see a table of all ASCII characters at asciitable.com.

Note that ASCII includes various whitespace characters, which are not visible on the printed page. We will encounter the following ones from time to time:

A tab character (ASCII code 9) moves the output position to the next tab stop.
A newline character (ASCII code 10) moves to the next line. In Python, the escape sequence '\n' represents this character. In text files on Linux and macOS, each line ends with an instance of this character.
A carriage return (ASCII code 13) moves to the beginning of the line. In Python, the escape sequence '\r' represents this character. In text files on Windows, each line ends with a carriage return (13) followed by a newline (10), i.e. with the sequence '\r\n'.
A space (ASCII code 32) is used throughout text to separate words.

As indicated above, the representation of newlines is unfortunately platform-specific, for historical reasons. Fortunately, in Python we don't usually need to be concerned with these differences, because when Python reads a text file it maps any newline sequence to '\n'. Similarly, when it writes a text file it maps '\n' to the appropriate newline sequence for the operating system it's running on, e.g. '\r\n' on Windows.
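We can observe the newline mapping described above with a small experiment. This is only a sketch: it uses open(), the 'with' statement and binary mode, none of which have been covered in these notes yet, and the scratch file name demo.txt is made up for the illustration.

```python
import os
import tempfile

# Create a scratch file containing Windows-style line endings.
# Binary mode ('wb') writes the bytes exactly as given, with no translation.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')   # hypothetical file name
with open(path, 'wb') as f:
    f.write(b'one\r\ntwo\r\n')

# Read the file back in text mode, which is the default for open().
with open(path) as f:
    lines = f.readlines()

print(lines)    # each '\r\n' in the file arrives as a single '\n'
```

The same result appears on Linux, macOS and Windows, since the mapping happens whenever a file is read in text mode.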

The newer Unicode character set extends ASCII to cover the characters of virtually all the world's written languages, including accented characters and also ideographic characters in Asian languages such as 日. Code points in Unicode range from 0 to 1,114,111.

The site unicode-table.com has a large table showing all the Unicode characters that exist.

Like most modern languages, Python is fully compatible with Unicode. You can write

s = 'Řehoř'

s = '人'

and these strings will work just like strings of ASCII characters.
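As a quick illustration, here is a sketch that applies a few familiar operations to these strings. The only assumption is that the accented letters are typed as single characters, which is what a normal keyboard produces.

```python
s = 'Řehoř'
t = '人'

print(len(s))         # len() counts characters, so accented letters count once
print(s[0], s[-1])    # indexing works on non-ASCII characters too
print(len(t))
print(s + ' ' + t)    # and so does concatenation
```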

chr() and ord()

Python includes two functions that can map between characters and their integer code points.

Given a Unicode character c, ord(c) returns its code point. For example, ord('A') is 65, and ord('B') is 66. ord('ř') is 345 (a value outside the ASCII range).

chr() works inversely: it maps a code point to a character. For example, chr(65) is 'A', and chr(345) is 'ř'.
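The following short sketch checks that the two functions undo each other for a few sample characters. It also prints each code point in hexadecimal using the built-in hex() function, and shows sys.maxunicode, which holds the largest code point that Python supports.

```python
import sys

samples = ['A', 'a', 'ř', '日']    # sample characters chosen for illustration

for c in samples:
    code = ord(c)
    # chr(code) should give back the character we started with
    print(c, code, hex(code), chr(code) == c)

print(sys.maxunicode)    # the upper end of the Unicode range
```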

These functions are sometimes useful when we wish to manipulate characters. For example, here's a program that reads a lowercase letter, and prints the next letter in the alphabet:

c = input('enter letter: ')
i = ord(c) - ord('a')
if 0 <= i < 26:
    i = (i + 1) % 26
    print('next letter is', chr(ord('a') + i))
else:
    print('not a lowercase letter')

The program uses ord() to convert a character (such as 'd') to a number (such as 3) representing its position in the lowercase alphabet. It then adds 1 (mod 26), and uses chr() to map the result back into a lowercase letter.
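The same idea extends from a single letter to a whole word. Here is a minimal sketch that shifts every lowercase letter forward by one position. The sample word is fixed in the code rather than read with input(), and any other character is passed through unchanged (a design choice made for this example).

```python
word = 'pizza'       # sample input, chosen for illustration
shifted = ''

for c in word:
    i = ord(c) - ord('a')                     # position in the alphabet, 0 .. 25
    if 0 <= i < 26:
        c = chr(ord('a') + (i + 1) % 26)      # 'z' wraps around to 'a'
    shifted += c                              # non-letters are kept as they are

print(shifted)
```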

strings, continued

A string in Python is a sequence of Unicode characters. We previously saw that the len() function will give us the length of a string. We also saw how to retrieve characters from a string by index. Furthermore, we learned that we can use the operator '+' to concatenate strings, and '*' to repeat a string:

>>> s = 'watermelon'
>>> len(s)
10
>>> s[0]
'w'
>>> s[-1]
'n'
>>> s + ' pie'
'watermelon pie'
>>> s * 3
'watermelonwatermelonwatermelon'

Additionally, a built-in operator called 'in' works on strings. When s and t are strings, (s in t) checks whether s is a substring of t:

>>> 'water' in 'watermelon'
True
>>> 'melon' in 'watermelon'
True
>>> 'xyz' in 'watermelon'
False

string slicing

Going further, the syntax s[i : j] returns a slice (i.e. substring) of elements from s[i] up to (but not including) s[j]. Either i or j may be negative to index from the end of the sequence:

>>> s = 'watermelon'
>>> s[0:5]
'water'
>>> s[2:5]
'ter'
>>> s[5:-1]
'melo'
>>> s[-3:-1]
'lo'

In s[i : j], if the start index i is omitted, it is 0, i.e. the beginning of the string. If the end index j is omitted, it is len(s), i.e. the end of the string:

>>> s[:5]
'water'
>>> s[5:]
'melon'
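One consequence of these defaults is that s[:i] and s[i:] always split a string into two pieces that fit back together. The sketch below checks this for every possible i. It also shows that a slice bound past the end of the string is quietly clipped, unlike an out-of-range index.

```python
s = 'watermelon'

all_match = True
for i in range(len(s) + 1):
    if s[:i] + s[i:] != s:      # the two halves should rebuild s
        all_match = False

print(all_match)
print(s[:3], s[3:])
print(s[5:100])                 # the end bound 100 is clipped to len(s)
```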

The syntax s[i : j : k] will extract a slice of characters in which the index advances by k at each step. For example, if we use k = 2 then we will retrieve every second character:

>>> s[0:8:2]
'wtre'

Note that the step value can even be negative:

>>> s[6:2:-1]
'emre'

If the step value is negative, then an empty start index refers to the end of the string, and an empty end index refers to the beginning:

>>> s[:4:-1]
'nolem'
>>> s[4::-1]
'retaw'

You can reverse a string by specifying a step value of -1 and providing neither a start nor an end index:

>>> s[::-1]
'nolemretaw'
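Reversal gives us a compact way to test whether a word is a palindrome, i.e. reads the same in both directions. The words below are just sample values.

```python
word = 'racecar'
is_palindrome = (word == word[::-1])    # compare the word with its reverse
print(word, is_palindrome)

other = 'watermelon'
print(other, other == other[::-1])
```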

functions and methods

Python includes both functions and methods in its standard library.

A function takes zero or more arguments and optionally returns a value. Some of Python's built-in functions that we've already seen in this course are len(), chr(), ord(), input() and print(). To call a function, we simply write its name followed by the arguments in parentheses:

name = input('Enter your name: ')

A method is like a function, but is invoked on a particular object. For example:

s = 'yoyo'
b = s.startswith('yo')  # method call

In the second line above, we are invoking (or calling) the startswith() method on the string s. We pass the string 'yo' to the method. The method returns a value, which is True in this case since the string 'yoyo' does start with 'yo'.
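To make the difference in syntax concrete, here is a short sketch that places a function call next to several method calls on the same string. The methods endswith(), count() and upper() are standard string methods that these notes have not introduced yet.

```python
s = 'yoyo'

print(len(s))                # function call: the string is passed as an argument
print(s.startswith('yo'))    # method call: invoked on the string s
print(s.endswith('ya'))      # another method that returns a boolean
print(s.count('yo'))         # counts non-overlapping occurrences of 'yo'
print('yoyo'.upper())        # a method can be invoked on a literal as well
```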

Above, we said that a method is invoked on an object. In Python any value is an object, so (for example) 3, False, and 'yoyo' are all objects. (In some other languages, there is a technical distinction between objects and other kinds of values.) Methods (and a related feature, classes, which we'll discuss later) are fundamental building blocks in object-oriented programming. A language that has methods and classes, such as Python, C++, Java, or C#, is called object-oriented.

Not all programming languages have both functions and methods. For example, C has only functions, and classic Java has only methods. Python is a bit of a hybrid since it has both functions and methods. This arguably makes the language more flexible and convenient (at the cost of some complexity).

In this course we will soon learn how to write our own functions, and before too long we'll learn how to write our own methods (and classes) as well.

more string methods

Python's library contains many more useful methods on strings. For example, .lower() converts all characters in a string to lowercase:

>>> s = 'YUMMY PIE'
>>> s.lower()
'yummy pie'
>>> s
'YUMMY PIE'

Notice that the call to lower() above returned a new string that was like s, but in which all characters are lowercase. It did not modify the original string s, which still contains uppercase characters. In fact, it could not possibly modify s, since Python strings are immutable. Many string methods are similar to lower() in that they return a new string derived by modifying a given string in some way.
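The sketch below shows immutability in action. It uses try/except, which has not been covered yet, only so that the example keeps running after the error. To "change" a string we instead assign the new string back to the same variable.

```python
s = 'YUMMY PIE'

try:
    s[0] = 'y'               # attempting to modify a character in place
except TypeError as e:
    print('error:', e)       # strings do not support item assignment

s = s.lower()                # rebind s to the new string returned by lower()
print(s)
```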

Our quick reference lists more string methods and operators. Note that strings are iterable, since you can loop over them with 'for'. They are also sequences, since you can access string elements using the syntax s[i]. Soon we'll see other kinds of iterables and sequences. (In Python every sequence is iterable, but some iterables such as sys.stdin are not sequences.) So if you're looking in the quick reference for operations that work on strings, you can find them in three places: in the section on iterables, in the section on sequences, and in the section specifically about strings.
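Here is a small sketch of the distinction. Since reading sys.stdin would make the example wait for input, it uses an io.StringIO object as a stand-in: like sys.stdin, it can be looped over line by line but cannot be indexed. The try/except statement is used only to catch the resulting error.

```python
import io

s = 'pie'
for c in s:          # a string is iterable: we can loop over it
    print(c)
print(s[1])          # a string is also a sequence: we can index it

stream = io.StringIO('first\nsecond\n')    # stand-in for sys.stdin
lines = []
for line in stream:  # the stream is iterable...
    lines.append(line)

indexable = True
try:
    stream[0]        # ...but it is not a sequence, so indexing fails
except TypeError:
    indexable = False

print(lines, indexable)
```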

reading lines, revisited

Earlier, we saw that we can loop over sys.stdin to read lines from a file. You should be aware that when you do this, each string you receive will end with a newline character. Consider this program print.py, which reads all lines from standard input and copies them to standard output:

import sys

for line in sys.stdin:
    print(line)

Suppose that we have a text file story.txt with three lines:

the beginning
the middle
the end

Let's run the program above and redirect its input from this file:

$ python print.py < story.txt
the beginning

the middle

the end

$

Notice the extra blank lines after each output line. As mentioned above, each string generated by the 'for' loop will end with a newline character. For example, the first line read from the file will be 'the beginning\n'. (As we saw in an earlier section, on Windows the file will actually contain '\r\n' at the end of the line, but Python will convert this sequence to '\n'.) When we invoke print() on this string, it prints the newline in the string, and then prints a second newline because print() normally prints a newline after any output string you give it.
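We can make the hidden newline visible with the built-in repr() function. The sketch below uses an io.StringIO object in place of sys.stdin so that it runs without any input file. It also shows one way to avoid the doubled newline: passing end='' tells print() not to add a newline of its own.

```python
import io

text = 'the beginning\nthe middle\nthe end\n'
stream = io.StringIO(text)      # stands in for sys.stdin in this sketch

received = []
for line in stream:
    received.append(line)
    print(repr(line))           # repr() displays the trailing '\n' explicitly

for line in received:
    print(line, end='')         # print only the newline already in the string
```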

If we don't want the extra lines, we can call the strip() method to remove the newlines returned by 'for'. strip() removes all whitespace at the beginning and end of a string. Whitespace includes characters such as spaces, tabs and newlines:

>>> '   one   two   three   '.strip()
'one   two   three'
>>> 'down the street\n'.strip()
'down the street'

Let's modify the program print.py above so that it strips each line read from standard input:

import sys

for line in sys.stdin:
    line = line.strip()
    print(line)

Now it won't print extra blank lines:

$ python print.py < story.txt
the beginning
the middle
the end
$

Alternatively, if we want to remove only the newline character at the end of the line but leave all other whitespace intact, then instead of calling strip() we could write

line = line[:-1]
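One caveat: line[:-1] always drops the last character, whatever it is. If the final line of a file has no trailing newline, a real character will be lost. A slightly safer alternative, sketched here with made-up sample lines, is rstrip('\n'), a standard string method that removes only trailing newline characters.

```python
last = 'the end'                       # a final line with no trailing newline
print(repr(last[:-1]))                 # slicing removes the 'd'
print(repr(last.rstrip('\n')))         # rstrip('\n') leaves the line alone

indented = '  the middle  \n'
print(repr(indented.rstrip('\n')))     # only the newline is removed
print(repr(indented.strip()))          # strip() removes the spaces as well
```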

lists

Lists are a fundamental type in Python. We can make a list by specifying a series of values surrounded by square brackets:

l = [3, 5, 9, 11, 15]

A list may contain values of various types:

l = ['horse', 789, False, -22.3]

It may contain any number of values, or may even be empty:

l = []

The len function returns the number of elements in a list:

len(['potato', 'tomato', 'tornado'])    # returns 3

We can access elements of a list by index. The first element has index 0, and the last element has index len(l) - 1:

>>> l = [3, 5, 9, 11, 15]
>>> l[0]
3
>>> l[4]
15

Just like with strings, we can use negative indices to count from the end of the list:

>>> l = [3, 5, 9, 11, 15]
>>> l[-1]
15
>>> l[-2]
11

Slice syntax works with lists, just like with strings:

>>> l = [3, 5, 9, 11, 15]
>>> l[1:3]
[5, 9]
>>> l[3:]
[11, 15]

The 'in' operator tests whether a list contains a given value:

>>> 77 in [2, 8, 77, 3, 1]
True

Note that this is a bit different from 'in' on strings. The 'in' operator does not test whether a sublist is present in a list:

>>> [8, 77] in [2, 8, 77, 3, 1]
False
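If we do want to know whether one list occurs as a contiguous run inside another, we can compare slices ourselves. The loop below is one possible sketch, not a built-in operation, and the variable names are chosen for this example.

```python
big = [2, 8, 77, 3, 1]
small = [8, 77]

found = False
for i in range(len(big) - len(small) + 1):
    if big[i : i + len(small)] == small:    # compare a window of big with small
        found = True

print(found)

# 'in' returns True only when the list itself is one of the elements:
print([8, 77] in [2, [8, 77], 3])
```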

Unlike strings, lists in Python are mutable. We can set values by index:

>>> l = [3, 5, 9, 11, 15]
>>> l[0] = 77
>>> l[3] = 99
>>> l
[77, 5, 9, 99, 15]

more list operations

A list's length may change over time. The append() method adds a single element to a list:

>>> l = [3, 5, 9, 11, 15]
>>> l.append(20)
>>> l.append(30)
>>> l
[3, 5, 9, 11, 15, 20, 30]

We'll often use append() to build up a list in a loop. For example, we can build a list of the squares of all numbers from 1 to 10:

l = []
for i in range(1, 11):      # 1 .. 10
    l.append(i * i)

The extend() method adds a series of elements to a list. The += operator is a synonym for extend():

>>> l = [2, 4, 6]
>>> l.extend([8, 10])
>>> l
[2, 4, 6, 8, 10]
>>> l += [12, 14]
>>> l
[2, 4, 6, 8, 10, 12, 14]
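A common mistake is to call append() when extend() was intended. The sketch below contrasts the two: append() adds its argument as a single element, even when that argument is itself a list.

```python
a = [2, 4, 6]
a.append([8, 10])     # adds ONE element, which happens to be a list
print(a, len(a))

b = [2, 4, 6]
b.extend([8, 10])     # adds each element of the argument separately
print(b, len(b))

c = ['x']
c.extend('yz')        # extend() accepts any iterable, including a string
print(c)
```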

The insert() method inserts an element into a list at a given position:

>>> l = [3, 5, 9, 11, 15]
>>> l.insert(2, 88)
>>> l
[3, 5, 88, 9, 11, 15]

The del operator can delete one or more elements of a list by index:

>>> l = ['orange', 'apple', 'pear', 'banana', 'kiwi', 'grape']
>>> del l[1]
>>> l
['orange', 'pear', 'banana', 'kiwi', 'grape']
>>> del l[2:4]
>>> l
['orange', 'pear', 'grape']

We may even assign to a slice in a list, replacing that slice with an arbitrary sequence of values:

>>> l = [2, 4, 6, 8, 10]
>>> l[1:3]
[4, 6]
>>> l[1:3] = [100, 200, 300]
>>> l
[2, 100, 200, 300, 8, 10]
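Because the replacement may be shorter or longer than the slice it replaces, slice assignment can also delete or insert elements. A brief sketch:

```python
l = [2, 4, 6, 8, 10]

l[1:3] = []           # replacing a slice with an empty list deletes it
print(l)

l[1:1] = [5, 6]       # assigning to an empty slice inserts at that position
print(l)
```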

The list() function converts any iterable (and hence any sequence) to a list:

>>> list('watermelon')
['w', 'a', 't', 'e', 'r', 'm', 'e', 'l', 'o', 'n']
>>> list(range(120, 130))
[120, 121, 122, 123, 124, 125, 126, 127, 128, 129]

Like strings, lists are iterable, so you can loop over them using 'for'. Lists are also sequences, since you can access their elements using the syntax l[i]. So you can find list operations in three sections in our quick reference guide, namely the sections about iterables, sequences, and specifically about lists.

lists are arrays

A Python list is actually an array, meaning a sequence of elements stored in contiguous memory locations. In fact it is a dynamic array, since it can expand over time. (In many programming languages, arrays have a fixed size.)

For this reason, we can retrieve or update any element of a list by index extremely quickly, in constant time. Even if a list l has 1,000,000,000 elements, accessing e.g. l[927_774_282] will be extremely fast, just as fast as accessing the first element of a short list.
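We can get a rough feel for this with the standard timeit module. The sketch below is an informal experiment, not a rigorous benchmark: the list sizes and the repetition count are arbitrary, the 'lambda' syntax has not been covered yet, and the exact timings will differ from machine to machine.

```python
import timeit

small = list(range(10))
big = list(range(1_000_000))

# Time 100,000 index lookups in each list.
t_small = timeit.timeit(lambda: small[5], number=100_000)
t_big = timeit.timeit(lambda: big[927_774], number=100_000)

print('small list:', t_small, 'seconds')
print('big list:  ', t_big, 'seconds')
```

The two timings should be of the same order of magnitude, even though one list is 100,000 times longer than the other.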

In our algorithms class, we will usually use the term "array" to describe this kind of data structure.

structural and reference equality

Suppose that we write the following assignments:

>>> l = [3, 5, 7, 9]
>>> m = l

Now the variables l and m refer to the same list. If we change l[0], then the change will be visible in m:

>>> l[0] = 33
>>> m[0]
33

This works because in Python every variable is a pointer to an object, so two variables can point to the same object, such as the list above. An assignment "m = l" does not copy a list. It runs in constant time, and is extremely fast.
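The sharing applies to every kind of modification, not only assignment by index. In this sketch we append through one variable and see the new element through the other:

```python
l = [3, 5, 7, 9]
m = l                 # m now refers to the same list object as l

m.append(11)          # modify the list through m
print(l)              # the new element is visible through l as well
print(len(l), len(m))
```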

Alternatively, we may make a copy of the list l. There are several possible ways to do that, all with the same effect:

>>> l = [3, 5, 7, 9]
>>> n = l.copy()      # technique 1: call the copy() method
>>> n = list(l)       # technique 2: call the list() function
>>> n = l[:]          # technique 3: use slice syntax

Now the list n has the same values as l, but it is a different list. Changes in one list will not be visible in the other:

>>> l[1] = 575
>>> l
[3, 575, 7, 9]
>>> n
[3, 5, 7, 9]
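One point worth knowing is that all three techniques copy only the outer list. If the elements are themselves lists, the copy shares those inner lists with the original. The nested list below is a made-up example illustrating this:

```python
grid = [[1, 2], [3, 4]]
dup = grid.copy()         # copies the outer list only

dup.append([5, 6])        # changes the outer list of dup: grid is unaffected
dup[0][0] = 99            # changes an inner list, which both lists share

print(grid)
print(dup)
```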

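Python provides two different operators for testing equality. To compare them, suppose for example that x and y refer to the same list, and that z is a copy of it:

>>> x = [3, 5, 7, 9]
>>> y = x
>>> z = x.copy()

The first operator is ==:

>>> x == y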
True
>>> x == z
True

This operator tests for structural equality. In other words, given two lists, it compares them element by element to see if they are equal. (It will even descend into sublists to compare elements there as well.)

The second equality operator is the 'is' operator:

>>> x is y
True
>>> x is z
False

This operator tests for reference equality. In other words, it returns True only if its arguments actually refer to the same object. (Reference equality is also called physical equality.)

You may want to use each of these operators in various situations. Note that 'is' returns instantly (it runs in constant time), whereas == may traverse a list in its entirety, so it may be significantly slower.
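The following sketch contrasts the two operators on nested lists, with sample values chosen for illustration. Two separately built lists with the same contents are structurally equal but are not the same object, and modifying one of them ends the structural equality.

```python
a = [[1, 2], [3]]
b = [[1, 2], [3]]     # built separately, with the same contents
c = a                 # a second name for the same object as a

print(a == b)         # == descends into the sublists and finds them equal
print(a is b)         # but a and b are two different objects
print(a is c)

b[0][1] = 200         # change an element inside a sublist of b
print(a == b)         # the lists are no longer structurally equal
```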