Week 7: Notes

'else' clauses in 'while' and 'for' statements

Python allows a 'while' or 'for' statement to contain an 'else' clause, which will execute if the loop does not exit via a 'break' statement. This is an unusual feature among programming languages, but can sometimes be convenient. For example, consider a program that checks whether a number is prime:

n = int(input())

i = 2
prime = True
while i * i <= n:
    if n % i == 0:
        prime = False
        break
    i += 1

if prime:
    print('prime')
else:
    print('composite')

We may rewrite this program without the boolean variable 'prime', by adding an 'else' clause to the 'while' statement:

n = int(input())

i = 2
while i * i <= n:
    if n % i == 0:
        print('composite')
        break
    i += 1
else:
    print('prime')
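
The same clause works on a 'for' statement. As a minimal sketch (the function name first_negative is made up for illustration), the 'else' branch below runs only when the loop finishes without reaching 'break':

```python
def first_negative(values):
    """Describe the first negative value in a list, if there is one."""
    for v in values:
        if v < 0:
            result = f'found {v}'
            break
    else:
        # Runs only if the loop ended without executing 'break'.
        result = 'no negative values'
    return result

print(first_negative([3, 8, -2, 5]))   # found -2
print(first_negative([3, 8, 5]))       # no negative values
```

Note that the 'else' branch also runs when the list is empty, since in that case the loop never reaches 'break'.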

command-line arguments

When we run a program from the command line, we may specify command-line arguments. For example:

$ python prog.py hello one two three

In a Python program, sys.argv holds a list of all command-line arguments. Actually the first element in the list is the name of the program itself, and subsequent elements hold the arguments. For example, suppose that prog.py holds this program:

import sys

print(sys.argv)

Let's run it:

$ python3 prog.py hello one two three
['prog.py', 'hello', 'one', 'two', 'three']
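
Every element of sys.argv is a string, even if the user typed a number, so numeric arguments must be converted explicitly. Here is a small sketch that uses a hand-written list as a stand-in for sys.argv, so that it runs anywhere (the program name sum.py is hypothetical):

```python
# Stand-in for sys.argv after running: python3 sum.py 3 4 10
argv = ['sum.py', '3', '4', '10']

# Skip argv[0] (the program name) and convert the remaining strings to integers.
numbers = [int(a) for a in argv[1:]]
total = sum(numbers)
print(total)    # 17
```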

As an example, let's write a program that imitates the Unix utility 'wc' ("word count"), which prints out the number of lines, words, and characters in a file:

import sys

if len(sys.argv) != 2:
    print(f'usage: {sys.argv[0]} <filename>')
    exit()

filename = sys.argv[1]
chars = words = lines = 0

with open(filename) as f:
    for line in f:
        chars += len(line)
        words += len(line.split())
        lines += 1

print(f'{lines} lines, {words} words, {chars} chars')

Notice that the program prints out a usage message unless we invoke it with exactly one argument, for example if we give no arguments at all. This is a good practice:

$ py wc.py
usage: wc.py <filename>

Let's use the program to count the lines, words and characters in its own source code:

$ py wc.py wc.py
16 lines, 43 words, 324 chars

This matches the counts printed by the Unix wc utility itself:

$ wc wc.py
 16  43 324 wc.py

Command-line programs will often accept arguments with names beginning with a hyphen, such as -d or -e. Typically these denote options that affect a program's behavior.

As another example, let's implement a simple variant of the Unix utility 'grep', which searches for a string in a given file, and prints out all lines containing the string. Our program will accept two options:

-i: search case-insensitively, i.e. "horse" will match "HoRsE"
-v: invert matches, i.e. print out lines that do not contain the given string

Here's the program:

import sys

def usage():
    print(f'usage: {sys.argv[0]} [-i] [-v] <string> <filename>')
    exit()

insensitive = invert = False

i = 1
while i < len(sys.argv):
    arg = sys.argv[i]
    if arg.startswith('-'):
        if arg == '-i':
            insensitive = True
        elif arg == '-v':
            invert = True
        else:
            usage()
    else:
        break    # end of options
    i += 1

if len(sys.argv) - i < 2:
    usage()

find = sys.argv[i]
if insensitive:
    find = find.lower()
filename = sys.argv[i + 1]

with open(filename) as f:
    for line in f:
        line = line.rstrip()
        # Compare against a lowercased copy so that the line is printed unchanged.
        text = line.lower() if insensitive else line
        match = find in text
        if match != invert:    # print matches, or non-matches if -v was given
            print(line)

Let's use it to find all lines containing "sys" in the program's own source code:

$ py grep.py -i sys grep.py
import sys
    print(f'usage: {sys.argv[0]} [-i] [-v] <string> <filename>')
while i < len(sys.argv):
    arg = sys.argv[i]
if len(sys.argv) - i < 2:
find = sys.argv[i]
filename = sys.argv[i + 1]
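
To try out the effect of the two options without creating a file, the matching rule can be factored into a function that works on a list of strings. This is only a sketch: the name grep_lines and the sample lines are invented for illustration, and the function returns the selected lines in their original form.

```python
def grep_lines(lines, find, insensitive=False, invert=False):
    """Return the lines selected by the -i / -v style options."""
    if insensitive:
        find = find.lower()
    result = []
    for line in lines:
        # Compare against a lowercased copy, but keep the original line.
        text = line.lower() if insensitive else line
        match = find in text
        if match != invert:
            result.append(line)
    return result

lines = ['a HoRsE ran', 'a cow slept', 'horse sense']
print(grep_lines(lines, 'horse'))                    # ['horse sense']
print(grep_lines(lines, 'horse', insensitive=True))  # ['a HoRsE ran', 'horse sense']
print(grep_lines(lines, 'horse', invert=True))       # ['a HoRsE ran', 'a cow slept']
```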

magic methods for operator overloading

Python's magic methods let us specify custom behavior for our classes. We have already seen two magic methods in Python: __init__ for initializing an object, and __repr__ for converting an object to a string.

Additional magic methods let us customize the behavior of the +, -, *, /, and // operators:

__add__
__sub__
__mul__
__truediv__
__floordiv__

Other similar magic methods exist; you can read about them in the official Python reference. But the ones listed here are enough for now.

By the way, providing custom behavior for operators such as these is called operator overloading, and is possible in many, but not all, programming languages. (For example, Java has no operator overloading.)

To see how these magic methods may be used, let's revisit the Vec class that we wrote in the last lecture, representing a vector of arbitrary dimension:

class Vec:
    def __init__(self, *args):
        self.a = args

    # Generate a string representation such as [3 5 10].
    def __repr__(self):
        w = []
        for x in self.a:
            w.append(str(x))
        return '[' + ' '.join(w) + ']'

    def add(self, w):
        assert len(self.a) == len(w.a), 'vectors must have same dimension'
        sum = []

        for i in range(len(self.a)):
            sum.append(self.a[i] + w.a[i])

        return Vec(*sum)

Previously, we could add vectors using the add() method:

$ py -i vector.py 
>>> v = Vec(2.0, 4.0, 5.0)
>>> w = Vec(1.0, 2.0, 3.0)
>>> z = v.add(w)
>>> z
[3.0 6.0 8.0]

To allow vectors to be added using the + operator, we only need to change the name of the "add" method above to "__add__":

def __add__(self, w):
    assert len(self.a) == len(w.a), 'vectors must have same dimension'
    sum = []
    ...

Now we can add Vec objects using +:

$ py -i vector.py 
>>> v = Vec(2.0, 4.0, 5.0)
>>> w = Vec(1.0, 2.0, 3.0)
>>> z = v + w
>>> z
[3.0 6.0 8.0]

Behind the scenes, the '+' operator just calls the magic method, and in fact we can call it directly if we like:

>>> v.__add__(w)
[3.0 6.0 8.0]
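
The other arithmetic magic methods follow the same pattern. As an illustration, here is a sketch of __sub__ on a cut-down copy of the Vec class. It keeps only the methods needed to run the example, and it uses zip() and comprehensions rather than index loops, which is a stylistic choice of this sketch and not part of the class from the lecture:

```python
class Vec:
    def __init__(self, *args):
        self.a = args

    def __repr__(self):
        return '[' + ' '.join(str(x) for x in self.a) + ']'

    # Called for the expression self - w.
    def __sub__(self, w):
        assert len(self.a) == len(w.a), 'vectors must have same dimension'
        return Vec(*[x - y for x, y in zip(self.a, w.a)])

v = Vec(2.0, 4.0, 5.0)
w = Vec(1.0, 2.0, 3.0)
print(v - w)    # [1.0 2.0 2.0]
```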

Let's now implement the multiplication operator '*'. Actually there are two kinds of multiplication on vectors. We can multiply two vectors to compute their dot product, producing a scalar. Or we can multiply a vector by a scalar to produce another vector. We will implement both of these in a single method:

def __mul__(self, x):
    if isinstance(x, Vec):
        # Compute the dot product.
        s = 0
        for i in range(len(self.a)):
            s += self.a[i] * x.a[i]
        return s
    else:   # scalar multiplication
        b = []
        for i in range(len(self.a)):
            b.append(x * self.a[i])
        return Vec(*b)

Above, we've used the built-in function isinstance() that tests whether an object belongs to a given type.

Our multiplication operator works for both vector and scalar multiplication:

>>> v = Vec(2, 4, 5)
>>> v * v
45
>>> v * 10
[20 40 50]

Note, however, that with our existing code we cannot perform the scalar multiplication in the other direction:

>>> 10 * v
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for *: 'int' and 'Vec'

That's because Python will only invoke the __mul__ magic method (and similar methods) on the object that is on the left side of the multiplication operator.

However, there is an additional magic method __rmul__ that will be invoked on the object on the right side of the multiplication operator. (Similar methods __radd__, __rsub__ and so on also exist.) So let's implement __rmul__ now. Our implementation is simple:

def __rmul__(self, x):
    return self * x

Now we can multiply a scalar by a vector in that order:

>>> v = Vec(2, 4, 5)
>>> 10 * v
[20 40 50]

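How does this work? When we write 10 * v, Python first asks the integer on the left to perform the multiplication. An integer does not know how to multiply itself by a Vec, so Python then calls v.__rmul__(10), since v is on the right. Then the line 'return self * x' computes v * 10, which in turn automatically invokes v.__mul__(10).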
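
To watch this sequence of calls happen, we can use a throwaway class that records which magic methods were invoked. The class Traced below is hypothetical and exists only for this demonstration:

```python
class Traced:
    def __init__(self):
        self.calls = []     # names of the magic methods invoked so far

    def __mul__(self, x):
        self.calls.append('__mul__')
        return 'product'

    def __rmul__(self, x):
        self.calls.append('__rmul__')
        return self * x     # delegates to __mul__, as in Vec

t = Traced()
result = 10 * t
print(t.calls)    # ['__rmul__', '__mul__']
print(result)     # product
```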

sets

A set in Python is a useful built-in data structure that represents a mutable set of values. The function set() returns an empty set:

>>> set()
set()

Alternatively, you may construct a set of values by listing them within braces:

>>> s = {3, 5, 10}

By the way, do not attempt to create an empty set by writing '{}'. That is an empty dictionary, not a set! (We will discuss dictionaries below.)

We may add a value to a set using the add() method. If the value is already present, the call does nothing:

>>> s.add(4)
>>> s
{10, 3, 4, 5}
>>> s.add(4)
>>> s
{10, 3, 4, 5}

The remove() method removes a value from a set. If it's not present, Python will report an error:

>>> s.remove(5)
>>> s
{10, 3, 4}
>>> s.remove(5)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 5
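
If you would rather not get an error when the value is absent, sets also have a discard() method, which removes a value if it is present and otherwise does nothing. A short sketch:

```python
s = {10, 3, 4}
s.discard(5)        # 5 is absent: nothing happens, and there is no error
s.discard(3)        # 3 is present: it is removed
print(sorted(s))    # [4, 10]
```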

The 'in' operator tests whether a value is present in a set:

>>> s = { 'red', 'green', 'yellow', 'turquoise' }
>>> 'yellow' in s
True
>>> 'purple' in s
False

All of these operations are efficient: add(), remove(), and 'in' will run in O(1) time on average. For this reason, a set is sometimes a better choice of data structure than a list, especially in situations when we want to be able to quickly test whether an element is present.
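
As one illustration of fast membership tests, here is a sketch of a function that removes duplicates from a list while keeping the first occurrence of each value. The function name unique is chosen just for this example. A set remembers which values have already been seen, so each test is O(1) on average rather than a scan of the output list:

```python
def unique(values):
    """Return the values in their original order, without duplicates."""
    seen = set()
    result = []
    for v in values:
        if v not in seen:    # O(1) on average
            seen.add(v)
            result.append(v)
    return result

print(unique([3, 5, 3, 10, 5, 7]))    # [3, 5, 10, 7]
```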

However, be aware that sets are unordered. In other words, the values in a set are not in any particular order, and you cannot access values by index:

>>> s = { 'red', 'green', 'yellow', 'turquoise' }
>>> s[2]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'set' object is not subscriptable

Python will print a set's values in an arbitrary order, unrelated to the order in which values were added:

>>> s = set()
>>> s.add(5)
>>> s.add(2)
>>> s.add(10)
>>> s
{2, 10, 5}

Sets are iterable, meaning that you can use a 'for' statement to loop over a set's values (in some arbitrary order):

>>> s = { 'red', 'green', 'yellow', 'turquoise' }
>>> for c in s:
...   print(c)
... 
green
turquoise
red
yellow

Note that every value in a set must be immutable. So you cannot create a set of lists, for example:

>>> s = { [1, 2, 3], [6, 5, 8], [10, 11, 12] }
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'

However, a set of tuples will work fine, since tuples are immutable:

>>> s = { (1, 2, 3), (6, 5, 8), (10, 11, 12) }
>>> s
{(6, 5, 8), (1, 2, 3), (10, 11, 12)}

Python provides handy operators for performing Boolean operations on sets. The '&', '|', and '-' operators compute the intersection, union, or difference of two sets:

>>> s = { 5, 8, 2, 10, 3, 6, 12 }
>>> t = { 8, 77, 3, 14, 12 }
>>> s & t
{8, 3, 12}
>>> s | t
{2, 3, 5, 6, 8, 10, 12, 77, 14}
>>> s - t
{2, 10, 5, 6}
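
A related operator, '^', computes the symmetric difference: the values that belong to exactly one of the two sets. The sketch below reuses the sets from above and checks that the result equals the union minus the intersection. It prints a sorted list, since a set's own printing order is arbitrary:

```python
s = {5, 8, 2, 10, 3, 6, 12}
t = {8, 77, 3, 14, 12}

either = s ^ t
print(sorted(either))                 # [2, 5, 6, 10, 14, 77]
print(either == (s | t) - (s & t))    # True
```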

The set operators 's <= t' and 's >= t' test whether s is a subset or superset of t, respectively:

>>> s = { 2, 4, 6 }
>>> t = { 1, 2, 3, 4, 5, 6, 8 }
>>> s <= t
True
>>> t <= s
False

For more information about built-in operators and functions that work on sets, see our quick library reference.

dictionaries

A dictionary is a built-in Python data type that represents a mutable map from keys to values. Dictionaries are useful in many situations.

Here's a dictionary that maps the names of some animals in English to their Czech equivalents:

>>> d = {'cat' : 'kočka', 'dog' : 'pes', 'guinea pig' : 'morče'}

The keys in this dictionary are the strings 'cat', 'dog' and 'guinea pig'. The corresponding values are 'kočka', 'pes' and 'morče'. Notice that each key and its value are separated by a colon (':') in the declaration above.

We may look up the value for any key:

>>> d['dog']
'pes'
>>> d['guinea pig']
'morče'

We can use an assignment to add a new key/value pair to the dictionary:

>>> d['cow'] = 'kráva'
>>> d
{'cat': 'kočka', 'dog': 'pes', 'guinea pig': 'morče', 'cow': 'kráva'}

We may use the same syntax to replace the value at any key:

>>> d['cat'] = 'gato'
>>> d
{'cat': 'gato', 'dog': 'pes', 'guinea pig': 'morče', 'cow': 'kráva'}

The del statement will delete a key and its value from a dictionary:

>>> del d['dog']
>>> d
{'cat': 'gato', 'guinea pig': 'morče', 'cow': 'kráva'}

The in operator tests whether a key is present in a dictionary:

>>> 'guinea pig' in d
True
>>> 'horse' in d
False
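
Looking up a key that is absent raises a KeyError, much as remove() does on a set. To avoid that, you can test with 'in' first, or use the dictionary's get() method, which returns None (or a default of your choice) when the key is missing. A brief sketch:

```python
d = {'cat': 'kočka', 'dog': 'pes'}

try:
    d['horse']
except KeyError:
    print('no such key')             # this line is printed

print(d.get('horse'))                # None
print(d.get('horse', 'unknown'))     # unknown
print(d.get('dog', 'unknown'))       # pes
```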

All of these operations are efficient. Looking up a key's value, replacing a key's value, deleting a key, and testing for a key will all run in O(1) on average.

Notice that all keys in a dictionary are unique. Values, however, may be repeated. Also note that dictionaries are indexed by key, but not by value. In other words, you can quickly look up the value for any key, but if you want to know whether a particular value is present in the dictionary, you must scan the entire dictionary, which takes O(N) time, where N is the number of entries.
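
To make that cost concrete, here is a sketch of a search by value. The function keys_with_value is a made-up helper, not a built-in. It has to visit every entry, because the dictionary offers no shortcut from a value back to its keys:

```python
def keys_with_value(d, value):
    """Return all keys that map to the given value, by scanning every entry."""
    result = []
    for k in d:              # looping over a dictionary visits its keys
        if d[k] == value:
            result.append(k)
    return result

d = {'cat': 'kočka', 'kitty': 'kočka', 'dog': 'pes'}
print(keys_with_value(d, 'kočka'))    # ['cat', 'kitty']
print(keys_with_value(d, 'kráva'))    # []
```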

Python provides methods that let us conveniently access the keys and values in a dictionary. First, d.keys() returns an iterable object containing all keys in the dictionary. You can iterate over this object using 'for', or convert it to a list:

>>> d = {'cat' : 'kočka', 'dog' : 'pes', 'guinea pig' : 'morče'}
>>> for k in d.keys():
        print(k)
cat
dog
guinea pig
>>> list(d.keys())
['cat', 'dog', 'guinea pig']

Similarly, d.values() returns an iterable object containing all values in a dictionary. Another helpful method is d.items(), which returns an iterable of key-value pairs:

>>> list(d.items())
[('cat', 'kočka'), ('dog', 'pes'), ('guinea pig', 'morče')]

Recall that in a 'for' statement we may use tuple unpacking (a simple form of pattern matching) to assign to multiple variables on each loop iteration:

>>> for x, y in [(1, 2), (3, 4), (5, 6)]:
         print(x + y)
3
7
11

And so we may conveniently use tuple unpacking to loop over the keys and values of a dictionary simultaneously:

>>> d
{'cat': 'kočka', 'dog': 'pes', 'guinea pig': 'morče'}
>>> for key, value in d.items():
        print(f'key = {key}, value = {value}')
key = cat, value = kočka
key = dog, value = pes
key = guinea pig, value = morče
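
One common use of this loop is to build a reversed dictionary that maps in the other direction. The sketch below assumes that the values are all distinct. If two keys shared a value, the later one would overwrite the earlier one:

```python
d = {'cat': 'kočka', 'dog': 'pes', 'guinea pig': 'morče'}

# Build a dictionary from Czech to English by swapping each pair.
czech_to_english = {}
for english, czech in d.items():
    czech_to_english[czech] = english

print(czech_to_english['pes'])      # dog
print(czech_to_english['morče'])    # guinea pig
```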

Note that dictionary keys, like values in a set, must be immutable. So you can't create a dictionary whose keys are lists:

>>> d = { [1, 2, 3]: 7, [5, 3, 2]: 8 }
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'

However, tuples will work fine as dictionary keys:

>>> d = { (1, 2, 3): 7, (5, 3, 2): 8 }

See our quick library reference for more information about built-in operators and functions that work on dictionaries.

counting words

As an example of the practical utility of dictionaries, suppose that we'd like to determine which words in a file appear most often. For example, consider the entire text of the classic novel War and Peace. Let's write a program to read the text and print out its most common words. We'll only consider words with at least 5 letters, since shorter common words (e.g. 'the', 'a') are not usually so interesting.

We'll assume that the text is in a file 'war_and_peace'. We'll read the file and construct a dictionary that maps each word to its count. After that, we'll extract a list of pairs (count, word) from the dictionary. Then we'll sort the list. Python orders tuples lexicographically, so that will sort by count since the count appears first in each pair.
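
To see what lexicographic ordering means for such pairs, here is a tiny sketch with invented data. Pairs are compared by their first element, and the second element only breaks ties. With reverse = True, words that have equal counts therefore come out in reverse alphabetical order:

```python
pairs = [(2, 'pear'), (5, 'apple'), (2, 'zebra'), (5, 'mango')]
pairs.sort(reverse = True)
print(pairs)    # [(5, 'mango'), (5, 'apple'), (2, 'zebra'), (2, 'pear')]
```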

Here is the program:

count = {}

# The 'with' statement closes the file automatically when the block ends.
with open('war_and_peace') as f:
    for line in f:
        words = line.strip().split()
        for w in words:
            if len(w) >= 5:
                if w in count:
                    count[w] += 1
                else:
                    count[w] = 1

# Extract a list of pairs.
pairs = []
for word, n in count.items():
    pairs.append( (n, word) )

pairs.sort(reverse = True)

for n, word in pairs[:10]:
    print(f'{word}: {n}')

Let's run it:

$ py words.py
which: 1924
Prince: 1499
their: 1407
would: 1333
Pierre: 1260
could: 1090
about: 908
there: 895
CHAPTER: 730
Andrew: 700