Week 6: Notes

sets

A set in Python is a useful built-in data structure that represents a mutable set of values. The function set() returns an empty set:

>>> set()
set()

Alternatively, you may construct a set of values by listing them within braces:

>>> s = {3, 5, 10}

By the way, do not attempt to create an empty set by writing '{}'. That is an empty dictionary, not a set! (We will discuss dictionaries below.)

We may add a value to a set using the add() method. If the value is already present, the call does nothing:

>>> s.add(4)
>>> s
{10, 3, 4, 5}
>>> s.add(4)
>>> s
{10, 3, 4, 5}

The remove() method removes a value from a set. If it's not present, Python will report an error:

>>> s.remove(5)
>>> s
{10, 3, 4}
>>> s.remove(5)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 5

The 'in' operator tests whether a value is present in a set:

>>> s = { 'red', 'green', 'yellow', 'turquoise' }
>>> 'yellow' in s
True
>>> 'purple' in s
False

All of these operations are efficient: add(), remove(), and 'in' will run in O(1) on average. For this reason, a set is sometimes a better choice of data structure than a list, especially in situations when we want to be able to quickly test whether an element is present.

However, be aware that sets are unordered. In other words, the values in a set are not in any particular order, and you cannot access values by index:

>>> s = { 'red', 'green', 'yellow', 'turquoise' }
>>> s[2]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'set' object is not subscriptable

Python will print a set's values in an arbitrary order, unrelated to the order in which values were added:

>>> s = set()
>>> s.add(5)
>>> s.add(2)
>>> s.add(10)
>>> s
{2, 10, 5}

Sets are iterable, meaning that you can use a 'for' statement to loop over a set's values (in some arbitrary order):

>>> s = { 'red', 'green', 'yellow', 'turquoise' }
>>> for c in s:
...   print(c)
... 
green
turquoise
red
yellow

Note that every value in a set must be immutable. So you cannot create a set of lists, for example:

>>> s = { [1, 2, 3], [6, 5, 8], [10, 11, 12] }
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'

However, a set of tuples will work fine, since tuples are immutable:

>>> s = { (1, 2, 3), (6, 5, 8), (10, 11, 12) }
>>> s
{(6, 5, 8), (1, 2, 3), (10, 11, 12)}
>>>

Python provides handy operators for performing Boolean operations on sets. The '&', '|', and '-' operators compute the intersection, union, or difference of two sets:

>>> s = { 5, 8, 2, 10, 3, 6, 12 }
>>> t = { 8, 77, 3, 14, 12 }
>>> s & t
{8, 3, 12}
>>> s | t
{2, 3, 5, 6, 8, 10, 12, 77, 14}
>>> s - t
{2, 10, 5, 6}

The set operators 's <= t' and 's >= t' test whether s is a subset or superset of t, respectively:

>>> s = { 2, 4, 6 }
>>> t = { 1, 2, 3, 4, 5, 6, 8 }
>>> s <= t
True
>>> t <= s
False

For more information about built-in operators and functions that work on sets, see our quick library reference.

dictionaries

A dictionary is a built-in Python data type that represents a mutable map from keys to values. Dictionaries are useful in many situations.

Here's a dictionary that maps the names of some animals in English to their Czech equivalents:

>>> d = {'cat' : 'kočka', 'dog' : 'pes', 'guinea pig' : 'morče'}

The keys in this dictionary are the strings 'cat', 'dog' and 'guinea pig'. The corresponding values are 'kočka', 'pes' and 'morče'. Notice that each key and its value are separated by a colon (':') in the declaration above.

We may look up the value for any key:

>>> d['dog']
'pes'
>>> d['guinea pig']
'morče'

We can use an assignment to add a new key/value pair to the dictionary:

>>> d['cow'] = 'kráva'
>>> d
{'cat': 'kočka', 'dog': 'pes', 'guinea pig': 'morče', 'cow': 'kráva'}

We may use the same syntax to replace the value at any key:

>>> d['cat'] = 'gato'
>>> d
{'cat': 'gato', 'dog': 'pes', 'guinea pig': 'morče', 'cow': 'kráva'}

The del statement will delete a key and its value from a dictionary:

>>> del d['dog']
>>> d
{'cat': 'gato', 'guinea pig': 'morče', 'cow': 'kráva'}

The in operator tests whether a key is present in a dictionary:

>>> 'guinea pig' in d
True
>>> 'horse' in d
False

All of these operations are efficient. Looking up a key's value, replacing a key's value, deleting a key, and testing for a key will all run in O(1) on average.

Notice that all keys in a dictionary are unique. Values, however, may be repeated. Also note that dictionaries are indexed by key, but not by value. In other words, you can quickly look up the value for any key, but if you want to know whether a particular value is present in the dictionary, you must scan the entire dictionary, which takes O(N).

Python provides methods that let us conveniently access the keys and values in a dictionary. First, d.keys() returns an iterable object containing all keys in the dictionary. You can iterate over this object using 'for', or convert it to a list:

>>> d = {'cat' : 'kočka', 'dog' : 'pes', 'guinea pig' : 'morče'}
>>> for k in d.keys():
        print(k)
cat
dog
guinea pig
>>> list(d.keys())
['cat', 'dog', 'guinea pig']

Similarly, d.values() lists all values in a dictionary. Another helpful method is d.items(), which is an iterable of key‑value pairs:

>>> list(d.items())
[('cat', 'kočka'), ('dog', 'pes'), ('guinea pig', 'morče')]

Recall that in a 'for' statement we may use pattern matching to assign to multiple variables on each loop iteration:

>>> for x, y in [(1, 2), (3, 4), (5, 6)]:
         print(x + y)
3
7
11

And so we may conveniently use pattern matching to loop over the keys and values of a dictionary simultaneously:

>>> d
{'cat': 'kočka', 'dog': 'pes', 'guinea pig': 'morče'}
>>> for key, value in d.items():
        print(f'key = {key}, value = {value}')
key = cat, value = kočka
key = dog, value = pes
key = guinea pig, value = morče

Note that dictionary keys, like values in a set, must be immutable. So you can't create a dictionary whose keys are lists:

>>> d = { [1, 2, 3]: 7, [5, 3, 2]: 8 }
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'

However, tuples will work fine as dictionary keys:

>>> d = { (1, 2, 3): 7, (5, 3, 2): 8 }

See our quick library reference for more information about built-in operators and functions that work on dictionaries.

counting words

As an example of the practical utility of dictionaries, suppose that we'd like to determine which words in a file appear the most times. For example, here is the entire text of the classic novel War and Peace. Let's write a program to read the text and print out its most common words. We'll only consider words with at least 5 letters, since shorter common words (e.g. 'the', 'a') are not usually so interesting.

We'll assume that the text is in a file 'war_and_peace'. We'll read the file and construct a dictionary that maps each word to its count. After that, we'll extract a list of pairs (count, word) from the dictionary. Then we'll sort the list. Python orders tuples lexicographically, so that will sort by count since the count appears first in each pair.

Here is the program:

f = open('war_and_peace')

count = {}

for line in f:
    words = line.strip().split()
    for w in words:
        if len(w) >= 5:
            if w in count:
                count[w] += 1
            else:
                count[w] = 1

# Extract a list of pairs.
pairs = []
for word, n in count.items():
    pairs.append( (n, word) )

pairs.sort(reverse = True)

for n, word in pairs[:10]:
    print(f'{word}: {n}')

Let's run it:

$ py words.py
which: 1924
Prince: 1499
their: 1407
would: 1333
Pierre: 1260
could: 1090
about: 908
there: 895
CHAPTER: 730
Andrew: 700

The repr method

Above, we learned about __init__, which is a method with a special name that Python recognizes and which affects an object's behavior in a certain way. Python actually recognizes many different special method names, all of which begin and end with two underscores. Methods with these names are often called magic methods.

Another magic method in Python is called __repr__. If this method is defined in a class, Python calls it automatically whenever it needs to generate a string representation of an object. For example, this happens when you print out an object in the interactive Python console. By default, the string representation is just a blob of text with an ugly hexadecimal number:

>>> p = Point(3, 4)
>>> q = Point(10, 20)
>>> l = Line(p, q)
>>> p
<__main__.Point object at 0x7f173b1aa8b0>
>>> l
<__main__.Line object at 0x7f173b1aa7c0>

Let's add __repr__ methods to the Point and Line classes to define a nicer string representation for these classes:

class Point:
    ...
    def __repr__(self):
        return f'P({self.x}, {self.y})'

class Line
    ...
    def __repr__(self):
        return f'{self.p} – {self.q}'

Now Point and Line objects will print more nicely:

>>> p = Point(3, 4)
>>> q = Point(10, 20)
>>> l = Line(p, q)
>>> p
P(3, 4)
>>> l
P(3, 4) - P(10, 20)

Notice that in our __repr__ method in the Line class, we wrote 'self.p' and 'self.q' in curly braces in an f-string. In this situation, Python needs to convert self.p and self.q to strings, so it will call the __repr__ method of the Point class to perform that task.

magic methods for operator overloading

Python's magic methods let us specify custom behavior for our classes. We have already seen two magic methods in Python: __init__ for initializing an object, and __repr__ for converting an object to a string.

Additional magic methods let us customize the behavior of the +, -, *, /, and // operators:

__add__
__sub__
__mul__
__truediv__
__floordiv__

More similar operators exist; you can read about them in the official Python reference. But the ones listed here are enough for us for now.

By the way, providing custom behavior for operators such as these is called operator overloading, and is possible in many, but not all, programming languages. (For example, Java has no operator overloading.)

To see how these magic methods may be used, let's revisit the Vec class that we wrote in the last lecture, representing a vector of arbitrary dimension:

class Vec:
    def __init__(self, *args):
        self.a = args

    # Generate a string representation such as [3 5 10].
    def __repr__(self):
        w = []
        for x in self.a:
            w.append(str(x))
        return '[' + ' '.join(w) + ']'

    def add(self, w):
        assert len(self.a) == len(w.a), 'vectors must have same dimension'
        sum = []

        for i in range(len(self.a)):
            sum.append(self.a[i] + w.a[i])

        return Vec(*sum)

Previously, we could add vectors using the add() method:

$ python -i vector.py 
>>> v = Vec(2.0, 4.0, 5.0)
>>> w = Vec(1.0, 2.0, 3.0)
>>> z = v.add(w)
>>> z
[3.0 6.0 8.0]

To allow vectors to be added using the + operator, we only need to change the name of the "add" method above to "__add__":

def __add__(self, w):
    assert len(self.a) == len(w.a), 'vectors must have same dimension'
    sum = []
    ...

Now we can add Vector objects using +:

$ python -i vector.py 
>>> v = Vec(2.0, 4.0, 5.0)
>>> w = Vec(1.0, 2.0, 3.0)
>>> z = v + w
>>> z
[3.0 6.0 8.0]

Behind the scenes, the + operator just calls the magic method, and in fact we can call it directly if we like:

>>> v.__add__(w)
[3.0 6.0 8.0]

Let's now implement the multiplication operator *. Actually there are two kinds of multiplication on vectors. We can multiply two vectors to compute their dot product, producing a scalar. Or we can multiply a vector by a scalar to produce another vector. We will implement both of these in a single method:

def __mul__(self, x):
    if isinstance(x, Vec):
        # Compute the dot product.
        s = 0
        for i in range(len(self.a)):
            s += self.a[i] * x.a[i]
        return s
    else:   # scalar multiplication
        b = []
        for i in range(len(self.a)):
            b.append(x * self.a[i])
        return Vec(*b)

Above, we've used the built-in function isinstance() that tests whether an object belongs to a given type.

Our multiplication operator works for both vector and scalar multiplication:

>>> v = Vec(2, 4, 5)
>>> v * v
45
>>> v * 10
[20 40 50]

However, with our existing code we cannot perform the scalar multiplication in the other direction:

>>> 10 * v
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for *: 'int' and 'Vec'

That's because Python will only invoke the __mul__ magic method (and similar methods) on the object that is on the left side of the multiplication operator.

However, there is an additional magic method __rmul__ that will be invoked on the object on the right side of the multiplication operator. (Similar methods __radd__, __rsub__ and so on also exist.) So let's implement __rmul__ now. Our implementation is simple:

def __rmul__(self, x):
    return self * x

Now we can multiply a scalar by a vector in that order:

>>> v = Vec(2, 4, 5)
>>> 10 * v
[20 40 50]