Week 9: Notes

hash tables

A hash table is a data structure used to implement either a set of keys, or a map (i.e. dictionary) from keys to values. A hash table is often more efficient than a binary search tree, which can also be used to implement a set or map, as we saw recently. Hash tables are simple, and do not require the complex rebalancing code that binary search trees need. For this reason, hash tables are widely used in practice. (In fact, Python stores sets and dictionaries as hash tables internally.)

One common method for implementing a hash table is chaining, in which the table contains an array of buckets. Each bucket contains a hash chain, which is a linked list of keys (or key/value pairs in the case of a map). For example, here is a picture of a small hash table with 4 buckets that stores a set of string keys:

In some hash table implementations, the array of buckets has a fixed size. In others, it can expand dynamically. For the moment, we will assume that the number of buckets is a constant B.

A hash table requires a hash function h(k) that can map each key to a bucket number. Typically we choose h(k) = h1(k) mod B, where h1 is a preexisting hash function that maps keys to larger integers in a fixed range (such as the my_hash() function we wrote in the previous lecture, or Python's built-in hash() function). If a key k is present in a hash table, it is always stored in the hash chain in bucket h(k). In other words, the hash function tells us which bucket a key belongs in.
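For example, here is a quick illustration of computing bucket indices with Python's built-in hash() (a sketch of my own; note that hash() of a string is randomized per run, so the exact bucket numbers will vary):

B = 4
for k in ["apple", "banana", "cherry"]:
    print(k, "-> bucket", hash(k) % B)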

With this structure, it is straightforward to implement the set operations contains(), add(), and remove():

contains(k): Compute i = h(k), then walk the hash chain in bucket i, looking for a node with key k.

add(k): If k is not already present, compute i = h(k) and prepend a new node holding k to the chain in bucket i.

remove(k): Compute i = h(k), then walk the chain in bucket i; if a node with key k is found, unlink it from the list.

Here is a partial implementation in Python of a hash table representing a set of objects. The remove() method is left as an exercise for you.

class Node:
    def __init__(self, key, next):
        self.key = key
        self.next = next

class HashSet:
    def __init__(self, num_buckets):
        # each array element is the head of a linked list of Nodes
        self.a = num_buckets * [None]

    def contains(self, x):
        i = hash(x) % len(self.a)     # hash bucket index
        p = self.a[i]
        while p is not None:
            if p.key == x:
                return True
            p = p.next
        return False

    def add(self, x):
        if not self.contains(x):
            i = hash(x) % len(self.a)
            self.a[i] = Node(x, self.a[i])   # prepend to hash chain

    def remove(self, x):
        # Exercise: Implement this method.
        pass
    

It is straightforward to extend this hash set implementation to store a map from keys to values: all we have to do is store a key/value pair in each node, just like when we use a binary search tree to store a map.
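For instance, here is a minimal sketch of that extension (my own illustration, not part of the lecture code; the names MapNode and HashMap are made up):

class MapNode:
    def __init__(self, key, value, next):
        self.key = key
        self.value = value
        self.next = next

class HashMap:
    def __init__(self, num_buckets):
        self.a = num_buckets * [None]

    def get(self, key):
        # walk the chain in key's bucket, returning its value (or None)
        p = self.a[hash(key) % len(self.a)]
        while p is not None:
            if p.key == key:
                return p.value
            p = p.next
        return None

    def put(self, key, value):
        i = hash(key) % len(self.a)
        p = self.a[i]
        while p is not None:        # if key is already present, update it
            if p.key == key:
                p.value = value
                return
            p = p.next
        self.a[i] = MapNode(key, value, self.a[i])   # otherwise prepend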

We'd now like to consider the running time of hash table operations. Suppose that a hash table has N nodes in B buckets. Then its load factor α is defined as α = N / B. This is the average number of nodes per bucket, i.e. the average length of each hash chain.

Suppose that our hash function distributes keys evenly among buckets. Then any lookup in a hash table that misses (e.g. a contains() request for a key that is absent) will effectively be choosing a bucket at random. So it will examine α nodes on average as it walks the bucket's hash chain to look for the key. This shows that such lookups run in time O(α) on average, independent of N.

The analysis is a bit trickier for lookups that hit, since these are more likely to search a bucket that has a longer hash chain. Nevertheless it can also be shown that these run in time O(α) on average if the hash function distributes keys evenly. In fact all of our hash set operations (add, contains, remove) will run in O(α) on average.
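As a quick empirical check of this claim, we can fill the HashSet above with random values and compute the average chain length (a sketch of my own; the exact numbers are illustrative):

import random

s = HashSet(100)
for _ in range(1000):
    s.add(random.randrange(10**9))

lengths = []
for head in s.a:
    n, p = 0, head
    while p is not None:
        n += 1
        p = p.next
    lengths.append(n)

print(sum(lengths) / len(lengths))   # about 10.0, i.e. the load factor N / B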

Now, really we are interested in the running time of operations as a function of N, the number of values in the hash table. Suppose that we begin with an empty hash table with B buckets, and insert N values into it:

h = HashSet(B)
for i in range(N):
    h.add(i)

How long will this code take to run? When we perform the first insertion, there are no values in the table, so the load factor α is 0 / B = 0. On the next insertion, the load factor is 1 / B, and so on. Each insertion runs in O(α). So the total time will be

O(0 / B + 1 / B + 2 / B + ... + (N - 1) / B) = (1 / B) · O(0 + 1 + 2 + ... + (N - 1)) = O(N² / B) = O(N²)

And so the average time for each of our N insertions is O(N).

A hash table with only one bucket is just a linked list. By using B buckets, we've effectively divided the items in the table into B linked lists, which makes our operations faster by a factor of B. However, if B is a constant, then the big-O running time of our operations is just the same as for a single linked list, since big-O is not affected by a constant factor. To put it differently, if B is a constant then the load factor α = N / B will grow arbitrarily large as N grows large, and so our operations will become arbitrarily slow.

dynamically expanding a hash table

To achieve a better big-O running time, we can dynamically expand a hash table whenever its load factor grows above some fixed limit α_max. To grow the table, we allocate a new bucket array, typically twice the size of the old one. Then we loop over all the nodes in the buckets in the old array, and insert them into the new array. We must recompute each key's bucket index to find its position in the new array, which may not be the same as in the previous, smaller array.

Suppose that we start with an empty hash table and insert N values into it, doubling the number of buckets whenever the load factor exceeds some fixed value α_max. Then how long will it take to insert the N values, as a function of N?

If we exclude the time for the doubling operations, then each insertion operation will run in O(1). That's because each insertion will run in O(α) (since it must traverse an existing hash chain to check whether the value being inserted is already present), and α will always be less than the constant α_max.

Now let's consider the time spent growing the table. The time to perform each doubling operation is O(M), where M is the number of elements in the hash table at the moment we perform the doubling. That's because we must rehash M elements, and each rehash takes O(1), since we can compute a hash value and prepend an element to a linked list in constant time. If there are initially k buckets, then the first doubling operation grows the table to 2k buckets and rehashes O(k) elements. Considering the end of the process, the last doubling operation rehashes about N elements, the second-to-last about N / 2 elements, the one before that about N / 4, and so on. The total doubling time will be

O(k) + O(2k) + … + O(N / 4) + O(N / 2) + O(N)

= O(1 + 2 + 4 + … + N / 4 + N / 2 + N)

= O(2N) = O(N)

since the geometric series 1 + 2 + 4 + … + N sums to less than 2N.

So we can insert N elements in O(N), which means that inserting an item takes O(1) on average, even as the hash table grows arbitrarily large.

I won't provide Python code here, but implementing a dynamically expanding hash table is a useful exercise.

priority queues

In recent lectures we learned about stacks and queues, which are abstract data types that we can implement in various ways, such as using an array or a linked list.

A priority queue is another abstract data type. At a minimum, a priority queue might provide the following methods:

q.add(value)
Add a value to a priority queue.
q.is_empty()
Return true if the queue is empty.
q.remove_smallest()
Remove the smallest value from a priority queue and return it.

A priority queue differs from a stack and an ordinary queue in the order in which elements are removed. A stack is last in first out: the pop function removes the element that was added most recently. An ordinary queue is first in first out: the dequeue function removes the element that was added least recently. In a priority queue, the remove_smallest() method removes the element with the smallest value.

The interface above describes a min-queue, in which we can efficiently remove the smallest value. Alternatively we can build a max-queue, which has a remove_largest() method that removes the largest value; this is more convenient for some applications. There is no fundamental difference between these: any data structure that implements a min-queue can be trivially modified to produce a max-queue, by changing the direction of element comparisons.
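For example, here is one trivial way to do this for numeric values (a sketch of my own; rather than reversing comparisons, it negates values on the way in and out):

class MaxQueue:
    def __init__(self, min_queue):
        self.q = min_queue            # any min-queue with the interface above

    def add(self, value):
        self.q.add(-value)            # store values negated

    def is_empty(self):
        return self.q.is_empty()

    def remove_largest(self):
        return -self.q.remove_smallest()   # smallest negated = largest original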

In theory we could implement a priority queue using a binary search tree. If we did so and the tree was balanced, then add() and remove_smallest() would run in time O(log N), where N is the number of elements in the queue. But there are more efficient data structures for implementing priority queues, such as binary heaps, which we will now discuss.

binary heaps

A binary heap is a data structure that can efficiently implement a priority queue. Specifically, it has the form of a binary tree that satisfies two properties:

  1. Every node is smaller than or equal to its children. To be more precise, if a node with value p has a child with value c, then p ≤ c. This is sometimes called the heap property.

  2. All levels of the tree are complete except possibly the last level, which may be missing some nodes on the right side.

For example, here is a binary heap:

This heap looks like a complete binary tree with one node missing on the right in the last level.

The height of a binary heap with N nodes is ⌊log₂ N⌋ = O(log N). In other words, a binary heap is always balanced.

We don’t store a binary heap using dynamically allocated nodes and pointers. Instead, we use an array, which is more compact, and is possible because of the shape of the binary heap tree structure. The binary heap above can be stored in an array like this:

Notice that the array values are in the same order in which they appear in the tree, reading across tree levels from top to bottom and left to right.
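As a concrete illustration (a made-up example, not the heap pictured above), the min-heap

        1
       / \
      3   2
     / \
    8   5

is stored as the array [1, 3, 2, 8, 5]: index 0 holds the root, indices 1 and 2 hold the second level, and indices 3 and 4 hold the last level.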

From the tree above we can see the following patterns. If a heap node has index i, then

  1. its left child (if present) has index 2i + 1,

  2. its right child (if present) has index 2i + 2, and

  3. its parent (if i > 0) has index ⌊(i − 1) / 2⌋.

In Python, we can store heap elements in an ordinary Python list (which, as we know, is actually a dynamic array):

class BinaryHeap:
    def __init__(self):
        self.a = []    # list will hold heap elements

Let's write helper functions that move between related tree nodes, using the formulas we saw above:

def left(i):
    return 2 * i + 1
    
def right(i):
    return 2 * i + 2
    
def parent(i):
    return (i - 1) // 2

The heap depicted above is a min-heap, which supports a remove_smallest() operation. If we like, we may trivially modify it to be a max-heap, which instead supports a remove_largest() operation. In a max-heap, every node's value is greater than or equal to its children's values, and the largest value is at the root.

heap operations

We will now describe operations that will let us use a min-heap as a priority queue.

We first consider inserting a value v into a heap. To do this, we first add v to the end of the heap array, expanding the array size by 1. Now v is in a new leaf node at the end of the last heap level. The heap still has a valid shape, but the heap property may now be violated, since the newly added value v may be smaller than the value in its parent node.

So we now perform an operation called up heap, which will move v upward if necessary to restore the heap property. Suppose that v's parent has value p, and p > v. That parent may have a second child with value w. We begin by swapping the values v and p. Now v's children are p and w (if present). We know that v < p. If w is present, then p ≤ w (since the heap property held before the swap), so v < w by transitivity. Thus the heap property has been restored for the node that now contains v. We now continue this process, swapping v upward in the heap until v is greater than or equal to its parent, or until it reaches the root of the tree.

In Python, the insert operation will look like this:

    # in class BinaryHeap

    def up_heap(self, i):
        while i > 0:
            p = parent(i)
            if self.a[p] <= self.a[i]:
                break
            self.a[i], self.a[p] = self.a[p], self.a[i]
            i = p

    def insert(self, x):
        self.a.append(x)
        self.up_heap(len(self.a) - 1)

The worst-case running time of an up heap operation is O(log N), since it will travel through O(log N) nodes at most (i.e. up to the top of the tree), and performs O(1) work at each step. And so insert() also runs in O(log N) in the worst case.
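For example, a quick usage check (assuming the BinaryHeap class above):

h = BinaryHeap()
for x in [5, 3, 8, 1]:
    h.insert(x)
print(h.a)    # [1, 3, 8, 5], a valid min-heap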

Now we consider removing the smallest value from a min-heap. Since the smallest value in such a heap is always in the root node, removing it leaves a hole at the root that we must fill with some other value. So we take the value at the end of the heap array and place it in the root, decreasing the array size by 1 as we do so. The heap still has a valid shape, but the heap property may now be violated, since the value at the root may be larger than one or both of its children.

So we now perform an operation called down heap on this value v, which will lower it to a valid position in the tree. If v is larger than one or both of its children, then let w be the value of the smallest of v's children, so v > w. We swap v with w. Now w's children are v and possibly some other value x. We know that w < v, and if there is another child x then w ≤ x, since w was the smallest child. So the heap property has been restored for the node that now contains w. We then continue this process, swapping v downward as many times as necessary until v reaches a position where it has no smaller children. The process is guaranteed to terminate, since if v descends all the way to a leaf, it has no children at all and the condition is trivially satisfied.

In Python, remove_smallest() will look like this. I've left down_heap() unimplemented as an exercise for you.

    def down_heap(self, i):
        # Exercise: Implement this method.
        pass

    def remove_smallest(self):
        x = self.a[0]
        last = self.a.pop()        # remove the last array element
        if len(self.a) > 0:        # if the heap is not now empty,
            self.a[0] = last       # move the last element to the root
            self.down_heap(0)      # and let it fall to a valid position
        return x

Similarly to an up heap operation, the running time of a down heap operation is O(log N) in the worst case, since it will travel through O(log N) nodes at most (i.e. down to the bottom of the tree), and performs O(1) work at each step. And so remove_smallest() also runs in O(log N) in the worst case.

heapifying an array

Suppose that we have an array a of numbers in any order, and we'd like to heapify the array, i.e. rearrange its elements so that they form a valid binary heap. How can we do this?

One possible approach would be to iterate over the array elements from left to right, performing an up heap operation on each one. After each such operation, the array elements visited so far will form a valid binary heap, so when we are done we will have a valid heap containing all array elements. If there are N elements in the array, then there may be as many as N / 2 leaf nodes. The up heap operation on each leaf may take O(log N) time, since it may travel all the way up the tree. Thus, in the worst case the heapification may take N / 2 · O(log N) = O(N log N).

However, there is a better way. We can instead iterate over the array elements from right to left, performing a down heap operation on each one. At each step, the heap property will hold throughout the subtree rooted at each node we've already visited, so when we're done we will have a complete valid binary heap.

How long will this take? Suppose that there are N elements in the array. For simplicity, let's assume that N = 2^k − 1 for some k, so that the heap has the form of a complete binary tree with height k − 1. Then there are 2^(k−1) leaves, which is approximately N / 2. We don't even need to perform a down heap operation on these, since there is nothing below them. There are 2^(k−2) parents of leaves, which is approximately N / 4. The down heap operation on each of these will take only one step, since each such node can fall at most one position. So the total running time will be proportional to

0 · (N / 2) + 1 · (N / 4) + 2 · (N / 8) + 3 · (N / 16) + …

Let's rewrite the expression above in a two-dimensional form:

N / 4 + N / 8 + N / 16 + N / 32 + N / 64 + …
      + N / 8 + N / 16 + N / 32 + N / 64 + …
              + N / 16 + N / 32 + N / 64 + … 
                       + N / 32 + N / 64 + …
                       + … 

The sum of the first row above is N / 2. The sum of the second row is N / 4, and so on. The total of all rows is

N / 2 + N / 4 + N / 8 + … = N

Thus the total cost of this heapification process is O(N). It is remarkably efficient.

heapsort

We’ve already seen several sorting algorithms in this course: bubble sort, selection sort, insertion sort, merge sort and quicksort. We can use a binary heap to implement yet another sorting algorithm: heapsort. The essence of heapsort is simple: given an array of numbers to sort, we put them into a heap, and then we pull out the numbers one by one in sorted order.

Let’s examine these steps in more detail. For heapsort, we want a max-heap, which lets us remove the largest element by calling a method remove_largest. Given an array of N values, we first heapify it as described above. After that, sorting the array is easy: we repeatedly call remove_largest to remove the largest element, which we place at the end of the array, to the left of any other elements that we have already removed. When we are done, the array will be sorted. A Python implementation will look like this:

class Heap:
    def __init__(self, a):
        self.a = a
        self.n = len(a)  # current heap size 

        # Heapify the given array.
        for i in range(self.n - 1, -1, -1):
            self.down_heap(i)
    
    def down_heap(self, i):
        # Exercise: Implement this method. (Use self.n, not len(self.a),
        # as the heap size, and compare elements in max-heap order.)
        pass

    def remove_largest(self):
        x = self.a[0]
        self.a[0] = self.a[self.n - 1]
        self.n -= 1
        self.down_heap(0)
        return x

    def sort(self):
        while self.n > 0:
            x = self.remove_largest()
            self.a[self.n] = x

def heapsort(a):
    h = Heap(a)
    h.sort()
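
Once down_heap has been implemented, a quick check might look like this:

a = [5, 1, 4, 2, 3]
heapsort(a)
print(a)    # [1, 2, 3, 4, 5]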

The total time for heapsort to sort N values is (a) the time to heapify the array plus (b) the time to remove each of the N values from the heap in turn. This is

O(N) + O(N log N) = O(N log N)

Thus heapsort has the same asymptotic running time as merge sort and quicksort. Some advantages of heapsort are that it runs on an array in place (unlike merge sort) and that it is guaranteed to run in time O(N log N) for any input (unlike quicksort).

Here is a graph showing the relative performance of our Python implementations of merge sort, quicksort, and heapsort on random arrays containing up to 100,000 elements. Both randomized and non-randomized versions of quicksort are shown:

We can see that heapsort is substantially slower than quicksort. Nevertheless, it is sometimes used in environments where memory is limited and quicksort's worst-case quadratic behavior would be unacceptable. For example, the Linux kernel uses heapsort internally. Another advantage of heapsort is that its implementation is non-recursive, so there is no danger of overflowing the call stack, as could happen in quicksort with a worst-case input.