Introduction to Algorithms, 2020-1
Week 10: Notes

Some of this week's topics are covered in Problem Solving with Algorithms:

And in Introduction to Algorithms:

Here are some additional notes.

maps

We have already seen several abstract data types: stacks, queues, and sets. We've also seen various ways to implement them.

Another abstract data type is a map (or dictionary), which maps keys to values. It provides these operations:

m.add(key, value)
Add a new (key, value) pair, or update an existing key if present.
m.remove(key)
Remove a key and its associated value.
m.lookup(key)
Look up a key and return its associated value, or None if absent.

A map cannot contain the same key twice. In other words, a map associates a key with exactly one value.

This type should be familiar, since we have used Python's dictionaries, which are an implementation of this abstract type.
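
For example, Python dictionaries provide all three operations directly:

m = {}
m['apple'] = 3     # add a new (key, value) pair
m['apple'] = 4     # update the value for an existing key
m.get('apple')     # look up a key: returns 4
m.get('pear')      # returns None, since the key is absent
del m['apple']     # remove a key and its associated value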

hash tables

A hash table is a data structure used to implement either a set of keys, or a map from keys to values. A hash table is often more efficient than a binary tree (which can also be used to implement a set or map, as we saw recently). Hash tables are simple, and do not require complex code to stay balanced as binary trees do. For this reason, hash tables are very widely used, probably even more so than binary trees for storing arbitrary maps from keys to values.

The most common method for implementing a hash table is chaining, in which the table contains an array of buckets. Each bucket contains a hash chain, which is a linked list of keys (or key/value pairs in the case of a map).

In Python, here is how we might implement a hash table representing a set of objects:

class Node:
    def __init__(self, key, next):
        self.key = key
        self.next = next

class HashSet:
    def __init__(self, numBuckets):
        # each array element is the head of a linked list of Nodes
        self.a = numBuckets * [None]

In some hash table implementations, the array of buckets has a fixed size. In others, it can expand dynamically. For the moment, we will assume that the number of buckets is a constant B.

A hash table requires a hash function h(k) that can map each key to a bucket number. Typically we use a preexisting hash function h1 that maps keys to larger integers, then let h(k) = h1(k) mod B. If a key k is present in a hash table, it is always stored in the hash chain a[h(k)]. In other words, the hash function tells us which bucket a key belongs in.

Let's expand the HashSet class with a method contains that checks whether a value is present in a HashSet, and a method add that adds a value if not already present:

def contains(self, x):
    b = hash(x) % len(self.a)    # bucket number for x
    p = self.a[b]
    while p is not None:
        if p.key == x:
            return True
        p = p.next
    return False

def add(self, x):
    if not self.contains(x):
        b = hash(x) % len(self.a)
        self.a[b] = Node(x, self.a[b])   # prepend to hash chain

We haven't implemented a remove method here to delete a key from a hash table, but that would be straightforward. To delete a key, we simply find its node in the hash bucket that contains it, and delete the node from the containing linked list.
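
For example, a remove method might look like this (a sketch, following the same conventions as contains and add above):

def remove(self, x):
    b = hash(x) % len(self.a)
    prev = None
    p = self.a[b]
    while p is not None:
        if p.key == x:
            if prev is None:
                self.a[b] = p.next   # the node was the head of its chain
            else:
                prev.next = p.next   # unlink the node from the chain
            return
        prev = p
        p = p.next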

Suppose that a hash table has N key/value pairs in B buckets. Then its load factor α is defined as α = N / B. This is the average number of key/value pairs per bucket, i.e. the average length of each hash chain.

Suppose that our hash function distributes keys evenly among buckets. Then any lookup in a hash table that misses (either a get request for a key that is absent, or setting the value of a new key) will effectively be choosing a bucket at random. So it will examine α nodes on average as it walks the bucket's hash chain to look for the key. This shows that such lookups run in time O(α) on average, independent of N.

The analysis is a bit trickier for lookups that hit, since these are more likely to search a bucket that has a longer hash chain. Nevertheless it can also be shown that these run in time O(α) on average.

So we can make hash table lookups arbitrarily fast (on average) by keeping α small, i.e. by using as many hash buckets as needed. Of course, this supposes that we know in advance how many items we will be storing in a hash table, so that we can preallocate an appropriate number of buckets.

However, even if that number is not known, we can dynamically grow a hash table whenever its load factor grows above some fixed limit. To grow the table, we allocate a new bucket array, typically twice the size of the old one. Then we loop over all the nodes in the buckets in the old array, and insert them into the new array. We recompute each key's hash value to find its position in the new array, which may not be the same as in the previous, smaller array.
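
Here is one way such a growth operation might look for our HashSet (a sketch; the method name grow is our own, and the class as given does not define it):

def grow(self):
    old = self.a
    self.a = (2 * len(old)) * [None]    # new bucket array, twice as large
    for head in old:
        p = head
        while p is not None:
            nxt = p.next
            b = hash(p.key) % len(self.a)   # recompute the bucket number
            p.next = self.a[b]              # prepend the node to its new chain
            self.a[b] = p
            p = nxt

Note that this sketch reuses the existing nodes rather than allocating new ones; only the chain links and bucket positions change.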

Suppose that we start with an empty hash table and insert N values into it, doubling the number of buckets whenever the load factor exceeds some fixed value α0. Then how long will it take to insert the N values, as a function of N?

If we exclude the time for the doubling operations, then each insertion operation will run in O(1). That's because each insertion will run in O(α) (since it must traverse an existing hash chain to check whether the value being inserted is already present), and α will always be less than the constant α0.

Now let's consider the time spent growing the table. The time to perform each doubling operation is O(M), where M is the number of elements in the hash table at the moment we perform the doubling. That's because we must rehash M elements, and each rehash operation takes O(1) since we can compute a hash value and prepend an element to a linked list in constant time. Suppose that the hash table initially contains k buckets. Then we will perform the first doubling operation when there are (kα0) values in the table. Let's also suppose that we perform the last doubling operation as we insert the last (Nth) item into the table. Then the total time for all the doubling operations will be

O( kα0 + 2kα0 + 4kα0 + … + N)
= O(kα0 · (1 + 2 + 4 + … + N / kα0))
= O(1 + 2 + 4 + … + N / kα0)
= O(N / kα0)
= O(N)

So we can insert N elements in O(N), which means that insertion takes O(1) on average, even as the hash table grows arbitrarily large.
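
For example, add could check the load factor before inserting (a sketch, assuming the grow method above, a count attribute that __init__ would also need to initialize to 0, and an example limit of 0.75 for α0):

def add(self, x):
    if not self.contains(x):
        if self.count / len(self.a) > 0.75:   # load factor α exceeds α0
            self.grow()                       # double the number of buckets
        b = hash(x) % len(self.a)
        self.a[b] = Node(x, self.a[b])   # prepend to hash chain
        self.count += 1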

priority queues

In recent lectures we learned about stacks and queues, which are abstract data types that we can implement in various ways, such as using an array or a linked list.

A priority queue is another abstract data type. At a minimum, a priority queue might provide the following methods:

q.add(value)
Add a value to a priority queue.
q.isEmpty()
Return true if the queue is empty.
q.removeLargest()
Remove the largest value from a priority queue and return it.

A priority queue differs from a stack and an ordinary queue in the order in which elements are removed. A stack is last in first out: the pop function removes the element that was added most recently. An ordinary queue is first in first out: the dequeue function removes the element that was added least recently. In a priority queue, the removeLargest method removes the element with the largest value.

The interface above describes a max-queue, in which we can efficiently remove the largest value. Alternatively we can build a min-queue, which has a removeSmallest method that removes the smallest value; this is more convenient for some applications. Generally any data structure that implements a max-queue can be trivially modified to produce a min-queue, by changing the direction of element comparisons.

In theory we could implement a priority queue using a binary search tree. If we did so and the tree was balanced, then add and removeLargest would run in time O(log N), where N is the number of elements in the queue. But there are more efficient data structures for implementing priority queues, such as binary heaps, to be discussed next.

binary heaps

A binary heap is a binary tree that satisfies two properties.

  1. If a node with value p has a child with value c, then p ≥ c.

  2. All levels of the tree are complete except possibly the last level, which may be missing some nodes on the right side.

For example, here is a binary heap:

[figure: a binary heap drawn as a tree]

This heap looks like a complete binary tree with three nodes missing on the right in the last level.

The height of a binary heap with N nodes is ⌊log2(N)⌋ = O(log N). In other words, a binary heap is always balanced.

Typically we don’t store a binary heap using dynamically allocated nodes and pointers. Instead, we use an array, which is possible because of the shape of the binary heap tree structure. The binary heap above can be stored in an array a like this:

[figure: the same heap stored in an array a]

Notice that the array values are in the same order in which they appear in the tree, reading across tree levels from top to bottom and left to right.

In Python, we can store the heap elements in an ordinary Python list (which, as we know, is actually a dynamically resizable array):

class BinaryHeap:
    def __init__(self):
        self.a = []    # list will hold heap elements
        

Here is a tree showing the indices at which heap nodes are stored:

[figure: the heap tree, with each node labeled by its array index]

From this tree we can see the following patterns. If a heap node N has index i, then

  1. its left child (if present) has index 2i + 1,

  2. its right child (if present) has index 2i + 2, and

  3. its parent (if i > 0) has index ⌊(i - 1) / 2⌋.

So we can easily move between related tree nodes by performing index arithmetic:

def left(i):
    return 2 * i + 1      # index of the left child

def right(i):
    return 2 * i + 2      # index of the right child

def parent(i):
    return (i - 1) // 2   # index of the parent

heap operations

We will now describe operations that will let us use a heap as a priority queue.

Suppose that a heap structure satisfies the heap properties, except that one node has a value v which is larger than its parent. An operation called up heap can move v upward to restore the heap properties. Suppose that v's parent has value v1. That parent may have a second child with value v2. We begin by swapping v and its parent v1. Now v's children are v1 and v2 (if present). We know that v > v1. If v2 is present then v1 ≥ v2, so v > v2 by transitivity. Thus the heap properties have been restored for the node that now contains v. And now we continue this process, swapping v upward in the heap until it reaches a position where it is not larger than its parent, or until it reaches the root of the tree.
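
Here is how up heap might look as a method of our BinaryHeap class (a sketch, using the parent helper defined above):

def up_heap(self, i):
    # move the value at index i upward until it is not larger than its parent
    while i > 0 and self.a[i] > self.a[parent(i)]:
        self.a[i], self.a[parent(i)] = self.a[parent(i)], self.a[i]
        i = parent(i)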

Now suppose that a heap structure satisfies the heap properties, except that one node N has a value v which is smaller than one or both of its children. We can restore the heap properties by performing a down heap operation, in which the value v moves downward in the tree to an acceptable position. Let v1 be the value of the largest of N’s children. We swap v with v1. Now N’s value is v1, which restores the heap property for this node, since v1 > v and v1 is also greater than or equal to the other child node (if there is one). We then continue this process, swapping v downward as many times as necessary until v reaches a point where it has no larger children. The process is guaranteed to terminate successfully, since if v eventually descends to a leaf node there will be no children and the condition will be satisfied.
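
Similarly, here is a sketch of down heap as a method, using the left and right helpers above (remove_largest below calls it under the name down_heap):

def down_heap(self, i):
    # move the value at index i downward until it has no larger children
    while True:
        largest = i
        if left(i) < len(self.a) and self.a[left(i)] > self.a[largest]:
            largest = left(i)
        if right(i) < len(self.a) and self.a[right(i)] > self.a[largest]:
            largest = right(i)
        if largest == i:
            return
        self.a[i], self.a[largest] = self.a[largest], self.a[i]
        i = largest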

We can use the up heap and down heap operations to implement priority queue operations.

We first consider inserting a value v into a heap. To do this, we first add v to the end of the heap array, expanding the array size by 1. Now v is in a new leaf node at the end of the last heap level. We next perform an up heap operation on the value v, which will bring it upward to a valid position. An insert operation will always run in time O(log N).
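
In code, the whole insert operation is short (a sketch; we name it add to match the priority queue interface above, and it relies on the up_heap method sketched earlier):

def add(self, x):
    self.a.append(x)               # new leaf at the end of the last level
    self.up_heap(len(self.a) - 1)  # move it upward to a valid position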

Now we consider removing the largest value from a max-heap. Since the largest value in such a heap is always in the root node, we must place some other value there. So we take the value at the end of the heap array and place it in the root, decreasing the array size by 1 as we do so. We now perform a down heap operation on this value, which will lower it to a valid position in the tree. If there are N values in the heap, the tree height is ⌊log2(N)⌋, so this process is guaranteed to run in O(log N) time.

Here is remove_largest in Python:

    def remove_largest(self):
        if len(self.a) == 1:
            return self.a.pop()
        x = self.a[0]             # the largest value, from the root
        self.a[0] = self.a.pop()  # move the last value to the root
        self.down_heap(0)         # lower it to a valid position
        return x
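
As a quick usage example (assuming the add, up_heap and down_heap sketches above), the heap behaves as a max-queue:

q = BinaryHeap()
for x in [5, 1, 9, 3, 7]:
    q.add(x)

while len(q.a) > 0:             # the interface's isEmpty() would test this
    print(q.remove_largest())   # prints 9, 7, 5, 3, 1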