Introduction to Algorithms, 2021-2
Week 11: Notes

Some of this week's topics are covered in Problem Solving with Algorithms:

And in Introduction to Algorithms:

Here are some additional notes.

hash functions

A hash function maps values of some type T to integers in a fixed range. Often we will want a hash function that produces values in the range 0 .. (N – 1), where N is a power of two. Hash functions are very useful, and are a common building block in programming and also in theoretical computer science.

In general, there may be many more possible values of T than integers in the output range. This means that hash functions will inevitably map some distinct input values to the same output value; this is called a hash collision. A good hash function will produce relatively few collisions in practice: even two input values that are similar to each other should be unlikely to have the same hash value. An ideal hash function will produce hash collisions in practice no more often than would be expected if it were producing random outputs.

As a first example, let's think about how to construct a hash function that maps an integer of any size, i, to a hash value in the range 0 .. (N – 1), for some fixed value of N such as N = 2^32. There is an obvious choice for the function: we can take i mod N.

As a next example, let's build a hash function that maps a pair of integers (i, j) to a hash value in the range 0 .. (N – 1). We might first consider the hash function (i + j) mod N. However, this is a poor choice. For example, (0, 2), (1, 1), and (2, 0) will all have the same hash value, and it's easy to imagine that we will encounter all of these input values in practice.

A better choice of hash function would be (Ki + j) mod N for some constant K. However, some values of K will be better than others. For example, suppose that N = 2^32, and we choose the constant K = 2^30. Then (0, 0), (4, 0), and (8, 0) will all have the same hash value 0, since

(4 · 2^30) mod 2^32 = 2^32 mod 2^32 = 0

and

(8 · 2^30) mod 2^32 = 2^33 mod 2^32 = 0

In fact, with this hash function any pair (i, 0) will always have a hash value that is a multiple of 2^30, i.e. one of only 4 out of the 2^32 possible output values for the hash function in general!
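
We can verify this in Python (a throwaway demonstration; the names K, N and hash_pair are mine):

K, N = 2 ** 30, 2 ** 32

def hash_pair(i, j):
    return (K * i + j) % N

# every pair (i, 0) lands on one of only 4 hash values
print(sorted({hash_pair(i, 0) for i in range(100)}))
# [0, 1073741824, 2147483648, 3221225472]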

In general, we will get the best results if K and N are relatively prime, i.e. they have no prime factors in common. Furthermore, we probably don't want K to be too small, to avoid hash collisions for small values of i and j. As one example, the prime number 1,000,003 could be a reasonable choice for K.
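
With that constant, the pairs that collided above are spread out (continuing the demonstration, with N = 2^32 as before):

K = 1_000_003   # relatively prime to N, and not too small

def hash_pair(i, j):
    return (K * i + j) % N

print(len({hash_pair(i, 0) for i in range(100)}))   # 100 -- all distinct now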

hashing strings

In practice, we will very often want a hash function that takes strings as input. Suppose that we want a hash function that takes strings of characters with ordinal values from 0 to 255 (i.e. with no fancy Unicode characters) and produces 32-bit hash values in the range 0 ≤ v < 2^32.

As one idea, we could add the ordinal values of all characters in the input string:

def my_hash(s):
    return sum(ord(c) for c in s)

This is a poor hash function. If the input strings are short, then the output values will always be small integers. Furthermore, two input strings that contain the same set of characters (a very common occurrence) will hash to the same number.

Here is one way to construct a better hash function, called modular hashing. Given any string s, consider it as a series of digits forming one large number H. For example, if our characters have ordinal values in the range from 0 to 255, we can imagine them to be digits in base 256. Then we can compute H mod N for some constant N, producing a hash value in the range from 0 to N – 1.

Here is Python code that implements this idea:

# Generate a hash code in the range 0 .. 2^32 – 1
# DO NOT USE – this hash function is poor (see text below)
def my_hash(s):
    h = 0
    for c in s:
        d = ord(c)    # 0 .. 255
        h = 256 * h + d
    return h % (2 ** 32)

As you can see, this code is using the algorithm for combining digits in any base that we learned in one of the very first lectures of this class.

Unfortunately, this hash function is still poor. We can see that it often maps similar strings to the same hash value:

>>> my_hash('bright')
1768384628
>>> my_hash('light')
1768384628
>>> my_hash('night')
1768384628

The problem is that if we have a number H in base 256, then H mod 2^32 is exactly the last four digits of the number, because 2^32 = (2^8)^4 = 256^4. If that's not obvious, consider the same phenomenon in base 10: 2276345 mod 10000 = 6345, because 10000 = 10^4. And so this hash function depends only on the last four characters in the string.

More generally, if B is the base of our digits (e.g. B = 256 here), and N is the size of the output range (e.g. N = 2^32 here), then we will probably get the best hash behavior if B and N are relatively prime. So if we want a better hash function, we must change B or N. Assuming that we want to produce values in a certain given range, we'd like to keep N as it is. So let's change B. In fact it will probably be best if B is not too close to a power of two (for number-theoretic reasons that we won't go into here). A good choice for B might be the prime number 1,000,003 that we saw above. Let's modify our hash function to use it:

# Generate a hash code in the range 0 .. 2^32 - 1
def my_hash(s):
    h = 0
    for c in s:
        d = ord(c)    # 0 .. 255
        h = 1_000_003 * h + d
    return h % (2 ** 32)

Now we get distinct values for the strings we saw above:

>>> my_hash('bright')
2969542966
>>> my_hash('light')
1569733066
>>> my_hash('night')
326295660

To be clear, we are now considering the input string to be a series of digits in base 1,000,003! This also means that our input string can now reasonably contain any Unicode characters that we like.
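
For example, characters with ordinal values above 255 now pose no problem; for a one-character string, the hash value is simply the character's ordinal value:

>>> ord('č')
269
>>> my_hash('č')
269
>>> my_hash('čč')    # 1_000_003 * 269 + 269
269001076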

A disadvantage of the function above is that it computes an integer h that will be huge if the input string is large, since it encodes all of the characters in the string. That may be inefficient, and in fact many programming languages don't support large integers of this sort.

However, we can make a tiny change to the code so that it will compute the same output values, but be far more efficient. Rather than taking the result mod 2^32 at the end, we can perform the mod operation at every step of the calculation:

# Generate a hash code in the range 0 .. 2^32 - 1
def my_hash(s):
    h = 0
    for c in s:
        d = ord(c)    # 0 .. 255
        h = (1_000_003 * h + d) % (2 ** 32)
    return h

This function computes the same hash values that the previous version did:

>>> my_hash('bright')
2969542966
>>> my_hash('light')
1569733066
>>> my_hash('night')
326295660

Why does this trick work? A useful mathematical fact is that if you're performing a series of additions and multiplications and you want the result mod N, you can actually perform a (mod N) operation at any step along the way, and you will still get the same result. I won't prove this statement here, but ultimately it is true because of this fact, which you might see in a number theory course: if a ≡ b (mod N) and c ≡ d (mod N), then a + c ≡ b + d (mod N) and ac ≡ bd (mod N). In any case, this is especially useful in lower-level languages such as C that have fixed-size integers, because unsigned arithmetic operations in those languages automatically compute the result mod N for some fixed N (where typically N = 2^32 or N = 2^64).
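
As a quick sanity check of this fact in Python (with arbitrary example numbers of my own):

N = 2 ** 32
a, b = 123_456_789_123, 987_654_321_987

# reducing mod N along the way gives the same result mod N
assert (a + b) % N == ((a % N) + (b % N)) % N
assert (a * b) % N == ((a % N) * (b % N)) % N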

hash tables

A hash table is a data structure used to implement either a set of keys, or a map from keys to values. A hash table is often more efficient than a binary tree (which can also be used to implement a set or map, as we saw recently). Hash tables are simple: unlike binary trees, they do not need complex code to stay balanced. For this reason, hash tables are very widely used, probably even more so than binary trees for storing arbitrary maps from keys to values.

The most common method for implementing a hash table is chaining, in which the table contains an array of buckets. Each bucket contains a hash chain, which is a linked list of keys (or key/value pairs in the case of a map). For example, we can picture a small hash table with 4 buckets storing a set of string keys: each of the 4 array slots holds the head of a (possibly empty) linked list of strings.

In some hash table implementations, the array of buckets has a fixed size. In others, it can expand dynamically. For the moment, we will assume that the number of buckets is a constant B.

A hash table requires a hash function h(k) that can map each key to a bucket number. Typically we choose h(k) = h1(k) mod B, where h1 is a preexisting hash function that maps keys to larger integers (such as the my_hash function we wrote above). If a key k is present in a hash table, it is always stored in the hash chain in bucket h(k). In other words, the hash function tells us which bucket a key belongs in.

With this structure, it is straightforward to implement the set operations contains(), add(), and remove(). contains(x) walks down the hash chain in bucket h(x), looking for a node with value x. If it finds such a node in the chain, it returns True, otherwise False. add(x) will first call contains(x) to check whether the value to be added is already present in the table. If it is not, it will prepend a new node with value x to the hash chain in bucket h(x). remove(x) will look for a node with value x in the hash chain in bucket h(x). If it finds such a node, it will delete it from the linked list.

Here is a partial implementation in Python of a hash table representing a set of objects. The remove() method is left as an exercise for you.

class Node:
    def __init__(self, key, next):
        self.key = key
        self.next = next

class HashSet:
    def __init__(self, numBuckets):
        # each array element is the head of a linked list of Nodes
        self.a = numBuckets * [None]
    def contains(self, x):
        i = hash(x) % len(self.a)     # hash bucket index
        p = self.a[i]
        while p is not None:
            if p.key == x:            # found the key in this chain
                return True
            p = p.next
        return False

    def add(self, x):
        if not self.contains(x):
            i = hash(x) % len(self.a)
            self.a[i] = Node(x, self.a[i])   # prepend to hash chain
    

It is straightforward to extend this hash set implementation to store a map from keys to values: all we have to do is store a key/value pair in each node, just like when we use a binary search tree to store a map.
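
Here is how the set above might be used in practice (a small hypothetical session):

>>> s = HashSet(4)
>>> s.add('cat')
>>> s.add('dog')
>>> s.contains('cat')
True
>>> s.contains('fox')
False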

We'd now like to consider the running time of hash table operations. Suppose that a hash table has N nodes in B buckets. Then its load factor α is defined as α = N / B. This is the average number of nodes per bucket, i.e. the average length of each hash chain.

Suppose that our hash function distributes keys evenly among buckets. Then any lookup in a hash table that misses (e.g. a contains() request for a key that is absent) will effectively be choosing a bucket at random. So it will examine α nodes on average as it walks the bucket's hash chain to look for the key. This shows that such lookups run in time O(α) on average, independent of N.

The analysis is a bit trickier for lookups that hit, since these are more likely to search a bucket that has a longer hash chain. Nevertheless it can also be shown that these run in time O(α) on average if the hash function distributes keys evenly. In fact all of our hash set operations (add, contains, remove) will run in O(α) on average.

So we can make hash table operations arbitrarily fast (on average) by keeping α small, i.e. by using as many hash buckets as needed. Of course, this supposes that we know in advance how many items we will be storing in a hash table, so that we can preallocate an appropriate number of buckets.

However, even if that number is not known, we can dynamically expand a hash table whenever its load factor grows above some fixed limit α0. To grow the table, we allocate a new bucket array, typically twice the size of the old one. Then we loop over all the nodes in the buckets in the old array, and insert them into the new array. We must recompute each key's hash value to find its position in the new array, which may not be the same as in the previous, smaller array.
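
As a sketch of how this might look for the HashSet class above (the names GrowingHashSet and LOAD_LIMIT, and the limit value 0.75, are my own choices, not fixed by anything above):

LOAD_LIMIT = 0.75    # an assumed value for the limit α0

class GrowingHashSet(HashSet):
    def __init__(self, numBuckets):
        super().__init__(numBuckets)
        self.count = 0                          # number of keys stored

    def add(self, x):
        if not self.contains(x):
            i = hash(x) % len(self.a)
            self.a[i] = Node(x, self.a[i])      # prepend to hash chain
            self.count += 1
            if self.count / len(self.a) > LOAD_LIMIT:
                self.grow()

    def grow(self):
        old = self.a
        self.a = (2 * len(old)) * [None]        # twice as many buckets
        for p in old:                           # rehash every node
            while p is not None:
                i = hash(p.key) % len(self.a)   # bucket may differ from before
                self.a[i] = Node(p.key, self.a[i])
                p = p.next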

Suppose that we start with an empty hash table and insert N values into it, doubling the number of buckets whenever the load factor exceeds some fixed value α0. Then how long will it take to insert the N values, as a function of N?

If we exclude the time for the doubling operations, then each insertion operation will run in O(1). That's because each insertion runs in O(α) (since it must traverse an existing hash chain to check whether the value being inserted is already present), and α will always be less than the constant α0.

Now let's consider the time spent growing the table. The time to perform each doubling operation is O(M), where M is the number of elements in the hash table at the moment we perform the doubling. That's because we must rehash M elements, and each rehash operation takes O(1), since we can compute a hash value and prepend an element to a linked list in constant time. If there are initially k buckets in the hash table, then the first doubling operation will double the size to 2k. Considering the end of the process, the last doubling operation will double the size of the hash table to N. The second-to-last operation will double it to N / 2, the operation before that will double it to N / 4, and so on. The total doubling time will be

O(k) + O(2k) + … + O(N / 4) + O(N / 2) + O(N)

= O(1 + 2 + 4 + … + N / 4 + N / 2 + N)

= O(N)

So we can insert N elements in O(N), which means that insertion takes O(1) on average, even as the hash table grows arbitrarily large.

priority queues

In recent lectures we learned about stacks and queues, which are abstract data types that we can implement in various ways, such as using an array or a linked list.

A priority queue is another abstract data type. At a minimum, a priority queue might provide the following methods:

q.add(value)
Add a value to a priority queue.
q.is_empty()
Return true if the queue is empty.
q.remove_smallest()
Remove the smallest value from a priority queue and return it.

A priority queue differs from a stack and an ordinary queue in the order in which elements are removed. A stack is last in first out: the pop function removes the element that was added most recently. An ordinary queue is first in first out: the dequeue function removes the element that was added least recently. In a priority queue, the remove_smallest method removes the element with the smallest value.

The interface above describes a min-queue, in which we can efficiently remove the smallest value. Alternatively we can build a max-queue, which has a remove_largest method that removes the largest value; this is more convenient for some applications. There is no fundamental difference between these: any data structure that implements a min-queue can be trivially modified to produce a max-queue, by changing the direction of element comparisons.
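
To make this interface concrete, here is a naive list-based min-queue (my own illustration, not an efficient implementation: add runs in O(1), but remove_smallest must scan the whole list, so it runs in O(N)):

class NaivePriorityQueue:
    def __init__(self):
        self.items = []

    def add(self, value):
        self.items.append(value)

    def is_empty(self):
        return len(self.items) == 0

    def remove_smallest(self):
        i = self.items.index(min(self.items))   # position of the smallest value
        return self.items.pop(i)

Replacing min with max in remove_smallest is all it would take to turn this into a max-queue, which illustrates the point about reversing comparisons.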

In theory we could implement a priority queue using a binary search tree. If we did so and the tree was balanced, then add and remove_smallest would run in time O(log N), where N is the number of elements in the queue. But there are more efficient data structures for implementing priority queues, such as binary heaps, which we will discuss in the next lecture.