A hash function maps values of some type T to integers in a fixed range. Often we will want a hash function that produces values in the range 0 .. (N – 1), where N is a power of 2. Hash functions are very useful, and are a common building block in programming and also in theoretical computer science.
In general, there may be many more possible values of T than integers in the output range. (For example, if T is the type of strings, there are infinitely many possible values.) This means that hash functions will inevitably map some distinct input values to the same output value; this is called a hash collision. A good hash function will produce relatively few collisions in practice: even if two input values are similar to each other, they should be unlikely to have the same hash value. An ideal hash function will produce hash collisions in practice no more often than would be expected if it were producing random outputs.
Our immediate motivation for hash functions is that we'll use them to build hash tables, which we'll see in the next lecture. A hash table, like a binary tree, can be used to implement a set. Broadly speaking, if we have some values (e.g. strings) that we want to store in a hash table, we'll divide the values into some number K of distinct buckets. To decide which bucket to place a value V in, we'll compute a hash of the value V, producing an integer n, and will then compute a bucket number (n mod K). By dividing the values in this way, we can implement lookup operations efficiently. We want a hash function that produces numbers that are roughly evenly distributed among the range 0 .. (N - 1), so that our buckets will contain roughly equal numbers of values. That's all we need to know about hash tables for the moment; we'll discuss them in more detail in the next lecture.
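For example, if a value's hash code is n = 1,000 and we have K = 64 buckets, the value goes into bucket 1,000 mod 64 = 40.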
In practice, we will often want a hash function that takes strings as input. As a starting point, suppose that we want a hash function that takes strings of characters with ordinal values from 0 to 255 (i.e. with no fancy Unicode characters) and produces hash values in the range 0 ≤ v < N for some constant N.
As one simple idea, we could add the ordinal values of all characters in the input string, then take the result mod N. For example, with N = 2³²:
# Generate a hash code in the range 0 .. 2^32 – 1.
def my_hash(s):
    return sum([ord(c) for c in s]) % (2 ** 32)

This is a poor hash function. If the input strings are short, then the output values will always be small integers. Furthermore, two input strings that contain the same characters in a different order (a very common occurrence) will hash to the same number.
Here is one way to construct a better hash function, called modular hashing. Given any string s, consider it as a series of digits forming one large number H. For example, since our characters have ordinal values in the range from 0 to 255, we can imagine them to be digits in base 256. This will give us an integer H, and then we can compute H mod N, producing a hash value in the range from 0 to N – 1.
Here is Python code that implements this idea, with N = 2³²:
# Generate a hash code in the range 0 .. 2^32 – 1
def my_hash(s):
    h = 0
    for c in s:
        d = ord(c)   # 0 .. 255
        h = 256 * h + d
    return h % (2 ** 32)

As you can see, this code is using the algorithm for combining digits in any base that we learned in one of the very first lectures of this class!
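For example, for the two-character string 'hi' we have ord('h') = 104 and ord('i') = 105, so the loop computes H = 256 · 104 + 105 = 26,729, and the hash value is 26,729 mod 2³² = 26,729.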
As we wrote above, a good hash function will allow us to distribute a set of values into a set of K buckets, with roughly the same number of values in each bucket. Specifically, we'll place each value v into the bucket h(v) mod K, where h is the hash function. To test this hash function, let's place all the words in this poem into 64 buckets, then plot the result.
import matplotlib.pyplot as plt

def my_hash(s):
    ...

with open('poem.txt') as f:
    words = { word for line in f for word in line.split() }

NUM_BUCKETS = 64
buckets = [0] * NUM_BUCKETS

for word in words:
    h = my_hash(word)
    b = h % NUM_BUCKETS
    buckets[b] += 1

plt.bar(range(64), buckets)
plt.show()
We see that the result is poor: the values are not evenly distributed at all.
In fact, our hash function often maps similar strings to the same hash value:
>>> my_hash('bright')
1768384628
>>> my_hash('light')
1768384628
>>> my_hash('night')
1768384628

The problem is that if we have a number H in base 256, then H mod 2³² is exactly the last four digits of the number, because 2³² = 256⁴. If that's not obvious, consider the same phenomenon in base 10: 2,276,345 mod 10,000 = 6,345, because 10,000 = 10⁴.
And so this hash function only depends on the last four characters in the string. By the way, here's another way to think about this: for any number written in binary, taking it mod 2³² gives you its last 32 bits, and those last 32 bits depend on only the last four 8-bit characters.
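As a quick check (these particular strings are chosen just for illustration), any two strings that end with the same four characters will collide:

>>> my_hash('abcdwxyz') == my_hash('wxyz')
True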
In our poem hashing example above, we computed a bucket number h(v) mod K, where K = 64. This makes the problem even worse. If H is the value of the string as a base-256 number, then we are computing ((H mod 2³²) mod 64), which is the same as H mod 64 (since 2³² is divisible by 64). And because 256 is divisible by 64, that value depends on only the last character of the string:
>>> my_hash('watermelon') % 64
46
>>> my_hash('man') % 64
46
>>> my_hash('n') % 64
46

To put it differently, we're taking only the last 6 bits of the number H (since 64 = 2⁶), which depend only on the last (8-bit) character of the input string.
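Indeed, all three strings above end with the character 'n', and ord('n') = 110, so each of them lands in bucket 110 mod 64 = 46.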
How can we improve the situation? Generally speaking, if B is the base of our digits (e.g. B = 256 here), and N is the size of the hash function output range (e.g. N = 2³² here), then we will probably get the best hash behavior if B and N are relatively prime, i.e. have no common factors other than 1. So if we want a better hash function, we must change B or N. Assuming that we want to produce values in a certain fixed range, we'd like to keep N as it is. So let's change B. In fact we'd like B to be larger anyway, so that we can process strings with Unicode characters, whose ordinal values might be as high as 10FFFF₁₆ = 1,114,111.
In fact it will probably be best if B is not too close to a power of 2 (for number-theoretic reasons that we won't go into here). A good choice for B might be the prime number 1,000,003, which is used in various popular implementations of hash functions, including Python's own built-in hash function. (Actually this is less than the highest possible Unicode value, so this might cause poor behavior if we have some Unicode characters with values higher than 1,000,003. Such characters are rare, so let's not worry about that.) To be clear, we will now consider the input string to be a series of digits in base 1,000,003!
Let's modify our hash function to use this value of B:
# Generate a hash code in the range 0 .. 2^32 - 1
def my_hash(s):
    h = 0
    for c in s:
        d = ord(c)
        h = 1_000_003 * h + d
    return h % (2 ** 32)

Now we get distinct values for the strings we saw above:
>>> my_hash('bright')
2969542966
>>> my_hash('light')
1569733066
>>> my_hash('night')
326295660

Let's now rerun the experiment we ran above, in which we use the hash function to place all words from a poem into one of 64 hash buckets:
This looks like a much better distribution.
Now, a disadvantage of our latest version of my_hash() is that it computes an integer h that will be huge if the input string is long, since it encodes all of the characters in the string. That may be inefficient, and in fact many programming languages don't support arbitrarily large integers of this sort.
However, we can make a tiny change to the code so that it will compute the same output values, but be far more efficient. Rather than taking the result mod 2³² at the end, we can perform it at every step of the calculation:
# Generate a hash code in the range 0 .. 2^32 - 1
def my_hash(s):
    h = 0
    for c in s:
        d = ord(c)   # 0 .. 255
        h = (1_000_003 * h + d) % (2 ** 32)
    return h

This function still computes the same hash values that the previous version did:
>>> my_hash('bright')
2969542966
>>> my_hash('light')
1569733066
>>> my_hash('night')
326295660

Why does this trick work? A useful mathematical fact is that if you're performing a series of additions and multiplications and you want the result mod N, you can actually perform a (mod N) operation at any step along the way, and you will still get the same result! I won't prove this statement here, but ultimately it is true because of this fact, which you may have seen in your math classes: if
a ≡ b (mod N) and c ≡ d (mod N)
then
a + c ≡ b + d (mod N)
and
a ⋅ c ≡ b ⋅ d (mod N) .
This fact is especially useful in lower-level languages such as C that have fixed-size integers, because arithmetic operations in those languages automatically compute the result mod N for some fixed N (where typically N = 2³² or N = 2⁶⁴).
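As a quick sanity check (these particular numbers are arbitrary), we can verify the claim numerically:

>>> N = 2 ** 32
>>> a, b = 2 ** 40 + 3, 2 ** 35 + 7
>>> (a * b) % N == ((a % N) * (b % N)) % N
True
>>> (a + b) % N == ((a % N) + (b % N)) % N
True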
hashing other data types
Of course, we may want to hash other data types, such as pairs, or lists. In theory, we could convert any value of any type to a string representation such as "[32, 57, 88]", then apply the string hashing function we wrote above. However, that would not be very efficient. As a better idea, if we can convert a value of any other type to a large integer, then we can use the same modular hashing technique we saw above. For example, if we want to compute a hash code for a list of integers, we might first take each integer in the list mod P (where P is a large fixed prime), then combine the resulting integers as the digits of a large number written in base P, then take that number mod N. I won't go into details about this here, but the same general idea will apply to most data types.
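As a rough sketch of this idea (the constants and the function name here are just illustrative, not part of any standard library), hashing a list of integers might look like this:

# Modular hashing for a list of integers: treat the list as digits in base P.
P = 1_000_003      # a large fixed prime, used as the base
N = 2 ** 32        # size of the output range

def hash_int_list(values):
    h = 0
    for v in values:
        d = v % P                # reduce each integer to a digit in 0 .. P - 1
        h = (P * h + d) % N      # combine digits in base P, reducing mod N as we go
    return h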
A hash table is a data structure used to implement either a set of values (also called keys), or a map (i.e. dictionary) from keys to values. A hash table is often more efficient than a binary search tree, which can also be used to implement a set or map, as we saw recently. Hash tables are simple, and unlike binary trees they do not require complex code to stay balanced. For this reason, hash tables are widely used in practice. (In fact Python stores sets and dictionaries as hash tables internally.)
One common method for implementing a hash table is chaining, in which the table contains an array of buckets. Each bucket contains a set of keys (or key/value pairs in the case of a map). We can store these either in a linked list, or a dynamic array. In Python, it's easier (and quite possibly more efficient) to use a dynamic array.
For example, here is a picture of a small hash table with 4 buckets that stores a set of string keys using dynamic arrays:
In some hash table implementations, the array of buckets has a fixed size. In others, it can expand dynamically. For the moment, we will assume that the number of buckets is a constant B.
A hash table requires a hash function h(k) that can map each key to a bucket number. Typically we choose h(k) = h₁(k) mod B, where h₁ is a preexisting hash function that maps keys to integers in some larger fixed range (such as the my_hash() function we wrote in the previous lecture, or Python's built-in hash() function). If a key k is present in a hash table, it is always stored in bucket h(k). In other words, the hash function tells us which bucket a key belongs in.
With this structure, it is straightforward to implement the set operations contains(), add(), and remove():
contains(x) looks for the value x in bucket h(x) and returns True if it finds it, otherwise False.
add(x) adds the value x to bucket h(x) if it is not already present there.
remove(x) removes the value x from bucket h(x) if it is present.
Here is an implementation in Python of a hash table representing a set of objects, using dynamic arrays (i.e. Python lists) to hold the values in each bucket:
class HashSet:
    def __init__(self, num_buckets):
        self.a = [[] for i in range(num_buckets)]

    def contains(self, x):
        i = hash(x) % len(self.a)   # hash bucket index
        return x in self.a[i]

    def add(self, x):
        i = hash(x) % len(self.a)
        if x not in self.a[i]:
            self.a[i].append(x)

    def remove(self, x):
        i = hash(x) % len(self.a)
        if x in self.a[i]:
            self.a[i].remove(x)
It is straightforward to extend this hash set implementation to store a map from keys to values: all we have to do is store key/value pairs in place of keys (similarly to when we use a binary search tree to store a map).
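As a rough sketch of that extension (the class and method names here are just illustrative), each bucket could hold (key, value) pairs:

class HashMap:
    def __init__(self, num_buckets):
        self.a = [[] for i in range(num_buckets)]

    def get(self, key):
        i = hash(key) % len(self.a)     # hash bucket index
        for k, v in self.a[i]:
            if k == key:
                return v
        return None                     # key not present

    def set(self, key, value):
        i = hash(key) % len(self.a)
        for j, (k, v) in enumerate(self.a[i]):
            if k == key:
                self.a[i][j] = (key, value)   # replace the existing value
                return
        self.a[i].append((key, value))        # key not present: add a new pair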
We'd now like to consider the running time of hash table operations. Suppose that a hash table has N keys in B buckets. Then its load factor α is defined as α = N / B. This is the average number of keys per bucket.
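For example, a hash table holding N = 192 keys in B = 64 buckets has load factor α = 192 / 64 = 3.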
Suppose that our hash function distributes keys evenly among buckets. Then any lookup in a hash table that misses (e.g. a contains() request for a key that is absent) will effectively be choosing a bucket at random. So it will examine α values on average as it walks the bucket's array or linked list to look for the key. This shows that such lookups run in time O(α) on average, independent of N.
The analysis is a bit trickier for lookups that hit, since these are more likely to search a bucket that has more values. Nevertheless it can also be shown that these run in time O(α) on average if the hash function distributes keys evenly. In fact all of our hash set operations (add, contains, remove) will run in O(α) on average.
Now, really we are interested in the running time of operations as a function of N, the number of values in the hash table. Suppose that we begin with an empty hash table with B buckets, and insert N values into it:
h = HashSet(B)
for i in range(N):
    h.add(i)

How long will this code take to run? When we perform the first insertion, there are no values in the table, so the load factor α is 0 / B = 0. On the next insertion, the load factor is 1 / B, and so on. Each insertion runs in O(α). So the total time will be
O(0 / B + 1 / B + 2 / B + ... + (N - 1) / B) = O((0 + 1 + 2 + ... + (N - 1)) / B) = O(N(N - 1) / 2B) = O(N²)
since 0 + 1 + 2 + ... + (N - 1) = N(N - 1) / 2 and B is a constant.
And so the average time for each of our N insertions is O(N).
A hash table with only one bucket is just a linked list or dynamic array. By using B buckets, we've effectively divided the items in the table into B linked lists or arrays, which makes our operations faster by a factor of B. However, if B is a constant, then the big-O running time of our operations is just the same as for a single linked list or array, since big-O is not affected by a constant factor. To put it differently, if B is a constant then the load factor α = N / B will grow arbitrarily large as N grows large, and so our operations will become arbitrarily slow.
To achieve a better big-O running time, we can dynamically expand a hash table whenever its load factor grows above some fixed limit αₘₐₓ. To grow the table, we allocate a new bucket array, typically twice the size of the old one. Then we loop over all the values in the buckets in the old array, and insert each one into a bucket in the new array. We must recompute each key's bucket number (its hash value mod the new number of buckets), which may not be the same as in the previous, smaller array.
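Here is a rough sketch of what this might look like, building on the HashSet class above (MAX_LOAD, GrowableHashSet and _grow() are just illustrative names, not a complete implementation):

MAX_LOAD = 2.0   # an arbitrary choice for the maximum load factor

class GrowableHashSet(HashSet):
    def __init__(self):
        super().__init__(4)         # start with a small number of buckets
        self.size = 0               # number of values currently stored

    def add(self, x):
        i = hash(x) % len(self.a)
        if x not in self.a[i]:
            self.a[i].append(x)
            self.size += 1
            if self.size / len(self.a) > MAX_LOAD:
                self._grow()

    def _grow(self):
        old = self.a
        self.a = [[] for i in range(2 * len(old))]        # double the number of buckets
        for bucket in old:
            for x in bucket:
                self.a[hash(x) % len(self.a)].append(x)   # recompute each value's bucket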
Suppose that we start with an empty hash table and insert N values into it, expanding the number of buckets by a factor of 2 whenever the load factor exceeds some fixed value αₘₐₓ. Then how long will it take to insert the N values, as a function of N?
If we exclude the time for the expansion operations, then each insertion operation will run in O(1). That's because each insertion will run in O(α) (since it must look in an existing bucket to check whether the value being inserted is already present), and α will always be less than the constant αₘₐₓ.
Now let's consider the time spent growing the table. Consider a single table expansion, and let M be the number of values in the hash table at the moment we perform the expansion. We must rehash M values. We can compute a hash value in O(1), and can add an element to a linked list (by prepending it) or to a dynamic array in constant time on average. So the time to perform the expansion is O(M). If there are initially k buckets in the hash table, then the first expansion will double the size to (2k). Considering the end of the process, the last expansion will double the size of the hash table to N. The second-to-last expansion will double it to (N / 2), the expansion before that will double it to (N / 4), and so on. The total expansion time will be
O(k) + O(2k) + … + O(N / 4) + O(N / 2) + O(N)
≤ O(1 + 2 + 4 + … + N / 4 + N / 2 + N)
= O(N)
So we can insert N values in O(N), which means that inserting an item takes O(1) on average, even as the hash table grows arbitrarily large.
I won't provide Python code here, but implementing a dynamically expanding hash table is a useful exercise.