Some of this week's topics are covered in Problem Solving with Algorithms:
And in Introduction to Algorithms:
11. Hash Tables
11.2 Hash tables
11.3 Hash functions
Here are some additional notes.
A hash function maps values of some type T to integers in a fixed range. Hash functions are commonly used in programming and in computer science in general.
We can construct hash functions in various ways. A good hash function will produce values that are uniformly distributed among its output range in practice, even when input values are similar to each other. Additionally, a small change to the function's input should ideally produce a large and unpredictable change in its output.
In general, there may be many more possible values of T than integers in the output range. This means that hash functions will inevitably map some distinct input values to the same output value; this is called a hash collision. A good hash function will produce hash collisions in practice no more often than would be expected if it were producing random outputs.
Suppose that we want a hash function that takes strings of ASCII characters (i.e. with ordinal values from 0 to 127) and produces 32-bit hash values in the range 0 ≤ v < 2³². Here is a terrible hash function for that purpose:
def hash(s):
    return ord(s[0])
This hash function is poor because it uses only the first character of the string, so all strings with the same first character will receive the same hash value. Furthermore, the function produces only a tiny subset of the possible values in the output range: since the strings contain ASCII characters, each character's ordinal value is less than 128, so at most 128 of the 2³² possible outputs can ever occur.
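We can see both weaknesses interactively (this uses the hash function just defined, which shadows Python's built-in hash):

>>> hash("apple") == hash("avocado")   # both begin with 'a', so they collide
True
>>> hash("zebra")                      # no output can ever exceed 127
122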
As another idea, we could add the ordinal values of all characters in the input string:
def hash(s):
    return sum(ord(c) for c in s)
This is also a poor hash function. If the input strings are short, then the output values will always be small integers. Furthermore, two input strings that contain the same characters in a different order (a common occurrence) will hash to the same number.
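Again, a quick check (using the summing hash just defined) shows that any rearrangement of a string collides with the original:

>>> hash("listen") == hash("silent")   # same characters, same sum
True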
Here is one way to construct a better hash function, called modular hashing. Given any string s, consider it as a series of digits forming one large number N. For example, if the string contains ASCII characters, then these will be digits in base 128. Now we compute N mod M for some constant M, producing a hash value in the range from 0 to M – 1. In the computation, it is best to take the result mod M as we process each digit. That produces the same result as if we performed a single modulus operation at the end, but is more efficient because we avoid generating large integers:
# Given a string of characters whose ordinal values range from 0 to (D - 1),
# produce a hash value in the range 0 .. M - 1.
def hash(s):
    h = 0
    for c in s:
        h = (h * D + ord(c)) % M
    return h
(Recall that we saw how to combine digits of a number in any base back in lecture 2. We are performing the same operation here, only modulo M.)
This hash function will still be terrible if M is a multiple of D. For example, if we let D = 256 (since we have a string of 8-bit characters) and M = 2³², then the hash code's output will depend only on the last 4 characters of each string! That's because 2³² = (2⁸)⁴ = 256⁴, so if we write any number N in base 256, then (N mod 2³²) contains only the last four digits. (It may be easier to see this in decimal: if N is a number written in base 10, then (N mod 10⁴) contains only the last four digits of the number.) If D = 128, a similar phenomenon occurs, since 2³² is a multiple of 128 as well: only the last few characters can affect the output.
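We can verify this concretely. In the sketch below, modhash is my name for the modular hash above with D = 256 and M = 2³²; two strings that agree in their last four characters collide no matter what comes before:

D, M = 256, 2**32

def modhash(s):
    h = 0
    for c in s:
        h = (h * D + ord(c)) % M
    return h

# The strings differ only before their last four characters,
# so (N mod 2**32) is the same for both.
print(modhash("AAAAwxyz") == modhash("BBBBwxyz"))   # True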
If we don't want to change M, then we need to choose a value of D that yields good hashing behavior, i.e. produces values with a roughly uniform distribution. D should be larger than the largest ordinal value of any character in the input strings. If M is a power of two, then a prime value of D may work well, but it is probably best to avoid primes such as 257 that are close to a power of two. I recommend D = 1,000,003, a prime number that has worked well in my own experiments (and is actually used in Python's own hash implementation). Determining which other values of D would work well would require a bit of number theory and possibly some experimentation.
Alternatively, we can keep D equal to a power of two (e.g. 2¹⁶ for 16-bit characters) and choose a different value of M. In this case it is probably safest if M is prime and not too close to a power of 2.
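For instance, here is the same modular hash with an illustrative prime modulus (1,000,000,007 is my choice for the example; any prime not too close to a power of two should behave similarly):

D = 2**16            # 16-bit characters, as above
M = 1_000_000_007    # a prime not too close to a power of two

def modhash(s):
    h = 0
    for c in s:
        h = (h * D + ord(c)) % M
    return h

print(modhash("AAAAwxyz") == modhash("BBBBwxyz"))   # False for this pair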
A hash table is a data structure used to implement either a set of keys, or a map from keys to values. A hash table is often more efficient than a binary tree (which, as we saw recently, can also be used to implement a set or map). Hash tables are simple, and unlike binary trees they do not require complex code to stay balanced. For this reason, hash tables are very widely used, probably even more so than binary trees for storing arbitrary maps from keys to values.
The most common method for implementing a hash table is chaining, in which the table contains an array of buckets. Each bucket contains a hash chain, which is a linked list of keys (or key/value pairs in the case of a map).
In Python, here is how we might implement a hash table representing a set of objects:
class Node:
    def __init__(self, key, next):
        self.key = key
        self.next = next

class HashSet:
    def __init__(self, numBuckets):
        # each array element is the head of a linked list of Nodes
        self.a = numBuckets * [None]
In some hash table implementations, the array of buckets has a fixed size. In others, it can expand dynamically. For the moment, we will assume that the number of buckets is a constant B.
A hash table requires a hash function h(k) that can map each key to a bucket number. Typically we use a preexisting hash function h₁ that maps keys to larger integers, then let h(k) = h₁(k) mod B. If a key k is present in a hash table, it is always stored in the hash chain a[h(k)]. In other words, the hash function tells us which bucket a key belongs in.
Let's expand the HashSet class with a method contains that checks whether a value is present in a HashSet, and a method add that adds a value if not already present:
def contains(self, x):
    b = hash(x) % len(self.a)
    p = self.a[b]
    while p is not None:
        if p.key == x:
            return True
        p = p.next
    return False

def add(self, x):
    if not self.contains(x):
        b = hash(x) % len(self.a)
        self.a[b] = Node(x, self.a[b])   # prepend to hash chain
We haven't implemented a remove method here to delete a key from a hash table, but that would be straightforward: find the key's node in the hash bucket that contains it, then delete the node from the containing linked list.
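Here is one possible remove method (a sketch of my own, matching the Node and HashSet classes above):

def remove(self, x):
    b = hash(x) % len(self.a)
    p = self.a[b]
    prev = None
    while p is not None:
        if p.key == x:
            if prev is None:
                self.a[b] = p.next    # the key was at the head of the chain
            else:
                prev.next = p.next    # splice the node out of the list
            return True
        prev, p = p, p.next
    return False                      # the key was not present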
Suppose that a hash table has N key/value pairs in B buckets. Then its load factor α is defined as α = N / B. This is the average number of key/value pairs per bucket, i.e. the average length of each hash chain.
Suppose that our hash function distributes keys evenly among buckets. Then any lookup in a hash table that misses (either a get request for a key that is absent, or setting the value of a new key) will effectively be choosing a bucket at random. On average it will examine α nodes as it walks that bucket's hash chain looking for the key. This shows that such lookups run in O(α) time on average, independent of N.
The analysis is a bit trickier for lookups that hit, since these are more likely to search a bucket that has a longer hash chain. Nevertheless it can also be shown that these run in time O(α) on average.
So we can make hash table lookups arbitrarily fast (on average) by keeping α small, i.e. by using as many hash buckets as needed. Of course, this supposes that we know in advance how many items we will be storing in a hash table, so that we can preallocate an appropriate number of buckets.
However, even if that number is not known, we can dynamically grow a hash table whenever its load factor grows above some fixed limit. To grow the table, we allocate a new bucket array, typically twice the size of the old one. Then we loop over all the nodes in the buckets in the old array, and insert them into the new array. We recompute each key's hash value to find its position in the new array, which may not be the same as in the previous, smaller array.
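As a sketch, here is how our HashSet might grow itself (this assumes __init__ also initializes self.size = 0, and the load-factor limit of 3.0 is an illustrative choice; neither appears in the class above):

LOAD_LIMIT = 3.0

def _grow(self):
    old = self.a
    self.a = (2 * len(old)) * [None]        # twice as many buckets
    for head in old:                        # rehash every node
        p = head
        while p is not None:
            b = hash(p.key) % len(self.a)   # position in the new, larger array
            self.a[b] = Node(p.key, self.a[b])
            p = p.next

def add(self, x):
    if not self.contains(x):
        b = hash(x) % len(self.a)
        self.a[b] = Node(x, self.a[b])      # prepend to hash chain
        self.size += 1
        if self.size / len(self.a) > LOAD_LIMIT:
            self._grow()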
Suppose that we start with an empty hash table and insert N values into it, doubling the number of buckets whenever the load factor exceeds some fixed value α₀. Then how long will it take to insert the N values, as a function of N?
If we exclude the time for the doubling operations, then each insertion operation will run in O(1). That's because each insertion runs in O(α) (since it must traverse an existing hash chain to check whether the value being inserted is already present), and α always remains below the constant α₀.
Now let's consider the time spent growing the table. The time to perform each doubling operation is O(M), where M is the number of elements in the hash table at the moment we perform the doubling. That's because we must rehash M elements, and each rehash operation takes O(1), since we can compute a hash value and prepend an element to a linked list in constant time. Suppose that the hash table initially contains k buckets. Then we will perform the first doubling operation when there are kα₀ values in the table. Let's also suppose that we perform the last doubling operation as we insert the last (Nth) item into the table. Then the total time for all the doubling operations will be
O(kα₀ + 2kα₀ + 4kα₀ + … + N)
  = O(kα₀ · (1 + 2 + 4 + … + N / (kα₀)))
  = O(1 + 2 + 4 + … + N / (kα₀))   (since k and α₀ are constants)
  = O(N / (kα₀))   (a geometric series is at most twice its largest term)
  = O(N)
So we can insert N elements in O(N), which means that insertion takes O(1) on average, even as the hash table grows arbitrarily large.
When we write a class in Python, by default Python considers two objects of our class to be equal only if they actually are the same object. For example:
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

>>> x = Point(3, 4)
>>> y = Point(3, 4)
>>> x == y
False
>>>
Python's default hash function uses this same notion of equality. If x and y are not the same object, then they will generally have different hash values:
>>> hash(x)
8743680942113
>>> hash(y)
8743680942109
We may often want a different notion of equality. As we have seen, we can implement the magic method __eq__ to define how equality works for our class:
def __eq__(self, p):
    return self.x == p.x and self.y == p.y
If we define __eq__, but continue to use Python's default hash function, then objects of our class will not work reliably as keys in Python sets and dictionaries, which are implemented internally as hash tables. For example, suppose that we define x = Point(3, 4) and add x to a Python set. If we later define y = Point(3, 4) and look up y in the set, it may not be found: y has a different hash code, so the lookup may not look in the hash bucket where x was stored. (In fact, Python 3 guards against exactly this mistake: a class that defines __eq__ without __hash__ has its instances made unhashable, so adding one to a set raises a TypeError.)
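A quick interactive session confirms this guard (traceback abbreviated):

>>> s = set()
>>> s.add(Point(3, 4))     # Point defines __eq__ but not __hash__
Traceback (most recent call last):
  ...
TypeError: unhashable type: 'Point'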
In other words, for a hash table to work, two equal values must always have the same hash code. And so if you redefine __eq__, you must always also redefine __hash__, which is the magic method that produces a hash code for any Python object.
For our Point class, we might do that as follows:
def __hash__(self):
    return hash((self.x, self.y))
This implementation constructs the pair (x, y), then asks Python to compute a hash code for the pair. This works because two pairs with the same numbers are equal and will hence have the same hash code.
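Putting the pieces together, the full class might look like this, after which Points behave correctly as set elements and dictionary keys:

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, p):
        return self.x == p.x and self.y == p.y

    def __hash__(self):
        return hash((self.x, self.y))

>>> s = {Point(3, 4)}
>>> Point(3, 4) in s
True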
(By the way, this works similarly in other object-oriented languages such as C# or Java: when you redefine equality for a user-defined class in those languages, you must also define a custom hash code function for your class.)