Introduction to Algorithms, 2022-3
Week 9: Notes

dynamic arrays

Let's consider another abstract data type, namely a dynamic array. A dynamic array d has these operations:

d.add(x)
Append a value to the array.
d.get(i)
Retrieve the value at the given index.
d.set(i, x)
Set the value at the given index.

This interface probably looks familiar, since a Python list is a dynamic array!

In lower-level languages such as C an array typically has a fixed size. Let's now implement a dynamic array in Python, using only fixed-size arrays. That means that in our implementation we can allocate an array using an expression such as '10 * [None]', but we may not call the append() method to expand an existing array.

Here is one possible implementation:


class DynArray:
    def __init__(self):
        self.a = 10 * [None]
        self.n = 0              # current number of elements

    def add(self, x):
        if self.n >= len(self.a):   # the fixed-size array is full
            b = (len(self.a) + 10) * [None]
            for i in range(len(self.a)):
                b[i] = self.a[i]
            self.a = b
        self.a[self.n] = x
        self.n += 1
    
    def get(self, i):
        return self.a[i]
    
    def set(self, i, x):
        self.a[i] = x

The class has two attributes: a fixed-size array a plus an attribute n that indicates how many elements in the array are currently in use. Each time we add an element to the dynamic array, we first check if the underlying array a is full. If so, we allocate a new array b that is 10 elements larger, copy all elements from a to b, then set a to be the new array b.

Now suppose that we create a DynArray and add N elements in a loop:

d = DynArray()
for i in range(N):
    d.add(i * i)

How long will this take, algorithmically speaking?

The total time will be the time for all the resize operations, plus the extra work done by the last two lines in the add() method. The extra work takes O(1) per element added, so the total time of the extra work is O(N).

A single resize operation from size S to size S + 10 must allocate an array of size S + 10, then copy S elements to the new array. Both of these operations take time O(S), so the total time for a single resize operation from size S is O(S). Then the total time for all the resize operations will be

O(10 + 20 + 30 + … + N) = O(1 + 2 + 3 + … + (N / 10)) = O((N / 10)^2) = O(N^2).

So the total time for the loop to run will be O(N) + O(N^2) = O(N^2). And so the average time for each of the N calls to add() will be O(N). That's pretty slow.
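If you'd like to see this quadratic behavior for yourself, here is a small experiment you could run (a sketch only: the exact times will depend on your machine, but doubling N should roughly quadruple the running time):

import time

def fill(n):
    d = DynArray()
    for i in range(n):
        d.add(i * i)

for n in [10_000, 20_000, 40_000]:
    start = time.perf_counter()
    fill(n)
    print(n, time.perf_counter() - start)   # elapsed time in seconds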

We can make a tiny change to the add() method to dramatically improve the algorithmic running time. When we grow the array, instead of adding a constant number of elements, let's double its size. The add() method will now look like this:

    def add(self, x):
        if self.n >= len(self.a):   # the fixed-size array is full
            b = (len(self.a) * 2) * [None]  # allocate new array, twice as big
            for i in range(len(self.a)):
                b[i] = self.a[i]
            self.a = b
        self.a[self.n] = x
        self.n += 1

Now let's reconsider the running time of the loop we wrote above. Here it is again:

d = DynArray()
for i in range(N):
    d.add(i * i)

The total time for all the resize operations will now be

O(10 + 20 + 40 + 80 + … + N) = O(1 + 2 + 4 + 8 + … + (N / 10)) = O(N / 10) = O(N)

And so the average time for each of the N calls to add() will be O(1). That's an enormous improvement.

In fact Python's built-in list type is implemented in a similar fashion. That is why calls to Python's append() method run in O(1) on average. (If you really want to understand how Python lists work, you can study the source code for lists inside the CPython interpreter, which is written in C.)
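You can observe this over-allocation directly by watching how many bytes CPython reserves for a list as it grows. (This is only an illustration: CPython actually grows lists by a factor smaller than 2, and the exact sizes vary between versions, but the principle of occasional geometric over-allocation is the same.)

import sys

lst = []
last_size = sys.getsizeof(lst)
for i in range(100):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != last_size:
        # the underlying array was reallocated with extra capacity
        print(f'len = {len(lst):3}   bytes allocated = {size}')
        last_size = size

Notice that the allocated size changes only occasionally; most append() calls simply fill in capacity that was reserved earlier.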

hash functions

A hash function maps values of some type T to integers in a fixed range. Often we will want a hash function that produces values in the range 0 .. (N – 1), where N is a power of two. Hash functions are very useful, and are a common building block in programming and also in theoretical computer science.

In general, there may be many more possible values of T than integers in the output range. This means that hash functions will inevitably map some distinct input values to the same output value; this is called a hash collision. A good hash function will produce relatively few collisions in practice: even two input values that are similar to each other should be unlikely to have the same hash value. An ideal hash function will produce hash collisions in practice no more often than would be expected if it were producing random outputs.

As a first example, let's think about how to construct a hash function that maps an integer of any size, i, to a hash value in the range 0 .. (N – 1), for some fixed value of N such as N = 2^32. There is an obvious choice for the function: we can take i mod N.

As a next example, let's build a hash function that maps a pair of integers (i, j) to a hash value in the range 0 .. (N – 1). We might first consider the hash function (i + j) mod N. However, this is a poor choice. For example, (0, 2), (1, 1), and (2, 0) will all have the same hash value, and it's easy to imagine that we will encounter all of these input values in practice.

A better choice of hash function would be (Ki + j) mod N for some constant K. However, some values of K will be better than others. For example, suppose that N = 2^32, and we choose the constant K = 2^30. Then (0, 0), (4, 0), and (8, 0) will all have the same hash value 0, since

(4 · 2^30) mod 2^32 = 2^32 mod 2^32 = 0

and

(8 · 2^30) mod 2^32 = 2^33 mod 2^32 = 0

In fact, with this hash function any pair (i, 0) will always have a hash value that is a multiple of 2^30, i.e. one of only 4 values (0, 2^30, 2^31, or 3 · 2^30) out of the 2^32 possible output values for the hash function in general!

In general, we will get the best results if K and N are relatively prime, i.e. they have no prime factors in common. Furthermore, we probably don't want K to be too small, to avoid hash collisions for small values of i and j. As one example, the prime number 1,000,003 could be a reasonable choice for K.
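Putting this together, a pair hash along these lines might look like the following sketch (the function name pair_hash is just for illustration):

N = 2 ** 32
K = 1_000_003

def pair_hash(i, j):
    # map a pair of integers to a value in the range 0 .. N - 1
    return (K * i + j) % N

With this choice, the pairs (0, 2), (1, 1), and (2, 0) from above now hash to three distinct values (2, 1000004, and 2000006 respectively).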

hashing strings

In practice, we will very often want a hash function that takes strings as input. Suppose that we want a hash function that takes strings of characters with ordinal values from 0 to 255 (i.e. with no fancy Unicode characters) and produces 32-bit hash values in the range 0 ≤ v < 2^32.

As one idea, we could add the ordinal values of all characters in the input string:

def hash(s):
    return sum([ord(c) for c in s])

This is a poor hash function. If the input strings are short, then the output values will always be small integers. Furthermore, two input strings that contain the same set of characters (a very common occurrence) will hash to the same number.
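For example, any two anagrams collide, since the sum of the ordinal values does not depend on the order of the characters:

>>> hash('listen')
655
>>> hash('silent')
655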

Here is one way to construct a better hash function, called modular hashing. Given any string s, consider it as a series of digits forming one large number H. For example, if our characters have ordinal values in the range from 0 to 255, we can imagine them to be digits in base 256. Then we can compute H mod N for some constant N, producing a hash value in the range from 0 to N – 1.

Here is Python code that implements this idea:

# Generate a hash code in the range 0 .. 2^32 – 1
# DO NOT USE – this hash function is poor (see text below)
def my_hash(s):
    h = 0
    for c in s:
        d = ord(c)    # 0 .. 255
        h = 256 * h + d
    return h % (2 ** 32)

As you can see, this code is using the algorithm for combining digits in any base that we learned in one of the very first lectures of this class.
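As a reminder, here is that digit-combining pattern on its own, using ordinary base-10 digits as an example:

digits = [2, 7, 6]    # the number 276, most significant digit first
h = 0
for d in digits:
    h = 10 * h + d
print(h)    # prints 276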

Unfortunately, this hash function is still poor. We can see that it often maps similar strings to the same hash value:

>>> my_hash('bright')
1768384628
>>> my_hash('light')
1768384628
>>> my_hash('night')
1768384628

The problem is that if we have a number H in base 256, then H mod 2^32 is exactly the last four digits of the number, because 2^32 = (2^8)^4 = 256^4. If that's not obvious, consider the same phenomenon in base 10: 2276345 mod 10000 = 6345, because 10000 = 10^4. And so this hash function only depends on the last four characters in the string.

More generally, if B is the base of our digits (e.g. B = 256 here), and N is the size of the output range (e.g. N = 2^32 here), then we will probably get the best hash behavior if B and N are relatively prime. So if we want a better hash function, we must change B or N. Assuming that we want to produce values in a certain given range, we'd like to keep N as it is. So let's change B. In fact it will probably be best if B is not too close to a power of two (for number-theoretic reasons that we won't go into here). A good choice for B might be the prime number 1,000,003 that we saw above. Let's modify our hash function to use it:

# Generate a hash code in the range 0 .. 2^32 - 1
def my_hash(s):
    h = 0
    for c in s:
        d = ord(c)    # 0 .. 255
        h = 1_000_003 * h + d
    return h % (2 ** 32)

Now we get distinct values for the strings we saw above:

>>> my_hash('bright')
2969542966
>>> my_hash('light')
1569733066
>>> my_hash('night')
326295660

To be clear, we are now considering the input string to be a series of digits in base 1,000,003! This also means that our input string can now reasonably contain any Unicode characters that we like.

A disadvantage of the function above is that it computes an integer h that will be huge if the input string is large, since it encodes all of the characters in the string. That may be inefficient, and in fact many programming languages don't support large integers of this sort.

However, we can make a tiny change to the code so that it will compute the same output values, but be far more efficient. Rather than taking the result mod 2^32 at the end, we can perform it at every step of the calculation:

# Generate a hash code in the range 0 .. 2^32 - 1
def my_hash(s):
    h = 0
    for c in s:
        d = ord(c)    # 0 .. 255
        h = (1_000_003 * h + d) % (2 ** 32)
    return h

This function computes the same hash values that the previous version did:

>>> my_hash('bright')
2969542966
>>> my_hash('light')
1569733066
>>> my_hash('night')
326295660

Why does this trick work? A useful mathematical fact is that if you're performing a series of additions and multiplications and you want the result mod N, you can actually perform a (mod N) operation at any step along the way, and you will still get the same result. I won't prove this statement here, but ultimately it is true because of this fact, which you might see in a number theory course: if a ≡ b (mod N) and c ≡ d (mod N), then a + c ≡ b + d (mod N) and ac ≡ bd (mod N). In any case, this is especially useful in lower-level languages such as C that have fixed-size integers, because arithmetic operations in those languages automatically compute the result mod N for some fixed N (where typically N = 2^32 or N = 2^64).
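If you'd like to convince yourself of this fact experimentally, here is a quick check on some arbitrary numbers (just an illustration, not a proof):

N = 2 ** 32
a, b, c = 123_456_789, 987_654_321, 555_555_555

x = (a * b + c) % N                            # reduce only at the end
y = ((a % N) * (b % N) % N + c % N) % N        # reduce after every operation

print(x == y)    # prints True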

hash tables

A hash table is a data structure used to implement either a set of keys, or a map from keys to values. A hash table is often more efficient than a binary tree (which, as we saw recently, can also be used to implement a set or map). Hash tables are simple, and do not require the complex rebalancing code that balanced binary trees need. For this reason, hash tables are very widely used, probably even more so than binary trees for storing arbitrary maps from keys to values.

One common method for implementing a hash table is chaining, in which the table contains an array of buckets. Each bucket contains a hash chain, which is a linked list of keys (or key/value pairs in the case of a map). For example, picture a small hash table with 4 buckets that stores a set of string keys: each key is hashed to one of the 4 buckets and appears in that bucket's linked list.

In some hash table implementations, the array of buckets has a fixed size. In others, it can expand dynamically. For the moment, we will assume that the number of buckets is a constant B.

A hash table requires a hash function h(k) that can map each key to a bucket number. Typically we choose h(k) = h1(k) mod B, where h1 is a preexisting hash function that maps keys to larger integers (such as the my_hash function we wrote above). If a key k is present in a hash table, it is always stored in the hash chain in bucket h(k). In other words, the hash function tells us which bucket a key belongs in.

With this structure, it is straightforward to implement the set operations contains(), add(), and remove(). contains(x) walks down the hash chain in bucket h(x), looking for a node with value x. If it finds such a node in the chain, it returns True, otherwise False. add(x) will first call contains(x) to check whether the value to be added is already present in the table. If it is not, it will prepend a new node with value x to the hash chain in bucket h(x). remove(x) will look for a node with value x in the hash chain in bucket h(x). If it finds such a node, it will delete it from the linked list.

Here is a partial implementation in Python of a hash table representing a set of objects. The remove() method is left as an exercise for you.

class Node:
    def __init__(self, key, next):
        self.key = key
        self.next = next

class HashSet:
    def __init__(self, num_buckets):
        # each array element is the head of a linked list of Nodes
        self.a = num_buckets * [None]

    def contains(self, x):
        i = hash(x) % len(self.a)     # hash bucket index
        p = self.a[i]
        while p is not None:          # walk down the hash chain
            if p.key == x:
                return True
            p = p.next
        return False

    def add(self, x):
        if not self.contains(x):
            i = hash(x) % len(self.a)
            self.a[i] = Node(x, self.a[i])   # prepend to hash chain
    

It is straightforward to extend this hash set implementation to store a map from keys to values: all we have to do is store a key/value pair in each node, just like when we use a binary search tree to store a map.
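Here is the set version in action, using Python's built-in hash() function to compute hash values, just as the class above does:

s = HashSet(8)
for word in ['apple', 'banana', 'apple', 'cherry']:
    s.add(word)

print(s.contains('apple'))     # prints True
print(s.contains('durian'))    # prints False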

We'd now like to consider the running time of hash table operations. Suppose that a hash table has N nodes in B buckets. Then its load factor α is defined as α = N / B. This is the average number of nodes per bucket, i.e. the average length of each hash chain.

Suppose that our hash function distributes keys evenly among buckets. Then any lookup in a hash table that misses (e.g. a contains() request for a key that is absent) will effectively be choosing a bucket at random. So it will examine α nodes on average as it walks the bucket's hash chain to look for the key. This shows that such lookups run in time O(α) on average, independent of N.

The analysis is a bit trickier for lookups that hit, since these are more likely to search a bucket that has a longer hash chain. Nevertheless it can also be shown that these run in time O(α) on average if the hash function distributes keys evenly. In fact all of our hash set operations (add, contains, remove) will run in O(α) on average.

So we can make hash table operations arbitrarily fast (on average) by keeping α small, i.e. by using as many hash buckets as needed. Of course, this supposes that we know in advance how many items we will be storing in a hash table, so that we can preallocate an appropriate number of buckets.

However, even if that number is not known, we can dynamically expand a hash table whenever its load factor grows above some fixed limit α0. To grow the table, we allocate a new bucket array, typically twice the size of the old one. Then we loop over all the nodes in the buckets in the old array, and insert them into the new array. We must recompute each key's hash value to find its position in the new array, which may not be the same as in the previous, smaller array.
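Here is a sketch of how this growth strategy might be added to the HashSet class above. (The names GrowingHashSet, MAX_LOAD and _grow are just illustrative; note that we now also need to keep a count of the stored keys.)

MAX_LOAD = 2.0    # the load factor limit α0 (an arbitrary choice)

class GrowingHashSet(HashSet):
    def __init__(self):
        super().__init__(8)     # start with a small bucket array
        self.n = 0              # number of keys currently stored

    def add(self, x):
        if not self.contains(x):
            i = hash(x) % len(self.a)
            self.a[i] = Node(x, self.a[i])   # prepend to hash chain
            self.n += 1
            if self.n / len(self.a) > MAX_LOAD:
                self._grow()

    def _grow(self):
        # allocate a bucket array twice as large, then rehash every key into it
        old = self.a
        self.a = (2 * len(old)) * [None]
        for head in old:
            p = head
            while p is not None:
                i = hash(p.key) % len(self.a)
                self.a[i] = Node(p.key, self.a[i])
                p = p.next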

Suppose that we start with an empty hash table and insert N values into it, doubling the number of buckets whenever the load factor exceeds some fixed value α0. Then how long will it take to insert the N values, as a function of N?

If we exclude the time for the doubling operations, then each insertion operation will run in O(1). That's because each insertion will run in O(α) (since it must traverse an existing hash chain to check whether the value being inserted is already present), and α will always be less than the constant α0.

Now let's consider the time spent growing the table. The time to perform each doubling operation is O(M), where M is the number of elements in the hash table at the moment we perform the doubling. That's because we must rehash M elements, and each rehash operation takes O(1) since we can compute a hash value and prepend an element to a linked list in constant time. If there are initially k buckets in the hash table, then the first doubling operation will double the size to 2k. Considering the end of the process, the last doubling operation will double the size of the hash table to roughly N. The second-to-last operation will double it to (N / 2), the operation before that will double it to (N / 4), and so on. The total doubling time will be

O(k) + O(2k) + … + O(N / 4) + O(N / 2) + O(N)

= O(1 + 2 + 4 + … + N / 4 + N / 2 + N)

= O(N)

So we can insert N elements in O(N), which means that insertion takes O(1) on average, even as the hash table grows arbitrarily large.