Lecture 7

Here are notes about topics we covered in lecture 7.

For more details about exceptions, see the Essential C# textbook or the C# reference pages.

For more details about hash tables, see Introduction to Algorithms, ch. 11 "Hash Tables", sections 11.1 – 11.3. (You can find many copies of this book in the library here in the MFF computer science building.)

throw

The throw statement throws an exception, which can be any object belonging to the System.Exception class or any of its subclasses. The exception will pass up the call stack, aborting the execution of any methods in progress until it is caught with a try...catch block at some point higher on the call stack. If the exception is not caught, the program will terminate.
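For example, a method might reject invalid input by throwing. This is a small sketch; the method name and message are made up for illustration:

```csharp
using System;

class Program {
    // Hypothetical example: compute a square root, rejecting negative input.
    static double SquareRoot(double x) {
        if (x < 0)
            throw new ArgumentException("x must be non-negative");
        return Math.Sqrt(x);
    }

    static void Main() {
        Console.WriteLine(SquareRoot(4.0));   // prints 2
        // SquareRoot(-1.0) would throw an ArgumentException; if no caller
        // catches it, the program terminates with an error message.
    }
}
```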

try

The try statement attempts to execute a block of code. It may have a set of catch clauses and/or a finally clause.

A catch clause catches all exceptions of a certain type. For example:

  static void Main() {
    StreamReader reader;
    try {
      reader = new StreamReader("numbers");
    } catch (FileNotFoundException e) {
      WriteLine("can't find input file: " + e.FileName);
      return;
    }
    ...   // use the reader here
  }

The code above will catch an exception of class FileNotFoundException, or of any subclass of it. When an exception is caught, the catch block (called an exception handler) executes. The catch block may itself rethrow the given exception, or even a different exception. If the catch block does not throw an exception, execution resumes below the try statement.
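For instance, a handler might rethrow a different exception than the one it caught. In this sketch, the file name and the wrapper exception type are chosen purely for illustration:

```csharp
using System;
using System.IO;

class Program {
    // Illustrative only: read a file, translating a missing-file error
    // into a different exception type for the caller.
    public static string ReadConfig(string path) {
        try {
            return File.ReadAllText(path);
        } catch (FileNotFoundException e) {
            // throw a different exception, keeping the original as InnerException
            throw new InvalidOperationException("no configuration at " + path, e);
        }
    }

    static void Main() {
        try {
            ReadConfig("no-such-file");
        } catch (InvalidOperationException e) {
            Console.WriteLine(e.Message);   // the handler runs here
        }
        // since the handler did not throw, execution resumes below the try
        Console.WriteLine("execution continues after the try statement");
    }
}
```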

A finally clause will always execute, even if an exception is thrown inside the body of the try statement. For example:

  StreamReader reader = new StreamReader("input");
  StreamWriter writer = new StreamWriter("output");
  try {
    while (reader.ReadLine() is string s)
      writer.WriteLine(transform(s));
  } finally {
    reader.Close();
    writer.Close();
  }

In this code, reader and writer will be closed even if an exception occurs within the try body (for example, inside the transform method). Note that a finally clause does not itself catch the exception, which will continue to pass up the call stack.

The preceding example is roughly equivalent to

  StreamReader reader = new StreamReader("input");
  StreamWriter writer = new StreamWriter("output");
  try {
    while (reader.ReadLine() is string s)
      writer.WriteLine(transform(s));
  } catch {
    reader.Close();
    writer.Close();
    throw;   // rethrow the current exception, preserving its stack trace
  }
  reader.Close();
  writer.Close();

(A bare "throw;" rethrows the current exception as-is; writing "throw e;" from a catch (Exception e) clause would reset the exception's stack trace. Also, because a catch block runs only when an exception is thrown, the Close calls must be repeated after the try statement to cover the normal case. The finally version avoids this duplication, which is one reason to prefer it.)

hash tables

A hash table is a data structure used to implement a map from keys to values. A hash table can often be more efficient than a binary tree, which can also be used to implement a key/value map (as we saw in Programming I). Hash tables are simple: unlike binary trees, they do not generally require complex rebalancing code to stay efficient. For this reason, hash tables are very widely used, probably even more widely than binary trees, for storing arbitrary maps from keys to values.

In these notes we will show how to build a hash table from strings (keys) to integers (values). (As with other data structures, it is not difficult to generalize this structure to map any sort of key to any sort of value.)

The most common method for implementing a hash table is chaining, in which the table contains an array of buckets. Each bucket is a linked list of key/value pairs. So we can represent a hash table in C# like this:

class Node {
  string key;   // the key
  int val;      // the value associated with the key
  Node next;    // next node in the same bucket's chain
}

class HashTable {
  Node[] a;   // each array element is the head of a linked list
}

In some hash table implementations, the array of buckets has a fixed size. In others, it can expand dynamically. Here, we will assume that the array size is a constant M.

A hash table requires a hash function h(k) that maps each key to an integer in the range 0 .. (M – 1). If a key/value pair (k, v) is present in a hash table, it is always stored in the hash chain a[h(k)]. In other words, the hash function tells us which bucket a key belongs in.

With the hash function h, here is pseudocode implementing the functions to get and set values in a hash table:

int get(string key) {
  int b = h(key);  // call hash function to get bucket number
  look in the hash chain a[b] for a node N whose key equals 'key'
  if found, return N.val
  else throw exception "key not found"
}

void set(string key, int val) {
  int b = h(key);  // call hash function to get bucket number
  look in the hash chain a[b] for a node N whose key equals 'key'
  if found, set N.val = val
  else prepend a new node (key, val) to the hash chain a[b]
}
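The pseudocode might be fleshed out in C# as follows. This is a sketch: the particular hash function, the choice M = 97, and the use of the standard KeyNotFoundException for the missing-key case are all assumptions of this example:

```csharp
using System;
using System.Collections.Generic;   // for KeyNotFoundException

class Node {
    public string key;
    public int val;
    public Node next;
}

class HashTable {
    const int M = 97;               // number of buckets (fixed, as in the notes)
    Node[] a = new Node[M];         // each element is the head of a linked list

    // One possible hash function, mapping a string to 0 .. M - 1.
    static int H(string s) {
        int h = 0;
        foreach (char c in s)
            h = (h * 65536 + c) % M;
        return h;
    }

    public int Get(string key) {
        // walk the hash chain for this key's bucket
        for (Node n = a[H(key)]; n != null; n = n.next)
            if (n.key == key)
                return n.val;
        throw new KeyNotFoundException("key not found: " + key);
    }

    public void Set(string key, int val) {
        int b = H(key);
        for (Node n = a[b]; n != null; n = n.next)
            if (n.key == key) { n.val = val; return; }
        a[b] = new Node { key = key, val = val, next = a[b] };   // prepend
    }
}
```

With this sketch, t.Set("one", 1) followed by t.Get("one") returns 1, setting an existing key overwrites its value, and getting an absent key throws.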

There are many different hash functions for mapping strings (or other kinds of data) to integers. These functions vary in their robustness and efficiency. In general, a good hash function will distribute keys roughly equally among buckets, and will almost always choose different buckets even for two keys that are similar.

Here is a terrible hash function for strings. Again, M is the number of buckets:

  int hash(string s) => s[0] % M;

This hash function is poor because it uses only the first character of the string, so all strings with the same first character will end up in the same bucket. In some applications many or even all strings begin with the same first character, so this function may fail badly to distribute strings evenly among buckets.

Here is one way to construct a better hash function. Given any string s, consider it as a series of digits in base 65,536 (= 2^16) forming one large number N. Now compute N mod M. This is not difficult: if we compute a large number N using additions and multiplications, taking the result mod M at each step along the way, the final result will be exactly N mod M:

  int hash(string s) {  // return a value from 0 .. M - 1
    int h = 0;
    foreach (char c in s)
      h = (h * 65536 + c) % M;
    return h;
  }

This hash function will still be terrible for certain values of M, in particular if M is a power of 2. For example, suppose that M = 65536. Then if we interpret a string s as a large number N, (N mod M) is just the character code of the string's last character! Or if M = 256, then (N mod M) will be the lower eight bits of that character's code. So once again the hash function will depend on only a single character in the string.
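This is easy to check directly. In the sketch below, the hash function above takes M as a parameter so we can compare a power of 2 with a prime:

```csharp
using System;

class Program {
    // Same hash function as above, with M as a parameter.
    public static int Hash(string s, int M) {
        int h = 0;
        foreach (char c in s)
            h = (h * 65536 + c) % M;
        return h;
    }

    static void Main() {
        // With M = 65536, only the last character matters:
        Console.WriteLine(Hash("cat", 65536) == 't');    // True
        Console.WriteLine(Hash("boat", 65536) == 't');   // True
        // With a prime M, the two strings need not collide:
        Console.WriteLine(Hash("cat", 97));
        Console.WriteLine(Hash("boat", 97));
    }
}
```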

The function above will generally behave well for prime values of M that are not too close to a power of 2. (Investigating the reasons for that would require a bit of number theory and more time than we have in this course, however.)

Suppose that a hash table has N key/value pairs in M buckets. Then its load factor α is defined as α = N / M. This is the average number of key/value pairs per bucket.

Suppose that our hash function distributes keys evenly among buckets. Then any lookup in a hash table that misses (either a get request for a key that is absent, or setting the value of a new key) will effectively be choosing a bucket at random. So it will examine α nodes on average as it walks the bucket's hash chain looking for the key. This shows that such lookups run in time O(α), independent of N.

The analysis is a bit trickier for lookups that hit, since these are more likely to search a bucket that has a longer hash chain. Nevertheless, it can be shown that these also run in time O(α) (see, for instance, Introduction to Algorithms, section 11.2).

So this shows that we can make hash table lookups arbitrarily fast (on average) by keeping α small, i.e. by using as many hash buckets as needed. Of course, this presupposes that we know in advance how many items we will be storing in a hash table, so that we can preallocate an appropriate number of buckets.

However, even if that number is not known, we can dynamically grow a hash table by expanding its bucket array when the load factor grows too large, just as we did for dynamic arrays. If we do this and double the bucket array size on each expansion, then the time to add N elements to the hash table will still be O(α N) if we keep the load factor under α, i.e. O(α) on average for each insertion. Because we can choose α to be a constant that is as small as we like, people often write that hash table insertions or deletions are "O(1) on average".
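As a sketch of such a dynamically growing table: the field names, the load-factor bound, and the list of prime bucket sizes below are all choices made for this example. On each expansion, every pair is rehashed into the new, larger bucket array:

```csharp
using System;

class HashTable {
    class Node {
        public string key;
        public int val;
        public Node next;
        public Node(string key, int val, Node next) {
            this.key = key; this.val = val; this.next = next;
        }
    }

    // Prime bucket counts, each roughly double the last (since, as noted
    // above, this hash function behaves badly when M is a power of 2).
    static readonly int[] Sizes = { 5, 11, 23, 47, 97, 197, 397, 797, 1597, 3203 };
    const double MaxLoad = 2.0;     // expand when N / M would exceed this

    int sizeIndex = 0;
    Node[] a = new Node[5];         // bucket array, initially Sizes[0] buckets
    int count = 0;                  // N, the number of key/value pairs

    int H(string s) {               // hash into 0 .. a.Length - 1
        int h = 0;
        foreach (char c in s)
            h = (h * 65536 + c) % a.Length;
        return h;
    }

    public bool TryGet(string key, out int val) {
        for (Node n = a[H(key)]; n != null; n = n.next)
            if (n.key == key) { val = n.val; return true; }
        val = 0;
        return false;
    }

    public void Set(string key, int val) {
        for (Node n = a[H(key)]; n != null; n = n.next)
            if (n.key == key) { n.val = val; return; }
        if (count + 1 > MaxLoad * a.Length && sizeIndex + 1 < Sizes.Length)
            Expand();
        int b = H(key);             // recompute: a may have grown
        a[b] = new Node(key, val, a[b]);
        count++;
    }

    // Grow the bucket array to the next prime size and rehash every pair.
    void Expand() {
        Node[] old = a;
        a = new Node[Sizes[++sizeIndex]];
        foreach (Node chain in old)
            for (Node n = chain; n != null; n = n.next) {
                int b = H(n.key);   // bucket in the new, larger array
                a[b] = new Node(n.key, n.val, a[b]);
            }
    }
}
```

Each expansion rehashes all N pairs, but because the bucket counts roughly double each time, the total rehashing work over N insertions is O(N), so each insertion remains O(1) on average.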