Introduction to Algorithms, 2020-1
Week 9: Notes

Some of this week's topics are covered in Problem Solving with Algorithms:

And in Introduction to Algorithms:

Here are some additional notes.

sets

We have already seen stacks and queues, two abstract data types. We've seen that it's possible to implement these abstract types using various concrete data structures. For example, we can implement a stack or queue either using an array or a linked list.

We'll now introduce another abstract data type. A set provides the following operations:

s.add(value)
Add a value to a set.
s.remove(value)
Remove a value from a set.
s.contains(value)
Test whether a value is present in a set, returning True or False.

A set cannot contain the same value twice: every value is either present in the set or it is not.

This type should be familiar, because Python's built-in set class is an implementation of it that we have already seen and used.

In this algorithms course, we will study various ways to implement sets. We'd like to understand how we could build efficient sets in Python if they were not already provided in the standard library.
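As a reminder, here is how the three operations above look with Python's built-in set class (whose remove method raises KeyError if the value is absent):

```python
s = set()
s.add(3)
s.add(5)
s.add(3)          # adding a value already present has no effect
print(3 in s)     # membership test, like contains: True
print(len(s))     # 2, since 3 was only stored once
s.remove(5)       # delete 5; would raise KeyError if 5 were absent
print(5 in s)     # False
```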

binary search trees

A binary search tree is a binary tree in which the values are ordered in a particular way that makes searching easy: for any node N with value v, every value in N's left subtree is less than v, and every value in N's right subtree is greater than v.

Here is a binary search tree of integers:

[diagram: a binary search tree of integers]

We can use a binary search tree to store a set supporting the add, remove, and contains operations that we described above. To do this, we'll write a TreeSet class that holds the current root of a binary tree:

class TreeSet:
  def __init__(self):
    self.root = None
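The methods below create and follow node objects via calls like Node(x, None, None). The notes don't show this class, but a minimal version consistent with those calls (an assumption, not part of the original notes) would be:

```python
class Node:
  def __init__(self, val, left, right):
    self.val = val      # the value stored at this node
    self.left = left    # left child (a Node, or None)
    self.right = right  # right child (a Node, or None)
```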

contains

It is not difficult to find whether a binary tree contains a given value k. We begin at the root. If the root's value is k, then we are done. Otherwise, we compare k to the root's value v. If k < v, we move to the left child; if k > v, we move to the right child. We proceed in this way until we have found k or until we hit None, in which case k is not in the tree. Here's how we can implement this in the TreeSet class:

  def contains(self, x):
    n = self.root
    while n is not None:
      if x == n.val:
        return True
      if x < n.val:
        n = n.left
      else:
        n = n.right
    return False

add

Inserting a value into a binary search tree is also pretty straightforward. Beginning at the root, we look for an insertion position, proceeding down the tree just as in the above algorithm for contains. When we reach an empty left or right child, we create a node there. In the TreeSet class:

  # add a value, or do nothing if already present
  def add(self, x):
    n = self.root
    if not n:
      self.root = Node(x, None, None)
      return
      
    while n.val != x:
      if x < n.val:
        if n.left:
          n = n.left
        else:
          n.left = Node(x, None, None)
          break
      elif x > n.val:
        if n.right:
          n = n.right
        else:
          n.right = Node(x, None, None)
          break
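Pulling the pieces above into one runnable file (including the Node class, which is an assumption since the notes don't define it), we can check that add and contains behave like a set:

```python
class Node:
  def __init__(self, val, left, right):
    self.val, self.left, self.right = val, left, right

class TreeSet:
  def __init__(self):
    self.root = None

  def contains(self, x):
    n = self.root
    while n is not None:
      if x == n.val:
        return True
      n = n.left if x < n.val else n.right
    return False

  # add a value, or do nothing if already present
  def add(self, x):
    if not self.root:
      self.root = Node(x, None, None)
      return
    n = self.root
    while n.val != x:
      if x < n.val:
        if n.left:
          n = n.left
        else:
          n.left = Node(x, None, None)
          break
      else:
        if n.right:
          n = n.right
        else:
          n.right = Node(x, None, None)
          break

t = TreeSet()
for v in [10, 5, 20, 15, 18]:
  t.add(v)
print(t.contains(15))   # True
print(t.contains(7))    # False
t.add(15)               # duplicate: no effect
```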

remove

Deleting a value from a binary search tree is a bit trickier. It's not hard to find the node to delete: we just walk down the tree, just like when searching or inserting. Once we've found the node N we want to delete, there are several cases.

  1. If N is a leaf (it has no children), we can just remove it from the tree.

  2. If N has only a single child, we replace N with its child. For example, we can delete node 15 in the binary tree above by replacing it with 18.

  3. If N has two children, then we will replace its value with the next highest value in the tree. To do this, we start at N's right child and follow left child pointers for as long as we can. This will take us to the smallest node in N's right subtree, which must hold the next highest value in the tree after N's. Call this node M. We can easily remove M from the right subtree: M has no left child, so we can remove it following either case 1 or case 2 above. Now we set N's value to the value that M had.

    As a concrete example, suppose that we want to delete the root node (with value 10) in the tree above. This node has two children. We start at its right child (20) and follow its left child pointer to 15. That's as far as we can go in following left child pointers, since 15 has no left child. So now we remove 15 (following case 2 above), and then replace 10 with 15 at the root.

We won't give an implementation of this operation here, but writing this yourself is an excellent (and somewhat challenging) exercise.

running time of binary search tree operations

It is not difficult to see that the add, remove and contains operations described above will all run in time O(h), where h is the height of a binary search tree. What is their running time as a function of N, the number of nodes in the tree?

First consider a complete binary search tree. As we saw above, if the tree has N nodes then its height is h = log2(N + 1) – 1 ≈ log2(N) – 1 = O(log N). So add, remove, and contains will all run in time O(log N).
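We can sanity-check this height formula numerically: a complete binary tree of height h has 2^(h + 1) − 1 nodes, and solving N = 2^(h + 1) − 1 for h gives exactly log2(N + 1) − 1:

```python
import math

# a complete tree of height h has N = 2^(h+1) - 1 nodes;
# check that log2(N + 1) - 1 recovers h exactly
for h in range(5):
  N = 2 ** (h + 1) - 1
  assert math.log2(N + 1) - 1 == h
  print(N, h)   # e.g. 1 0, 3 1, 7 2, 15 3, 31 4
```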

Even if a tree is not complete, these operations will run in O(log N) time if the tree is not too tall given its number of nodes N – specifically, if its height is O(log N). We call such a tree balanced.

Unfortunately not all binary trees are balanced. Suppose that we insert values into a binary search tree in ascending order:

t = TreeSet()
for i in range(1, 1001):
  t.add(i)

The tree will look like this:

[diagram: the resulting tree, a chain in which each node's right child holds the next larger value]

This tree is completely unbalanced. It basically looks like a linked list with an extra None pointer at every node. add, remove and contains will all run in O(N) on this tree.

How can we avoid an unbalanced tree such as this one? There are two possible ways. First, if we insert values into a binary search tree in a random order, then the tree will almost certainly be balanced. We will not prove this fact here (you might see a proof in the Algorithms and Data Structures class next semester).

Unfortunately it is not always practical to insert in a random order – for example, we may be reading a stream of values from a network and may need to insert each value as we receive it. So alternatively we can use a more advanced data structure known as a self-balancing binary tree, which automatically balances itself as values are inserted. Two examples of such structures are red-black trees and AVL trees. We will not study these in this course, but you will see them in Algorithms and Data Structures next semester. For now, you should just be aware that they exist.

hash functions

A hash function maps values of some type T to integers in a fixed range. Hash functions are commonly used in programming and in computer science in general.

We can construct hash functions in various ways. A good hash function will produce values that are roughly uniformly distributed among its output range in practice, even when input values are similar to each other.

In general, there may be many more possible values of T than integers in the output range. This means that hash functions will inevitably map some distinct input values to the same output value; this is called a hash collision. An ideal hash function will produce hash collisions in practice no more often than would be expected if it were producing random outputs.

Suppose that we want a hash function that takes strings of ASCII characters (i.e. with ordinal values from 0 to 127) and produces 32-bit hash values in the range 0 ≤ v < 2^32. Here is a terrible hash function for that purpose:

  def hash(s):
      return ord(s[0])

This hash function is poor because it uses only the first character of the string, so all strings with the same first character will end up with the same hash value. Furthermore, the function will produce only a small subset of the possible values in the output range: since each character's ordinal value is less than 128, only 128 of the 2^32 possible hash values can ever occur.
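For instance, any two strings that share a first character collide (the function is renamed first_char_hash here to avoid shadowing Python's built-in hash):

```python
def first_char_hash(s):
  return ord(s[0])

# both strings start with 'a', so they collide
print(first_char_hash("apple"))    # 97
print(first_char_hash("avocado"))  # 97
```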

As another idea, we could add the ordinal values of all characters in the input string:

  def hash(s):
      return sum(ord(c) for c in s)

This is also a poor hash function. If the input strings are short, then the output values will always be small integers. Furthermore, two input strings that contain the same set of characters (a common occurrence) will hash to the same number.
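For example, any two anagrams collide under this function, since addition is order-independent (renamed sum_hash here to avoid shadowing the built-in):

```python
def sum_hash(s):
  return sum(ord(c) for c in s)

# "listen" and "silent" contain the same characters, so they collide
print(sum_hash("listen"), sum_hash("silent"))
```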

Here is one way to construct a better hash function, called modular hashing. Given any string s, consider it as a series of digits forming one large number N. For example, if the string contains ASCII characters, then these will be digits in base 128. Now we compute N mod M for some constant M, producing a hash value in the range from 0 to M – 1. In the computation, it is best to take the result mod M as we process each digit. That produces the same result as if we performed a single modulus operation at the end, but is more efficient because we avoid generating large integers:

  # Given a string of characters whose ordinal values range from 0 to (D – 1),
  # produce a hash value in the range 0 .. M – 1.
  def hash(s):
    h = 0
    for c in s:
      h = (h * D + ord(c)) % M
    return h 

(Recall that in an earlier lecture we saw how to combine digits of a number in any base. We are performing the same operation here, only modulo M.)
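We can also verify the claim that reducing mod M at every step gives the same result as building the full number N and taking a single modulus at the end. The values D = 128 (for ASCII) and M = 10**9 + 7 (a well-known prime) below are illustrative choices, not prescribed by the notes:

```python
D = 128           # base: one digit per ASCII character
M = 10**9 + 7     # example modulus (a well-known prime)

def mod_hash(s):
  # take the modulus as each digit is processed
  h = 0
  for c in s:
    h = (h * D + ord(c)) % M
  return h

def big_number(s):
  # interpret s as one large number N in base D, with no modulus
  n = 0
  for c in s:
    n = n * D + ord(c)
  return n

for s in ["", "a", "hello", "modular hashing"]:
  assert mod_hash(s) == big_number(s) % M
print("iterative mod matches a single final mod")
```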

This hash function will still be terrible if M is a multiple of D. For example, if we let D = 256 (since we have a string of 8-bit characters) and M = 2^32, then the hash code's output will only depend on the last 4 characters in each string! That's because 2^32 = (2^8)^4 = 256^4, so if we write any number N in base 256, then (N mod 2^32) contains only the last four digits. (It may be easier to see this in decimal: if N is a number written in base 10, then (N mod 10^4) contains only the last four digits of the number.) If D = 128, a similar phenomenon will occur, since 2^32 is a multiple of 128 as well.

If we don't want to change M, then we need to choose a value of D that yields good hashing behavior, i.e. produces values with a roughly uniform distribution. D should be greater than or equal to the largest ordinal value of any character in the input string. If M is a power of two, then a prime value of D may work well, but it is probably best to avoid primes such as 257 that are close to a power of two. I recommend using D = 1,000,003, a prime number that has worked well in my own experiments (and is actually used in Python's own hash implementation). A further investigation of which other values of D would work well would require a bit of number theory and possibly some experimentation.

Alternatively we can keep D equal to a power of two (e.g. 2^16 for 16-bit characters) and choose a different value of M. In this case it is probably safest if M is prime and not too close to a power of 2.