Week 10: Notes

lower bounds for sorting

We have studied 6 different sorting algorithms in this class! Three of them (bubble sort, selection sort, insertion sort) run in O(N^2) time in the worst case, and three of them (merge sort, quick sort, heap sort) typically run in O(N log N) time. All six of these algorithms are comparison-based, i.e. they work by making a series of comparisons between values in the input array.

We will now ask: is there any sorting algorithm that can sort N values in time better than O(N log N) in the worst case? We will show that for any comparison-based sorting algorithm, the answer is no: O(N log N) is the fastest time that is possible.

Any comparison-based sorting algorithm makes a series of comparisons, which we may represent in a decision tree. Once it has performed enough comparisons, it knows how the input values are ordered.

For example, consider an input array with three elements a, b, and c. One possible comparison-based algorithm might compare the input elements following this decision tree:

The algorithm first compares a and b. If a < b, it then compares a and c. If c < a, then the algorithm now knows a complete ordering of the input elements: it is c < a < b. And presumably as the algorithm runs it will rearrange the array elements into this order.

This algorithm will perform 2 comparisons in the best case. In the worst case, it will perform 3 comparisons, since the height of the tree above is 3.
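
To make this concrete, here is a sketch in Python of a sort for three elements that follows a decision tree of this shape. The exact branch order is an assumption, since the original figure is not reproduced here; in any case, some leaves are reached after 2 comparisons and others after 3:

def sort3(a, b, c):
    if a < b:                # first comparison: a vs. b
        if c < a:            # 2 comparisons: c < a < b
            return [c, a, b]
        elif c < b:          # 3 comparisons: a <= c < b
            return [a, c, b]
        else:                # 3 comparisons: a < b <= c
            return [a, b, c]
    else:
        if c < b:            # 2 comparisons: c < b <= a
            return [c, b, a]
        elif c < a:          # 3 comparisons: b <= c < a
            return [b, c, a]
        else:                # 3 comparisons: b <= a <= c
            return [b, a, c]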

Might it be possible to write a comparison-based algorithm that determines the ordering of 3 elements using only 2 comparisons in the worst case? The answer is no. With only 2 comparisons, the height of the decision tree will be 2, which means that it will have at most 2^2 = 4 leaves. However there are 3! = 6 possible permutations of 3 elements, so a decision tree with height 2 is not sufficient to distinguish them all.

More generally, suppose that we are sorting an array with N elements. There are N! possible orderings of these elements. If we want to distinguish all these orderings with k comparisons, then our decision tree will have at most 2^k leaves, and so we must have

2^k ≥ N!

or

k ≥ log2(N!)
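
We can check this bound numerically. Here is a small sketch using Python's math module that prints the minimum worst-case number of comparisons, ceil(log2(N!)), for a few values of N; for N = 3 it prints 3, matching the argument above:

import math

for n in [3, 10, 100]:
    # fewest comparisons that can distinguish all n! orderings
    print(n, math.ceil(math.log2(math.factorial(n))))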

We may approximate N! using Stirling's formula, which says that

N! ~ sqrt(2 π N) (N / e)^N

And so we have

k ≥ log2(N!)

= log2(sqrt(2 π N) (N / e)^N)

= log2(sqrt(2 π N)) + log2((N / e)^N)

= 0.5 (log2(2) + log2(π) + log2(N)) + N (log2(N) – log2(e))

= Θ(N log N)
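
As a sanity check, this sketch compares the exact value of log2(N!) with the Stirling-based expression above; the two agree closely even for modest N:

import math

for n in [10, 100]:
    exact = math.log2(math.factorial(n))
    stirling = (0.5 * (math.log2(2) + math.log2(math.pi) + math.log2(n))
                + n * (math.log2(n) - math.log2(math.e)))
    print(n, round(exact, 2), round(stirling, 2))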

This shows that any comparison-based sorting algorithm must perform Ω(N log N) comparisons in the worst case, and thus its worst-case running time is at least proportional to N log N. In other words, no comparison-based sort can beat O(N log N).

representing graphs

A graph consists of a set of vertices (sometimes called nodes in computer science) and edges. Each edge connects two vertices. In an undirected graph, edges have no direction: two vertices are either connected by an edge or they are not. In a directed graph, each edge has a direction: it points from one vertex to another. A graph is dense if it has many edges relative to its number of vertices, and sparse if it has few such edges. Some graphs are weighted, in which case every edge has a number called its weight.

Graphs are fundamental in computer science. Many problems can be expressed in terms of graphs, and we can answer many interesting questions using graph algorithms.

Often we will assume that each vertex in a graph has a unique id. Sometimes ids are integers; in a graph with V vertices, typically these integers will range from 0 through (V - 1). Here is one such undirected graph:

Alternatively, we may sometimes assume that vertex ids are strings, or belong to some other data type.

How may we store a graph with V vertices in memory? As one possibility, we can use adjacency-matrix representation. Assuming that each vertex has an integer id, the graph will be stored as a two-dimensional array g of booleans, with dimensions V x V. In Python, g will be a list of lists. In this representation, g[i][j] is true if and only if there is an edge from vertex i to vertex j.

If the graph is undirected, then the adjacency matrix will be symmetric: g[i][j] = g[j][i] for all i and j.

Here is an adjacency-matrix representation of the undirected graph above. Notice that it is symmetric:

graph = [
    [False, True,  True,  False, False, True,  True],
    [True,  False, False, False, False, False, False],
    [True,  False, False, False, False, False, False],
    [False, False, False, False, True,  True,  False],
    [False, False, False, True,  False, True,  True],
    [True,  False, False, True,  True,  False, False],
    [True,  False, False, False, True,  False, False]
]

Also notice that all the entries along the main diagonal are False. That's because the graph has no self-edges, i.e. edges from a vertex to itself.
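
Here is a minimal sketch of building such a matrix; empty_graph and add_edge are hypothetical helper names, not part of the course code:

def empty_graph(V):
    # V x V boolean matrix with no edges
    return [[False] * V for _ in range(V)]

def add_edge(g, i, j):
    # adding an undirected edge keeps the matrix symmetric
    g[i][j] = True
    g[j][i] = True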

Alternatively, we can use adjacency-list representation, in which for each vertex we store an adjacency list, i.e. a list of its adjacent vertices. Assuming that vertex ids are integers, then in Python this representation will generally take the form of a list of lists of integers. For a graph g, g[i] is the adjacency list of i, i.e. a list of ids of vertices adjacent to i. (If the graph is directed, then more specifically g[i] is a list of vertices to which there are edges leading from i.)

When we store an undirected graph in adjacency-list representation, then if there is an edge from u to v we store u in v’s adjacency list and also store v in u’s adjacency list.
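
For example, here is a hypothetical sketch of building such a representation for V vertices with integer ids:

V = 7                            # number of vertices, as in the example above
g = [[] for _ in range(V)]       # an empty adjacency list for each vertex

def add_edge(u, v):
    # an undirected edge appears in both adjacency lists
    g[u].append(v)
    g[v].append(u)

add_edge(0, 1)                   # now 1 is in g[0] and 0 is in g[1]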

Here is an adjacency-list representation of the undirected graph above:

graph = [
    [1, 2, 5, 6],    # neighbors of 0
    [0],             # neighbors of 1
    [0],             # …
    [4, 5],
    [3, 5, 6],
    [0, 3, 4],
    [0, 4]
]
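
Since both representations encode the same information, we can convert between them. Here is a sketch of a hypothetical helper that turns an adjacency matrix into adjacency lists; applied to the matrix shown earlier, it produces exactly the list of lists above:

def matrix_to_lists(m):
    V = len(m)
    # vertex j appears in i's list exactly when m[i][j] is True
    return [[j for j in range(V) if m[i][j]] for i in range(V)]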

Now suppose that vertex ids are not integers – for example, they may be strings, as in this graph:

Then in Python the adjacency-list representation will be a dictionary of lists. For example, the graph above has this representation:

graph = {
    'c': ['f', 'q', 'x'],
    'f': ['c', 'p', 'q', 'w'],
    'p': ['f'],
    'q': ['c', 'f', 'w'],
    'w': ['f', 'q', 'x'],
    'x': ['c', 'w']
}

Suppose that we have a graph with integer ids. Which representation should we use? Adjacency-matrix representation lets us check in O(1) time whether two vertices are adjacent, but it uses O(V^2) space, and enumerating a given vertex's edges takes O(V) time. On the other hand, adjacency-list representation is more compact for sparse graphs. With this representation, if a vertex has e edges then we can enumerate them in time O(e), but determining whether two vertices are adjacent may also take time O(e), since we may need to scan one vertex's entire adjacency list. Thus, the running time of some algorithms may differ depending on which representation we use.
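
As a small sketch of this tradeoff, here are two hypothetical helpers, one per representation:

def adjacent_matrix(g, i, j):
    return g[i][j]       # O(1): a single lookup

def adjacent_list(g, i, j):
    return j in g[i]     # O(e): may scan all of i's adjacency list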

In this course we will usually represent graphs using adjacency-list representation.

depth-first search

In this course we will study two fundamental algorithms that can search a directed or undirected graph, i.e. explore it by visiting its nodes in some order.

The first of these algorithms is depth-first search, which is something like exploring a maze by walking through it. Informally, it works like this. We start at any vertex, and begin walking along edges. As we do so, we remember all vertices that we have ever visited. We will never follow an edge that leads to an already visited vertex. Each time we hit a dead end, we walk back to the previous vertex, and try the next edge from that vertex (as always, ignoring edges that lead to visited vertices). If we have already explored all edges from that vertex, we walk back to the vertex before that, and so on.

For example, consider this undirected graph that we saw above:

Suppose that we perform a depth-first search of this graph, starting with vertex 0. We might first walk to vertex 1, which is a dead end. Then we backtrack to 0 and walk to vertex 2, which is also a dead end. Now we backtrack to 0 again and walk to vertex 5. From there we might proceed to vertex 3, then 4. From 4 we cannot walk to 5, since we have been there before. So we walk to vertex 6. From 6 we can't proceed to 0, since we have been there before. So we backtrack to 4, then back to 3, 5, and then 0. The depth-first search is complete, since there are no more edges that lead to unvisited vertices.

As a depth-first search runs, we may imagine that at each moment each of its vertices is in one of three possible states:

- undiscovered: the search has not yet reached the vertex
- frontier: the search has reached the vertex, but has not yet finished exploring the edges leading from it
- explored: the search has followed all of the vertex's edges and is finished with it

Often when we draw pictures of a depth-first search (and also of some other graph algorithms), we draw undiscovered vertices in white, frontier vertices in gray and explored vertices in black.

For example, here is a picture of a depth-first search in progress on the graph above. The search started at vertex 0, went to vertex 1, then backtracked back to 0. Then it went to vertex 2 and backtracked back to 0 again. Then it proceeded to vertex 5, and has just arrived at vertex 3:

Here's a larger graph, with vertices representing countries in Europe. The graph contains edges between countries that either border each other or are reachable from each other by a direct ferry (in which case the edges are depicted as dashed lines):


Here's a picture of a depth-first search in progress on this graph. The search started with Austria, and has now reached Italy:


implementing a depth-first search

It is easy to implement a depth-first search using a recursive function. However, we must avoid looping forever around a cycle in the graph. To accomplish this, as described above, when we first visit a vertex we mark it as visited. Whenever we follow an edge, if it leads to a visited vertex then we ignore it.

If a graph has vertices numbered from 0 to (V – 1), we may efficiently represent the visited set using an array of V boolean values. Each value in this array is True if the corresponding vertex has been visited.

Here is Python code that implements a depth-first search using a nested recursive function. It takes a graph g in adjacency-list representation, plus a vertex id 'start' where the search should begin. It assumes that vertices are numbered from 0 to (V – 1), where V is the total number of vertices.

# recursive depth-first search over graph with integer ids
def dfs(g, start):
    visited = [False] * len(g)

    def visit(v):
        print('visiting', v)
        visited[v] = True
        for w in g[v]:           # for all neighbors of vertex v
            if not visited[w]:
                visit(w)

    visit(start)
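
For example, running dfs on the adjacency-list graph from earlier in these notes, starting at vertex 0, reproduces the traversal we traced by hand:

>>> dfs(graph, 0)
visiting 0
visiting 1
visiting 2
visiting 5
visiting 3
visiting 4
visiting 6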

This function merely prints out the visited nodes. But we can modify it to perform useful tasks. For example, to determine whether a graph is connected, we can run a depth-first search and then check whether the search visited every node; see the sketch below. As you will see in more advanced courses, a depth-first search is also a useful building block for other graph algorithms: determining whether a graph is cyclic, topological sorting, discovering strongly connected components and so on.
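
Here is a sketch of that connectivity check for an undirected graph with integer ids, assuming the graph has at least one vertex:

def connected(g):
    visited = [False] * len(g)

    def visit(v):
        visited[v] = True
        for w in g[v]:
            if not visited[w]:
                visit(w)

    visit(0)                 # search from an arbitrary start vertex
    return all(visited)      # connected iff the search reached every vertex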

This depth-first search does a constant amount of work for each vertex (making a recursive call) and for each edge (following the edge and checking whether the vertex it points to has been visited). So it runs in time O(V + E), where V and E are the numbers of vertices and edges in the graph.

As another example, let's run a depth-first search on the undirected graph of Europe that we saw above. Here is a file 'europe' containing data that we can use to build the graph. The file begins like this:

austria: germany czech slovakia hungary slovenia italy
belgium: netherlands germany luxembourg france
bulgaria: romania greece
…

And here is a function that can read this file to build an adjacency-list representation in memory:

# read graph in adjacency-list representation
def read_graph(file):
    g = {}

    def add(v, w):
        # record an edge from v to w, creating v's adjacency list if needed
        if v not in g:
            g[v] = []
        g[v].append(w)

    with open(file) as f:
        for line in f:
            source, dest = line.split(':')
            if source not in g:
                g[source] = []    # ensure source appears even with no neighbors
            for c in dest.split():
                add(source, c)    # add the edge in both directions,
                add(c, source)    # since the graph is undirected
    return g

Here is a function that will perform the depth-first search. Since vertex ids are strings, we can't represent the visited set using an array of booleans as we did above; instead, we use a Python set.

# recursive depth-first search over graph with arbitrary ids
def dfs(g, start):
    visited = set()

    def visit(v, depth):
        print('  ' * depth + v)
        visited.add(v)
        for w in g[v]:
            if w not in visited:
                visit(w, depth + 1)
    
    visit(start, 0)

The recursive helper function visit() uses an extra parameter 'depth' to keep track of the search depth. When it prints the id of each visited vertex, it indents the output by 2 spaces for each level of depth. Let's run the search:

>>> europe = read_graph('europe')
>>> dfs(europe, 'ireland')
ireland
  uk
    france
      belgium
        netherlands
          germany
            austria
              czech
                poland
                  lithuania
                    latvia
                      estonia
                        finland
                          sweden
                            denmark
                  slovakia
                    hungary
                      croatia
                        slovenia
                          italy
                            greece
                              bulgaria
                                romania
                            malta
            luxembourg
      spain
        portugal
>>> 

The indentation reveals the recursive call structure. We see that from the start vertex 'ireland' we recurse all the way down to 'denmark', then backtrack to 'poland' and begin exploring through 'slovakia', and so on.

Even with these changes to work with arbitrary vertex ids, our depth-first search code still runs in O(V + E) time, since dictionary and set operations in Python (e.g. looking up an element in a dictionary, or checking whether an element is present in a set) run in O(1) on average.

Note that Python has a default recursion limit of 1000 calls, so it would be unwise to run a recursive depth-first search in Python on a graph with more than 1000 vertices. As an alternative, it is also possible to implement a depth-first search non-recursively. We'll see how in the next lecture.
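
If you do need deeper recursion, Python lets you raise the limit, though a deeply recursive search can still exhaust the process stack:

import sys

sys.setrecursionlimit(10_000)   # raise the default limit of 1000 calls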