In the last lecture we discussed two fundamental graph search algorithms, namely depth-first search and breadth-first search. We also talked about searching in state spaces.
We have seen how to write a breadth-first search function that determines the shortest distance from the starting vertex to a given vertex in a graph. However we did not see how to produce an actual shortest path between vertices, so let's discuss that now.
As a review, here's a function that performs a breadth-first search in a graph using a queue:
def bfs(g, start): visited = set() q = Queue() q.enqueue(start) visited.add(start) while not q.is_empty(): v = q.dequeue() print(v) for w in g[v]: if w not in visited: q.enqueue(w) visited.add(w)
As it runs, a breadth-first search implicitly generates a breadth-first tree, in which each vertex points to all vertices that are first discovered from it as we search. Here is a breadth-first tree generated by searching our graph of European countries, starting with Austria:
In this picture, the dotted lines are not part of the breadth-first tree. They are edges which were not followed during the search, since they pointed to vertices that were already in the visited set.
The breadth-first tree contains a shortest path from the starting vertex ('austria') to every other vertex in the graph.
Let's now reverse all edges in the breadth-first tree. The result looks like this:
In this picture, each vertex points to its predecessor, namely the vertex from which we discovered it during the breadth-first search. Starting with any vertex v, we can follow arrows to reach the starting vertex. As we do so, we will be walking along a shortest path from v to the starting vertex.
We can modify our breadth-first search code so that as it runs, it remembers the predecessor of each vertex that it discovers. Then once we reach a goal vertex we can retrace a path back to the starting vertex. If we reverse this path, we will have a path from the start to the goal. Here is the updated code:
# Return a list of vertices along a shortest path from the given start vertex # to the given goal. If there is no such path, return None. def bfs_path(graph, start, goal): visited = set() pred = {} q = Queue() q.enqueue(start) visited.add(start) pred[start] = None # start vertex has no predecessor while not q.is_empty(): v = q.dequeue() if v == goal: path = [] while True: path.append(v) if pred[v] == None: break v = pred[v] path.reverse() return path for w in graph[v]: if w not in visited: q.enqueue(w) visited.add(w) pred[w] = v # remember w's predecessor return None # no path
Arithmetic expressions are composed from
numeric constants and from operators that act on those
numbers. We will use non-negative integer constants with the binary
operators +
, -
, *
and /
,
the last of which denotes integer division.
In traditional mathematical notation, these
operators are infix operators: they are written between the
values that they act on (which are called operands). For
example, we write 2 + 2
, or 8 - 7
. In this
last expression, 8 and 7 are the operands.
Here are some arithmetic expressions written using infix notation:
(4
+ 5) * (2 + 1)
(18
/
2)
-
1
In infix notation we may write parentheses to
distinguish expressions that would otherwise be ambiguous. For
example, 4 + 5 * 9
could mean either (4
+ 5) * 9
or 4 + (5 * 9)
. Another way to
disambiguate expressions is using operator precedence. For
example, *
is generally considered to have higher
precedence than +
. However, here we will assume that no
such precedence exists; we must use parentheses to disambiguate.
We may choose an alternative syntax that uses
prefix notation, in which we write each operator before its
operands. For example, we write + 2 4
to mean the
sum of 2 and 4, or / 8 2
to mean 8 divided by 2. Here
are the above expressions rewritten in prefix notation:
* + 4 5 + 2 1 - / 18 2 1
Or we may use postfix notation, in which operators come after
both operands: we might write 4 5 +
to mean the sum of 4
and 5. Here are the above expressions in postfix notation:
4 5 + 2 1 + * 18 2 / 1 -
In prefix or postfix notation there is no need either for parentheses
or operator precedence, because expressions are inherently
unambiguous. For example, the prefix expression * + 4 5 9
is equivalent to ((4 + 5) * 9)
, and the prefix
expression + 4 * 5 9
is equivalent to (4 + (5 *
9))
. There is no danger of confusing these in prefix form,
even without parentheses.
We will now discuss how to store and work with arithmetic expressions in a program.
We may store expressions using an expression tree, which is a form of abstract syntax tree. An expression tree reflects an expression’s hierarchical structure. For example, here is an expression tree for the infix expression ((3 + 4) * (2 + 5)):
Note that this expression tree also corresponds to the prefix expression * + 3 4 + 2 5, or the postfix expression 3 4 + 2 5 + *. Equivalent expressions in infix, prefix or postfix form will always have the same tree! That’s because this tree reflects the abstract syntax of the expression, which is independent of concrete syntax which defines how expressions are written as strings of symbols.
We can store
expression trees using
objects in Python.
We will have two classes of nodes. Every interior node of the tree
represents
a binary
operation, and will be
represented by a node of class Op
.
Every leaf node holds
an integer constant,
and will be represented by a node of class Val
.
Here are definitions for these classes:
class Op: def __init__(self, left, op, right): self.left = left self.op = op self.right = right class Val: def __init__(self, val): self.val = val
We can build the expression tree in the picture above as follows:
left = Op(Val(3), '+', Val(4)) right = Op(Val(2), '+', Val(5)) tree = Op(left, '*', right)
If we have an expression tree in memory, we may wish to evaluate it, which means to determine its value. For example, when we evaluate the expression (4 + 5) * (2 + 1), we get the value 27.
We can evaluate an expression tree with a straightforward recursive function:
def eval(n): if isinstance(n, Val): return n.val # base case assert isinstance(n, Op), 'invalid expression' l = eval(n.left) r = eval(n.right) if n.op == '+': return l + r if n.op == '-': return l - r if n.op == '*': return l * r if n.op == '/': return l // r assert False, 'invalid operator'
Let's try it out on the small expression tree that we built above representing the expression ((3 + 4) * (2 + 5)):
>>> eval(tree) 49 >>>
Given an expression tree, it’s not hard to convert it to a string representation. For example, we can generate an expression in prefix notation as follows:
# Given an expression tree, return an expression in prefix notation. def prefix(n): if isinstance(n, Val): return str(n.val) assert isinstance(n, Op), 'invalid expression' return f'{n.op} {prefix(n.left)} {prefix(n.right)}'
Let's try it on our small expression tree:
>>> prefix(tree) '* + 3 4 + 2 5'
Generating an expression in postfix notation will be similar.
Now let's consider generating an infix expression. We've decided that we will have no operator precedence, so we will generate a fully parenthesized expression that so that our expression will be unambiguous. Here is the code:
def infix(n): if isinstance(n, Val): return str(n.val) assert isinstance(n, Op), 'invalid expression' return f'({infix(n.left)} {n.op} {infix(n.right)})'
Let's try it:
>>> infix(tree) '((3 + 4) * (2 + 5))'
We have seen how to convert an expression tree to a string in infix, prefix, or postfix notation.
Now let's consider the inverse problem: given a string in infix, prefix, or postfix notation we would like to parse it into an expression tree.
Many parsers are divided into two phases. In the first phase, called lexical analysis, we break the input string into a series of tokens. In the second phase we examine the tokens and use the to construct an expression tree.
We will follow
this approach. For our simple expression language, a token will be
either an integer constant (represented by a Python int) or one of
the symbols '(', ')', '+', '-', '*' or '/' (represented by a Python
string). For example, the string '(4
+ 15) * (22 - 1)
' has
these tokens:
( 4 + 15 ) * ( 22 - 1 )
Let's write a class that performs lexical analysis for this simple language.
class Lexer: def __init__(self, s): self.s = s self.i = 0 # position of next token # Read and return the next token (an int or a char), # or return None if there are no more. def next(self): # move past spaces while self.i < len(self.s) and self.s[self.i] == ' ': self.i += 1 if self.i >= len(self.s): return None c = self.s[self.i] self.i += 1 if c.isdigit(): # start of an integer constant t = c while self.i < len(self.s) and self.s[self.i].isdigit(): t += self.s[self.i] self.i += 1 return int(t) assert c in '+-*/()', 'invalid character' return c
Let's try it:
>>> l = Lexer('(4 + 15) * (22 - 1)') >>> l.next() '(' >>> l.next() 4 >>> l.next() '+' >>> l.next() 15 >>> l.next() ')' >>> l.next() '*'
The set of all valid expressions in infix (or prefix, or postfix) syntax forms a language. When we write a parser for any language, it's helpful to have a grammar, which is a formal definition of the language's syntax.
As a
first step, consider the language of arithmetic expressions in
prefix form,
with non-negative integer constants and the binary operators +
,
-
,
*
,
and /
.
We can formally define the syntax
of this language using the following context-free grammar:
digit = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' int = digit+ op = '+' | '-' | '*' | '/' expr = int | op expr expr
Context-free grammars are commonly used to define programming languages. A grammar contains non-terminal symbols (such as digit, op, expr) and terminal symbols (such as '0', '1', '2', …, '+', '-', '*', '/'), which are characters. A grammar consists of a set of production rules that define how each non-terminal symbol can be constructed from other symbols. These rules collectively define which strings of terminal symbols are syntactically valid. (You will study context-free grammars in more detail in more advanced courses, such as an automata course.)
The first production rule above says that a digit
is any of the characters '0', '1', ..., '9'. The second rule
"int =
digit+
"
says that an integer consists of one or more digits. Here, the symbol
+ (which comes from the world of regular expressions, which we will
study later) means "one or more". We have already written a
simple lexical analyzer that embodies these rules.
The third rule above defines the set of valid operator characters. The fourth rule is recursive, and is the most important: it says that every expression is either an integer constant, or an operator followed by two expressions.
Following the grammar above, we would now like to write a function that can parse arithmetic expressions in prefix notation. We'll use the lexical analyzer we wrote above to generate tokens. We will write a nested recursive function parse() that reads a subexpression at the current token position and returns an expression tree for it. Here is the code:
# Given a string containing an expression in prefix syntax, # parse it into an expression tree. def parse_prefix(s): r = Lexer(s) # Parse a subexpression starting at the current point. def parse(): t = r.next() # token if isinstance(t, int): return Val(t) assert t in '+-*/', 'invalid operator' left = parse() right = parse() return Op(left, t, right) return parse()
Notice that the structure of the recursive function parse() follows the grammar rule for prefix expressions:
expr = digit | op expr expr
To put it differently, we have written a recursive-descent parser for this grammar. (Many compilers for real-world programming languages also use recursive-descent parsers, though this is not the only possible way to write a parser.)
Let's parse a prefix expression, then use the resulting expression tree to convert back into prefix notation, and also into infix notation:
>>> e = parse_prefix('* + 3 4 - 5 2') >>> prefix(e) '* + 3 4 - 5 2' >>> infix(e) '((3 + 4) * (5 - 2))' >>>
Parsing an infix expression is not much more difficult. Let's modify the context-free grammar that we wrote above so that it will represent infix expressions. We only need to change the last line:
digit = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' int = digit+ op = '+' | '-' | '*' | '/' expr = int | '(' expr op expr ')'
Notice that in this grammar, all expressions must be fully parenthesized. That is to say, whenever we apply an operator to two operands, we must place parentheses around the result. This ensures that our grammar is unambiguous.
We may now write a recursive function that mirrors the rule for expressions, i.e. the last line in the grammar above:
# Given a string containing an expression in infix syntax, # parse it into an expression tree. def parse_infix(s): reader = Reader(s) # Parse a subexpression starting at the current point. def parse(): t = reader.next() if isinstance(t, int): return Val(t) assert t == '(', 'expected (' left = parse() op = reader.next() assert op in '+-*/', 'invalid operator' right = parse() assert reader.next() == ')', 'expected )' return Op(left, op, right) return parse()
Once we have parsed an infix expression, we can easily evaluate it, or convert to a prefix representation:
>>> e = parse_infix('((2 + 3) * (4 + 5))') >>> eval(e) 45 >>> to_prefix(e) '* + 2 3 + 4 5'
In this lecture we did not have time to discuss how to parse an expression in postfix form. Briefly, a recursive-descent parser will not work for this sort of expression; instead, we need to use a stack. Each time the parser sees an integer constant, it will push a Val node to the stack. When it sees an operator character, it will pop two expressions from the stack and push an Op node that combines them. When it has read the entire input string, the stack will contain a single element, namely the root node of an expression tree for the entire input expression. You may wish to think about this more and attempt to implement a postfix expression parser as an exercise.