Arithmetic expressions are composed from
numeric constants and from operators that act on those
numbers. In this section we'll use non-negative integer constants
with the binary operators +, -, *
and /, the last of which denotes integer division.
In traditional mathematical notation, these
operators are infix operators: they are written between the
values that they act on (which are called operands). For
example, we write 2 + 2, or 8 - 7. In this
last expression, 8 and 7 are the operands.
Here are some arithmetic expressions written using infix notation:
(4
+ 5) * (2 + 1)
(18
/
2)
-
1
In infix notation we may write parentheses to
distinguish expressions that would otherwise be ambiguous. For
example, 4 + 5 * 9 could mean either (4
+ 5) * 9 or 4 + (5 * 9). Another way to
disambiguate expressions is using operator precedence. For
example, * is generally considered to have higher
precedence than +. However, here we will assume that no
such precedence exists; we must use parentheses to disambiguate.
We may choose an alternative syntax that uses
prefix notation, in which we write each operator before its
operands. For example, we write + 2 4 to mean the
sum of 2 and 4, or / 8 2 to mean 8 divided by 2. Here
are the above expressions rewritten in prefix notation:
* + 4 5 + 2 1 - / 18 2 1
Or we may use postfix notation, in which operators come after
both operands: we might write 4 5 + to mean the sum of 4
and 5. Here are the above expressions in postfix notation:
4 5 + 2 1 + * 18 2 / 1 -
In prefix or postfix notation there is no need either for parentheses
or operator precedence, because expressions are inherently
unambiguous! For example, the prefix expression * + 4 5 9
is equivalent to ((4 + 5) * 9), and the prefix
expression + 4 * 5 9 is equivalent to (4 + (5 *
9)). There is no danger of confusing these in prefix form,
even without parentheses.
We will now discuss how to store and work with arithmetic expressions in a program.
We may store expressions using an expression tree, which is a form of abstract syntax tree. An expression tree reflects an expression’s hierarchical structure. For example, here is an expression tree for the infix expression ((3 + 4) * (2 + 5)):
Note that this expression tree also corresponds to the prefix expression * + 3 4 + 2 5, or the postfix expression 3 4 + 2 5 + *. Equivalent expressions in infix, prefix or postfix form will always have the same tree! That’s because this tree reflects the abstract syntax of the expression, which is independent of concrete syntax which defines how expressions are written as strings of symbols.
We can store expression trees using
objects in Python.
We'll have a parent class Expr representing any expression, with two
subclasses for our different node types. Every leaf node holds an
integer constant, and will be represented by a node of class IntExpr.
Every interior node of the tree represents
a binary operation, and will be represented by a node of class
OpExpr. Here are definitions for these classes:
class Expr:
pass
class IntExpr(Expr):
def __init__(self, val):
self.val = val
class OpExpr(Expr):
def __init__(self, op, left, right):
self.op = op
self.left, self.right = left, rightWe can build the expression tree in the picture above as follows:
left = OpExpr('+', IntExpr(3), IntExpr(4))
right = OpExpr('+', IntExpr(2), IntExpr(5))
tree = OpExpr('*', left, right)If we have an expression tree in memory, we may wish to evaluate it, which means to compute its value. For example, when we evaluate the expression (4 + 5) * (2 + 1), we get the value 27.
Let's write a method eval() for evaluating an expression, with implementations in the IntExpr and OpExpr classes:
# in class IntExpr
def eval(self):
return self.val
# in class OpExpr
def eval(self):
l = self.left.eval()
r = self.right.eval()
match self.op:
case '+': return l + r
case '-': return l - r
case '*': return l * r
case '/': return l // r
case '_': assert FalseLet's try it out on the small expression tree that we built above representing the expression ((3 + 4) * (2 + 5)):
>>> tree.eval() 49 >>>
Given an expression tree, it’s not hard to convert it to a string representation. For example, we can generate an expression in prefix notation as follows:
# in class IntExpr
def to_prefix(self):
return str(self.val)
# in class OpExpr
def to_prefix(self):
return f'{self.op} {self.left.to_prefix()} {self.right.to_prefix()}'Let's try it on our small expression tree:
>>> tree.to_prefix() '* + 3 4 + 2 5'
Generating an expression in postfix notation will be similar.
Now let's consider generating an infix expression. We've decided that we will have no operator precedence, so we will generate a fully parenthesized expression that so that our expression will be unambiguous. Here is the code:
# in class IntExpr
def to_infix(self):
return str(self.val)
# in class OpExpr
def to_infix(self):
return f'({self.left.to_infix()} {self.op} {self.right.to_infix()})'Let's try it:
>>> tree.to_infix() '((3 + 4) * (2 + 5))'
We have seen how to convert an expression tree to a string in infix, prefix, or postfix notation.
Now let's consider the inverse problem: given a string in infix, prefix, or postfix notation we would like to parse it into an expression tree.
Many parsers are divided into two phases. In the first phase, called lexical analysis, we break the input string into a series of tokens. In the second phase we examine the tokens and use them to construct an expression tree.
We will follow this approach. For our simple
expression language, a token will be either an integer constant
(represented by a Python int) or one of the symbols '(', ')', '+',
'-', '*' or '/' (represented by a Python string). For example, the
string '(4 + 15) * (22 - 1)' has these tokens:
( 4 + 15 ) * ( 22 - 1 )
Let's write a class that performs lexical analysis for this simple language.
class Lexer:
def __init__(self, s):
self.s = s
self.i = 0 # position of next token
# Read and return the next token (an int or a char),
# or return None if there are no more.
def next(self):
# move past spaces
while self.i < len(self.s) and self.s[self.i] == ' ':
self.i += 1
if self.i >= len(self.s):
return None
c = self.s[self.i]
self.i += 1
if c.isdigit(): # start of an integer constant
t = c
while self.i < len(self.s) and self.s[self.i].isdigit():
t += self.s[self.i]
self.i += 1
return int(t)
assert c in '+-*/()', 'invalid character'
return cLet's try it:
>>> l = Lexer('(4 + 15) * (22 - 1)')
>>> l.next()
'('
>>> l.next()
4
>>> l.next()
'+'
>>> l.next()
15
>>> l.next()
')'
>>> l.next()
'*'The set of all valid expressions in infix (or prefix, or postfix) syntax forms a language. When we write a parser for any language, it's helpful to have a grammar, which is a formal definition of the language's syntax.
As a
first step, consider the language of arithmetic expressions in
prefix form, with non-negative integer constants and the binary
operators +, -, *, and /.
We can formally define the syntax of this language using the
following context-free grammar:
digit = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' int = digit | int digit op = '+' | '-' | '*' | '/' expr = int | op expr expr
Context-free grammars are commonly used to define programming languages. A grammar contains non-terminal symbols (such as digit, op, expr) and terminal symbols (such as '0', '1', '2', …, '+', '-', '*', '/'), which are characters. A grammar consists of a set of production rules that define how each non-terminal symbol can be constructed from other symbols. These rules collectively define which strings of terminal symbols are syntactically valid. (You will study context-free grammars in more detail in more advanced courses, such as an automata course.)
The first production rule above says that a digit
is any of the characters '0', '1', ..., '9'. The second rule
"int = digit | int digit" says that
an integer is either a digit, or an integer followed by a digit. To
put it differently, this is a recursive rule stating that an integer
consists of any number of digits. We have already written a simple
lexical analyzer that embodies these first two rules.
The third rule above defines the set of valid operator characters. The fourth rule is recursive, and is the most important: it says that every expression is either an integer constant, or an operator followed by two expressions.
Following the grammar above, we would now like to write a function that can parse arithmetic expressions in prefix notation. We'll use the lexical analyzer we wrote above to generate tokens. We will write a nested recursive function parse() that reads a subexpression at the current token position and returns an expression tree for it. Here is the code:
# Given a string containing an expression in prefix syntax,
# parse it into an expression tree.
def parse_prefix(s):
r = Lexer(s)
# Parse a subexpression starting at the current point.
def parse():
t = r.next() # token
if isinstance(t, int):
return IntExpr(t)
assert t in '+-*/', 'invalid operator'
left = parse()
right = parse()
return OpExpr(t, left, right)
return parse()Notice that the structure of the recursive function parse() follows the grammar rule for prefix expressions:
expr = digit | op expr expr
To put it differently, we have written a recursive-descent parser for this grammar. In fact many compilers for real-world programming languages also use recursive-descent parsers (though this is not the only possible way to write a parser).
Let's parse a prefix expression, then use the resulting expression tree to convert back into prefix notation, and also into infix notation:
>>> e = parse_prefix('* + 3 4 - 5 2')
>>> e.to_prefix()
'* + 3 4 - 5 2'
>>> e.to_infix()
'((3 + 4) * (5 - 2))'
>>> Parsing an infix expression is not much more difficult. Let's modify the context-free grammar that we wrote above so that it will represent infix expressions. We only need to change the last line:
digit = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
int = digit | int digit
op = '+' | '-' | '*' | '/'
expr = int | '(' expr op expr ')'Notice that in this grammar, all expressions must be fully parenthesized. That is to say, whenever we apply an operator to two operands, we must place parentheses around the result. This ensures that our grammar is unambiguous.
We may now write a recursive function that mirrors the rule for expressions, i.e. the last line in the grammar above:
# Given a string containing an expression in infix syntax,
# parse it into an expression tree.
def parse_infix(s):
r = Lexer(s)
# Parse a subexpression starting at the current point.
def parse():
t = r.next()
if isinstance(t, int):
return IntExpr(t)
assert t == '(', 'expected ('
left = parse()
op = r.next()
assert op in '+-*/', 'invalid operator'
right = parse()
assert r.next() == ')', 'expected )'
return OpExpr(op, left, right)
return parse()Once we have parsed an infix expression, we can easily evaluate it, or convert to a prefix representation:
>>> e = parse_infix('((2 + 3) * (4 + 5))')
>>> e.eval()
45
>>> e.to_prefix()
'* + 2 3 + 4 5'In this lecture we did not have time to discuss how to parse an expression in postfix form. Briefly, a recursive-descent parser will not work for this sort of expression; instead, we need to use a stack. Each time the parser sees an integer constant, it will push an IntExpr node to the stack. When it sees an operator character, it will pop two expressions from the stack and push an OpExpr node that combines them. When it has read the entire input string, the stack will contain a single element, namely the root node of an expression tree for the entire input expression. You may wish to think about this more and attempt to implement a postfix expression parser as an exercise.