Sometimes we write classes whose only purpose is to hold data. For example, consider a class that holds a geometric circle with a center and radius. We might define it like this:
class Circle {
public readonly double x, y;
public readonly double radius;
public Circle(double x, double y, double radius) {
(this.x, this.y) = (x, y);
this.radius = radius;
}
public override bool Equals(object? o) =>
o is Circle d && (x, y, radius) == (d.x, d.y, d.radius);
public override int GetHashCode() => (x, y, radius).GetHashCode();
public override String ToString() =>
$"Circle: x = {x}, y = {y}, radius = {radius}";
}
(Above, the readonly attribute indicates
that a field may be set only in a constructor, and is immutable after
that.)
This is a lot of code to type to define a simple data class. Fortunately C# offers a more compact alternative, namely records. Instead of the definition above, we may define our circle class using a record:
record Circle(double x, double y, double radius);
That was a lot less typing! :) When you define a record, C# synthesizes a class that is similar to the class above. In particular:
The class is immutable: fields cannot be changed after an instance is initialized.
The class supports comparison by structural
equality: two instances are equal if they have the same members
(like in the definition of Equals()
above).
The class has a GetHashCode()
method so that its instances can be used as a hash key.
The class has a ToString()
method that builds a string including all field values.
A record may have methods or other members such as properties and indexers. For example:
record Circle(double x, double y, double radius) {
public double circumference() => 2 * PI * radius;
public double area() => PI * PI * radius;
}We would now like to study how to write an interpreter for a simple programming language. In doing to we will use many of the ideas and techniques that we've seen in our first-year programming classes including recursion, regular expressions, expression trees and so on. Although our language will be simple, many of the concepts that we use will be applicable to interpreters for larger languages, or even when writing a compiler.
Any interpreter or compiler typically processes the input program in a series of phases:
First, we read the input program and use a lexer (or tokenizer) to convert it to a series of tokens. Typically each token represents a single keyword (e.g. "while" or "for"), an identifier (i.e. a variable name), or a punctuation symbol such as ">" or "!=".
We next use a parser to convert the tokens into an abstract syntax tree (AST) in memory.
Now there are several possibilities:
We may interpret the code in the AST directly. The interpreter we describe in these notes will do that.
We may convert the AST into native code, i.e. machine language for a particular CPU architecture. A C or C++ compiler typically works this way.
We may convert the AST into machine-independent byte code. We may then interpret the byte code directly; for example, the most common Python implementation (CPython) does this. Alternatively, we may use a just-in-time compiler (JIT) to convert the byte code into machine language just before executing it on the target machine; for example, C# and Java do this.
You may recognize the phases above: in Introduction to Algorithms we wrote code that can tokenize and parse expressions such as "(2 * (3 + 4))" into an expression tree, and evaluate them. Essentially that was an interpreter for a language of simple arithmetic expressions. We would now like to implement a language that is significantly more powerful.
Let's specify a programming language that we would like to interpret. We will call our language P2 (for "Programming 2"). We would like the language to be powerful enough that we can write e.g. a program that will print out all the prime numbers up to a given integer.
P2 will have the following features:
The types int (a
32-bit signed integer) and bool.
The arithmetic operators +,
-, *, /,
and %. The operators *, /, and % will
have higher precedence than + and -, so 3 + 4
* 5 means 3 + (4 * 5), not (3
+ 4) * 5.
The comparison operators ==,
!=, <,
<=, >,
and >=.
Assignment statements, plus the if
and while statements.
A print function
that can print out any integer or boolean value.
The P2 language will not have character, string, or array types, and will not allow the user to define their own functions.
P2 will use C-like syntax with braces and semicolons, rather than Python-like syntax. That's because the C-like syntax is easier to parse; in particular, we will be able to ignore all whitespace.
Suppose that a P2 program contains the line "x
= 4 + false;". How should we handle this? There are three
possibilities:
We could automatically convert the Boolean
value false into the integer 0, then
perform the addition.
We could consider this to be an error, and detect it before the program starts to run. This is an example of static type checking as found in languages such as C, C++ and C#.
We could consider this to be an error, and detect it at run time, when we attempt to execute the line. This is an example of dynamic type checking as found in languages such as Python.
None of these are particularly difficult to implement. We will decide to implement dynamic type checking in our interpreter.
There are three approaches commonly used for writing a lexer and parser:
1. We can write lexing/parsing code by hand. We first write a grammar for our language, then write lexing and parsing code that follows the structure of the grammar. A parser written in this way is called a recursive-descent parser.
2. We can use lexer generator and parser generator tools. These tools take a grammar as input, and output code that implements a lexer or parser. There are lexer and parser generators for many programming languages: among the most commonly used are Lex, Yacc and Bison.
3. We can use a parser combinator library that lets us express the grammar by calling combinator functions inside our own program. The Parsec library for Haskell is widely used, and has been ported to many programming languages. Pidgin is a similar parser combinator library for C#.
In these notes we will use approach 1: we will write a hand-coded recursive-descent parser for P2. In Non-Procedural Programming next year we may study approach 3 (parser combinators).
To make things easier to understand, we will implement our P2 interpreter in 4 phases:
1. We will handle only expressions involving integers, correctly parsing operators with precedence.
2. We will handle expressions with integers and booleans.
3. We will add assignment statements and calls to
the print() function.
4. We will add the if
and while statements.
In this phase we will handle expressions with integers. We wrote similar code in Introduction to Algorithms, however in that class we required expressions to be fully parenthesized (e.g. "((3 * 4) + (5 * 2))") and did not handle operator precedence. In our parser here we will allow parentheses but will not require them around all operations, and will handle precedence correctly.
We will first write a context-free grammar
that precisely describes the structure of our language, just as we
did in Introduction to Algorithms. A context-free grammar contains
both terminals, which are characters such as '*'
or '0', as well non-terminals,
which are names such as "op" or "expr". The
grammar consists of a set of rules that
describe how each non-terminal may be constructed from other
non-terminal or terminal symbols. Our presentation here will be
relatively informal. You will study context-free grammars more
formally in your automata class next year.
Here is a first attempt for a grammar for phase 1 of our implementation:
(* lexical *)
digit = '0' .. '9';
int = digit+;
(* expressions, first attempt *)
op = '+' | '-' | '*' | '/' | '%';
expr = int | expr op expr | '(' expr ')';This grammar is written in EBNF (Extended Backus-Naur Form), which is a precise specification for writing context-free grammars. Note the following:
| separates
alternative possibilities
.. means a range
of characters
+ means "1 or more"; for example,
"digit+" means one or more
digits
If you are developing a grammar in EBNF, I recommend that you install the handy extension EBNF Language Support for Visual Studio Code. It will highlight your grammar with nice colors, and also check that its syntax is valid.
The grammar above describes valid P2 expressions, but there is a fundamental problem: it is ambiguous. For example, consider the text "3 + 4 * 5". It could be parsed like this:
3 + 4 * 5 expr op expr
Or like this:
3 + 4 * 5 expr op expr
In the first case, we will get an abstract syntax tree representing 3 + (4 * 5); in the second, our AST will represent (3 + 4) * 5.
To remove the ambiguity, we can rewrite the grammar using a different non-terminal symbol for each precedence level. Our next attempt looks like this:
(* expressions, second attempt *)
primary = int | '(' expr ')';
factor = primary | factor ('*' | '/' | '%') factor;
term = factor | term ('+' | '-') term;
expr = term;We can read this as follows. A primary is an integer constant, or a parenthesized expression. A factor consists of one or more primaries; if there are more than one, they are combined using the '*', '/' or '%' operations. A term consists of one or more factors; if there are more than one, they are combined using the '+' or '-' operations. Note that an integer constant is itself a primary, a factor, a term and also an expression.
Now "3 + 4 * 5" can only be parsed like this:
3 + 4 * 5 term term
It cannot be parsed like this, since 3 + 4 is not a factor:
3 + 4 * 5 ??? factor
However there are still two problems with our grammar. First, it is still ambiguous within each precedence level. For example, 3 * 4 % 3 could be parsed like this:
3 * 4 % 3 factor factor
Or like this:
3 * 4 % 3 factor factor
A second problem is that our grammar is left-recursive, which
means that it contains rules that expand a non-terminal to a sequence
beginning with the same non-terminal. For example, we can
expand term --> term
'+' term or factor --> factor
'*' factor. A recursive-descent parser will have one function
for each non-terminal in the grammar. If a rule is left-recursive
then the function for the corresponding non-terminal will begin by
calling itself, which will cause an infinite recursion.
For this reason, if we want to write a recursive-descent parser we must refactor the grammar to eliminate the left recursion. We can do so like this:
(* expressions *)
primary = int | '(' expr ')';
factor = primary { ('*' | '/' | '%') primary };
term = factor { ('+' | '-') factor };
expr = term;
In an EBNF grammar a sequence in braces is repeated 0 or more
times. For example, in the grammar above a factor consists of a
primary, followed by 0 or more repetitions of '*',
'/', or '%'
plus another primary. A term consists of a factor, followed by 0 or
more repetitions of '+' or '-'
plus another factor. In this form the grammar is not left-recursive,
and is unambiguous: there is only one way to parse each expression.
Our goal is to write a parser that will product an
abstract syntax tree (AST), so we must decide how to represent an AST
in our code. We will use two record types: Const
will represent an integer constant, and BinOp
will represent a binary operation (e.g. '+'
or '*') applied to two expressions. Both
of these types will implement an interface Expr,
representing any expression:
interface Expr { }
record Const(int i) : Expr;
record BinOp(Expr e, string op, Expr f) : Expr;Recall that a lexer converts the input into a
sequence of tokens. For a simple language such as P2 the easiest way
to write a lexer is by using regular expressions, using strings to
represent tokens. In this first development phase a token is either
an integer (i.e. a sequence of digits), or any other character that
is not whitespace. In a regular expression \d
matches any digit and \S matches any
non-whitespace character, so the regular expression \d+|\S
will match any token of our language.
For simplicity we will write our lexer and parser
in a single class called Parser. The
Parser constructor will perform lexical
analysis, i.e. will break the input into a series of tokens, which we
will store in a List<string>. The
Parser class will also have a field pos
that represents the current index into the tokens. So our class
begins like this:
class Parser {
List<string> tokens = [];
int pos;
public Parser(string input) {
foreach (Match m in Regex.Matches(input, @"\d+|\S"))
tokens.Add(m.Value);
}
...Next we will write several methods that will be very useful for writing a recursive-descent parser:
next() fetches
the next token and returns it.
peek() looks at
the next token without advancing the token index.
match(string s)
looks at the next token to see if it is s. If so, it advances the
token index and returns true; otherwise it returns false.
Here is their implementation:
string next() => tokens[pos++];
string? peek() => pos < tokens.Count ? tokens[pos] : null;
bool match(string s) {
if (peek() == s) {
pos += 1;
return true;
}
return false;
}Now we will write the parser itself. As we mentioned above, each non-terminal in the grammar will become a function. The code will directly mirror the grammar: the function implementing a non-terminal T must explore all possible expansions for T in the grammar, calling functions that represent other non-terminals as necessary. For example, the grammar rule
primary = int | '(' expr ')';becomes this function:
public Expr parse_primary() {
string t = next();
if (t == "(") {
Expr e = parse_expr();
assert(match(")"));
return e;
}
return new Const(int.Parse(t));
}The grammar rule
factor = primary {('*' | '/' | '%') primary};becomes this function:
public Expr parse_factor() {
Expr e = parse_primary();
while (peek() == "*" || peek() == "/" || peek() == "%")
e = new BinOp(e, next(), parse_primary());
return e;
}We can translate a grammar into a recursive-descent parser somewhat mechanically. In fact, as we mentioned above there are parser generator tools that can automatically generate parser code when given a grammar as input. However it is an excellent exercise to write a recursive-descent parser yourself, and in fact many real-world programming language implementations have hand-written parsers of this nature.
Here is the entire parser for P2 phase 1; it is about 55 lines long.
We must now write code to evaluate an expression
represented by an abstract syntax tree. As we saw in Introduction to
Algorithms, we can implement this straightforwardly using recursion.
The Expr interface will declare an
eval() method, which we will implement
separately for each node type (Const or
BinOp). Here is the code:
interface Expr {
int eval();
}
record Const(int i) : Expr {
public int eval() => i;
}
record BinOp(Expr e, string op, Expr f) : Expr {
public int eval() {
int x = e.eval(), y = f.eval();
return op switch {
"+" => x + y,
"-" => x - y,
"*" => x * y,
"/" => x / y,
"%" => x % y,
_ => throw new Exception("bad operation")
};
}
}Here is the entire interpreter for P2 phase 1 including a Main method (about 35 lines in all). It works with the parser code linked to above. Here are some expressions that we can now evaluate:
(4 + 12) * (3 - 4)
2 + 30 * 40 + 50 * 60
In this development phase we will extend expressions so that they may compute either integer or boolean values, and we will add the boolean comparison operators ==, >, and so on to our language.
In phase 1 the eval()
method returned an integer, but now it must be able to return either
an integer or boolean. So we will now define an interface Value
that represents either of these types, with record types Bool
and Int that implement Value.
It will also be useful to have a method as_int()
that converts a Value to a C# int,
or throws an exception if the Value does
not represent an integer. We will also write a similar method
as_bool(), as well as a to_string()
method that we will use for printing values. Here are our
definitions:
interface Value {
string to_string();
bool as_bool() => throw new Exception("bool expected");
int as_int() => throw new Exception("int expected");
}
record Bool(bool b) : Value {
public bool as_bool() => b;
public string to_string() => b.ToString();
}
record Int(int i) : Value {
public int as_int() => i;
public string to_string() => i.ToString();
}In this phase we will make a couple of modest extensions to the grammar. First, we will allow the Boolean constants true and false to appear literally in a P2 program:
primary = 'false' | 'true' | int | '(' expr ')';We will also add a new non-terminal that handles comparison operators:
comparison = term [('==' | '!=' | '<' | '<=' | '>' | '>=') term];
expr = comparison;
We must also extend the regular expression that recognizes tokens.
Now any sequence of lowercase letters will be a single token, so that
the Boolean constants false and true
will be tokens. Also, a sequence of punctuation characters such as >=
or <= must be a single token:
public Parser(string input) {
foreach (Match m in Regex.Matches(input, @"\d+|[a-z]+|[=!<>]+|\S"))
tokens.Add(m.Value);
}
In our abstract syntax we will modify the Const
type so that it holds a Value (which can
be an integer or boolean):
record Const(Value v) : Expr {
public Value eval() => v;
}Here is the complete parser for P2 phase 2 (about 65 lines of code).
In the BinOp class we
will now enhance the eval() method to
handle all of the operators that we now support. Because eval() will
now return a Value, our code for each
operator must convert each operand from a Value
to an appropriate C# type, and must also wrap the result in an Int
or Bool as appropriate. Here is the
code:
record BinOp(Expr e, string op, Expr f) : Expr {
public Value eval() {
Value x = e.eval(), y = f.eval();
return op switch {
"+" => new Int(x.as_int() + y.as_int()),
"-" => new Int(x.as_int() - y.as_int()),
"*" => new Int(x.as_int() * y.as_int()),
"/" => new Int(x.as_int() / y.as_int()),
"%" => new Int(x.as_int() % y.as_int()),
"==" => new Bool(x.Equals(y)),
"!=" => new Bool(!x.Equals(y)),
"<" => new Bool(x.as_int() < y.as_int()),
"<=" => new Bool(x.as_int() <= y.as_int()),
">" => new Bool(x.as_int() > y.as_int()),
">=" => new Bool(x.as_int() >= y.as_int()),
_ => throw new Exception("bad operation")
};
}
}Here is the complete interpreter for P2 phase 2 including a Main method (about 55 lines of code). It works with the phase 2 parser linked to above. Here are some expressions that we can now evaluate:
4 * 5 > 5 * 6
4 < 5 + 1 == true
true == false
In this phase we will add statements to the P2 language. We will have two sorts of statements: assignment statements and print statements. Once this phase is complete, we will be able to run a program such as the following:
x = 2 + 3 y = x + 1 z = x * y print(z) z = z + 1 print(z + 10)
We will extend our grammar so that expressions can hold variables:
letter = 'a' .. 'z';
id = letter+;
primary = 'false' | 'true' | int | id | '(' expr ')';
In the abstract syntax we will now have a Var
type representing a variable:
record Var(string s) : Expr;
We can now write the grammar for statements:
paren_expr = '(' expr ')';
print = 'print' paren_expr ';';
assignment = id '=' expr ';';
statement = print | assignment;
program = {statement};
In the abstract syntax we will now have an interface Stmt
for any kind of statement, along with types Assign
and Print for print and assignment
statements. We will also have a type Block
that holds a series of statements; a top-level program will be a
Block.
interface Stmt { }
record Assign(string id, Expr e) : Stmt;
record Print(Expr e) : Stmt;
record Block(List<Stmt> stmts) : Stmt;
We can easily enhance our parser to handle statements. For example,
we will add a parse_print() function
that parses a print statement:
public Stmt? parse_print() {
if (match("print")) {
Print p = new(parse_paren_expr());
assert(match(";"));
return p;
}
return null;
}
Notice that this function returns a Stmt
object if it can parse one, otherwise null. Our function
parse_statement() for parsing any sort
of statement looks like this:
public Stmt parse_statement() =>
parse_print() ?? parse_assignment();The ?? operator will return its left operand if it is not null, otherwise its right operand. It is handy for chaining together two or more alternatives in a parser.
Here is the complete parser for P2 phase 3 (about 100 lines of code).
We must now enhance our interpreter to handle statements. Because we have variables, we must now evaluate expressions and run statements in the context of an environment that holds the current value of each variable. We will represent an environment using a C# dictionary:
using Env = System.Collections.Generic.Dictionary<string, Value>;
The eval() method will now take an Env
as an argument:
interface Expr {
Value eval(Env env);
}
Evaluating a Var looks up its value in
the environment:
record Var(string s) : Expr {
public Value eval(Env env) => env[s];
}
The Stmt interface has a method run(Env
env) that runs a statement in a given environment. Each type
of statement has its own implementation of this method. For example,
an Assign statement assigns a value to a
variable in the environment:
record Assign(string id, Expr e) : Stmt {
public void run(Env env) {
env[id] = e.eval(env);
}
}
A Block is a compound statement, so
running a Block runs all the statements
it contains:
record Block(List<Stmt> stmts) : Stmt {
public void run(Env env) {
foreach (Stmt stmt in stmts)
stmt.run(env);
}
}
Here is the complete interpreter
for P2 phase 3 including a Main method
(about 85 lines of code). It works with the phase 3 parser linked to
above.
In this phase we will complete our P2
implementation by adding the if and
while statements. This will greatly
increase the power of the language: with conditionals and loops we
can write many sorts of program.
Actually this is the easiest of our four implementation phases! With all of the infrastructure that we built in the preceding 3 phases, our changes in this phase will be straightforward. Here are the grammar enhancements we need:
block = '{' {statement} '}';
if = 'if' paren_expr block;
while = 'while' paren_expr block;
statement = if | while | print | assignment;
We will extend our abstract syntax with types If
and While for representing if
and while statements:
record If(Expr e, Stmt stmt) : Stmt; record While(Expr e, Stmt stmt) : Stmt;
We can enhance the recursive-descent parser to handle these
statements in a straightforward way. Our parse_statement()
function now looks like this:
public Stmt parse_statement() =>
parse_if() ?? parse_while() ?? parse_print() ?? parse_assignment();Here is the complete parser for our final P2 implementation (about 110 lines of code).
Finally, we must implement the run()
method for if and while
statements. That is not difficult:
record If(Expr e, Stmt stmt) : Stmt {
public void run(Env env) {
if (e.eval(env).as_bool())
stmt.run(env);
}
}
record While(Expr e, Stmt stmt) : Stmt {
public void run(Env env) {
while (e.eval(env).as_bool())
stmt.run(env);
}
}
Notice that we implement P2's if
statement using a C# if statement, and
implement P2's while statement using a
C# while statement! That works because
these statements exist in both P2 and C# and are essentially
identical in the two languages.
Here is the
complete interpreter for our final P2 implementation including a
Main method (about 100 lines of code).
It works with the parser linked to above.
At last our implementation is complete. We can now run the following P2 program to print all prime numbers below 100:
n = 2;
while (n < 100) {
i = 2;
prime = true;
while (i * i <= n) {
x = n % i;
if (n % i == 0) {
prime = false;
}
i = i + 1;
}
if (prime) {
print(n);
}
n = n + 1;
}The output is
$ dotnet run primes.p2 2 3 5 7 11 ... 83 89 97 $