Week 13: Notes

records

Sometimes we write classes whose only purpose is to hold data. For example, consider a class that holds a geometric circle with a center and radius. We might define it like this:

class Circle {
    public readonly double x, y;
    public readonly double radius;

    public Circle(double x, double y, double radius) {
        (this.x, this.y) = (x, y);
        this.radius = radius;
    }

    public override bool Equals(object? o) =>
        o is Circle d && (x, y, radius) == (d.x, d.y, d.radius);

    public override int GetHashCode() => (x, y, radius).GetHashCode();

    public override String ToString() =>
        $"Circle: x = {x}, y = {y}, radius = {radius}";
}

(Above, the readonly attribute indicates that a field may be set only in a constructor, and is immutable after that.)

This is a lot of code to type to define a simple data class. Fortunately C# offers a more compact alternative, namely records. Instead of the definition above, we may define our circle class using a record:

record Circle(double x, double y, double radius);

That was a lot less typing! :) When you define a record, C# synthesizes a class that is similar to the class above. In particular:

The class is immutable: fields cannot be changed after an instance is initialized.
The class supports comparison by structural equality: two instances are equal if they have the same members (like in the definition of Equals() above).
The class has a GetHashCode() method so that its instances can be used as a hash key.
The class has a ToString() method that builds a string including all field values.

A record may have methods or other members such as properties and indexers. For example:

record Circle(double x, double y, double radius) {
    public double circumference() => 2 * PI * radius;
    
    public double area() => PI * PI * radius;
}

writing an interpreter

We would now like to study how to write an interpreter for a simple programming language. In doing to we will use many of the ideas and techniques that we've seen in our first-year programming classes including recursion, regular expressions, expression trees and so on. Although our language will be simple, many of the concepts that we use will be applicable to interpreters for larger languages, or even when writing a compiler.

Any interpreter or compiler typically processes the input program in a series of phases:

First, we read the input program and use a lexer (or tokenizer) to convert it to a series of tokens. Typically each token represents a single keyword (e.g. "while" or "for"), an identifier (i.e. a variable name), or a punctuation symbol such as ">" or "!=".
We next use a parser to convert the tokens into an abstract syntax tree (AST) in memory.
Now there are several possibilities:
- We may interpret the code in the AST directly. The interpreter we describe in these notes will do that.
- We may convert the AST into native code, i.e. machine language for a particular CPU architecture. A C or C++ compiler typically works this way.
- We may convert the AST into machine-independent byte code. We may then interpret the byte code directly; for example, the most common Python implementation (CPython) does this. Alternatively, we may use a just-in-time compiler (JIT) to convert the byte code into machine language just before executing it on the target machine; for example, C# and Java do this.

You may recognize the phases above: in Introduction to Algorithms we wrote code that can tokenize and parse expressions such as "(2 * (3 + 4))" into an expression tree, and evaluate them. Essentially that was an interpreter for a language of simple arithmetic expressions. We would now like to implement a language that is significantly more powerful.

the P2 language

Let's specify a programming language that we would like to interpret. We will call our language P2 (for "Programming 2"). We would like the language to be powerful enough that we can write e.g. a program that will print out all the prime numbers up to a given integer.

P2 will have the following features:

The types int (a 32-bit signed integer) and bool.
The arithmetic operators +, -, *, /, and %. The operators *, /, and % will have higher precedence than + and -, so 3 + 4 * 5 means 3 + (4 * 5), not (3 + 4) * 5.
The comparison operators ==, !=, <, <=, >, and >=.
Assignment statements, plus the if and while statements.
A print function that can print out any integer or boolean value.

The P2 language will not have character, string, or array types, and will not allow the user to define their own functions.

P2 will use C-like syntax with braces and semicolons, rather than Python-like syntax. That's because the C-like syntax is easier to parse; in particular, we will be able to ignore all whitespace.

Suppose that a P2 program contains the line "x = 4 + false;". How should we handle this? There are three possibilities:

We could automatically convert the Boolean value false into the integer 0, then perform the addition.
We could consider this to be an error, and detect it before the program starts to run. This is an example of static type checking as found in languages such as C, C++ and C#.
We could consider this to be an error, and detect it at run time, when we attempt to execute the line. This is an example of dynamic type checking as found in languages such as Python.

None of these are particularly difficult to implement. We will decide to implement dynamic type checking in our interpreter.

writing a lexer and parser

There are three approaches commonly used for writing a lexer and parser:

1. We can write lexing/parsing code by hand. We first write a grammar for our language, then write lexing and parsing code that follows the structure of the grammar. A parser written in this way is called a recursive-descent parser.

2. We can use lexer generator and parser generator tools. These tools take a grammar as input, and output code that implements a lexer or parser. There are lexer and parser generators for many programming languages: among the most commonly used are Lex, Yacc and Bison.

3. We can use a parser combinator library that lets us express the grammar by calling combinator functions inside our own program. The Parsec library for Haskell is widely used, and has been ported to many programming languages. Pidgin is a similar parser combinator library for C#.

In these notes we will use approach 1: we will write a hand-coded recursive-descent parser for P2. In Non-Procedural Programming next year we may study approach 3 (parser combinators).

implementation plan

To make things easier to understand, we will implement our P2 interpreter in 4 phases:

1. We will handle only expressions involving integers, correctly parsing operators with precedence.

2. We will handle expressions with integers and booleans.

3. We will add assignment statements and calls to the print() function.

4. We will add the if and while statements.

phase 1: integer expressions

In this phase we will handle expressions with integers. We wrote similar code in Introduction to Algorithms, however in that class we required expressions to be fully parenthesized (e.g. "((3 * 4) + (5 * 2))") and did not handle operator precedence. In our parser here we will allow parentheses but will not require them around all operations, and will handle precedence correctly.

grammar

We will first write a context-free grammar that precisely describes the structure of our language, just as we did in Introduction to Algorithms. A context-free grammar contains both terminals, which are characters such as '*' or '0', as well non-terminals, which are names such as "op" or "expr". The grammar consists of a set of rules that describe how each non-terminal may be constructed from other non-terminal or terminal symbols. Our presentation here will be relatively informal. You will study context-free grammars more formally in your automata class next year.

Here is a first attempt for a grammar for phase 1 of our implementation:

(* lexical *)

digit = '0' .. '9';

int = digit+;

(* expressions, first attempt *)

op = '+' | '-' | '*' | '/' | '%';

expr = int | expr op expr | '(' expr ')';

This grammar is written in EBNF (Extended Backus-Naur Form), which is a precise specification for writing context-free grammars. Note the following:

| separates alternative possibilities
.. means a range of characters
+ means "1 or more"; for example, "digit+" means one or more digits

If you are developing a grammar in EBNF, I recommend that you install the handy extension EBNF Language Support for Visual Studio Code. It will highlight your grammar with nice colors, and also check that its syntax is valid.

The grammar above describes valid P2 expressions, but there is a fundamental problem: it is ambiguous. For example, consider the text "3 + 4 * 5". It could be parsed like this:

3      +     4 * 5
expr   op    expr

Or like this:

3 + 4     *      5
expr      op     expr

In the first case, we will get an abstract syntax tree representing 3 + (4 * 5); in the second, our AST will represent (3 + 4) * 5.

To remove the ambiguity, we can rewrite the grammar using a different non-terminal symbol for each precedence level. Our next attempt looks like this:

(* expressions, second attempt *)

primary = int | '(' expr ')';
factor = primary | factor ('*' | '/' | '%') factor;
term = factor | term ('+' | '-') term;
expr = term;

We can read this as follows. A primary is an integer constant, or a parenthesized expression. A factor consists of one or more primaries; if there are more than one, they are combined using the '*', '/' or '%' operations. A term consists of one or more factors; if there are more than one, they are combined using the '+' or '-' operations. Note that an integer constant is itself a primary, a factor, a term and also an expression.

Now "3 + 4 * 5" can only be parsed like this:

3      +     4 * 5
term         term

It cannot be parsed like this, since 3 + 4 is not a factor:

3 + 4     *      5
???              factor

However there are still two problems with our grammar. First, it is still ambiguous within each precedence level. For example, 3 * 4 % 3 could be parsed like this:

3      *     4 % 3
factor       factor

Or like this:

3 * 4  %     3
factor       factor

A second problem is that our grammar is left-recursive, which means that it contains rules that expand a non-terminal to a sequence beginning with the same non-terminal. For example, we can expand term --> term '+' term or factor --> factor '*' factor. A recursive-descent parser will have one function for each non-terminal in the grammar. If a rule is left-recursive then the function for the corresponding non-terminal will begin by calling itself, which will cause an infinite recursion.

For this reason, if we want to write a recursive-descent parser we must refactor the grammar to eliminate the left recursion. We can do so like this:

(* expressions *)

primary = int | '(' expr ')';

factor = primary { ('*' | '/' | '%') primary };

term = factor { ('+' | '-') factor };

expr = term;

In an EBNF grammar a sequence in braces is repeated 0 or more times. For example, in the grammar above a factor consists of a primary, followed by 0 or more repetitions of '*', '/', or '%' plus another primary. A term consists of a factor, followed by 0 or more repetitions of '+' or '-' plus another factor. In this form the grammar is not left-recursive, and is unambiguous: there is only one way to parse each expression.

representing abstract syntax

Our goal is to write a parser that will product an abstract syntax tree (AST), so we must decide how to represent an AST in our code. We will use two record types: Const will represent an integer constant, and BinOp will represent a binary operation (e.g. '+' or '*') applied to two expressions. Both of these types will implement an interface Expr, representing any expression:

interface Expr { }

record Const(int i) : Expr;

record BinOp(Expr e, string op, Expr f) : Expr;

writing a lexer and parser

Recall that a lexer converts the input into a sequence of tokens. For a simple language such as P2 the easiest way to write a lexer is by using regular expressions, using strings to represent tokens. In this first development phase a token is either an integer (i.e. a sequence of digits), or any other character that is not whitespace. In a regular expression \d matches any digit and \S matches any non-whitespace character, so the regular expression \d+|\S will match any token of our language.

For simplicity we will write our lexer and parser in a single class called Parser. The Parser constructor will perform lexical analysis, i.e. will break the input into a series of tokens, which we will store in a List<string>. The Parser class will also have a field pos that represents the current index into the tokens. So our class begins like this:

class Parser {
    List<string> tokens = [];
    int pos;

    public Parser(string input) {
        foreach (Match m in Regex.Matches(input, @"\d+|\S"))
            tokens.Add(m.Value);
    }
...

Next we will write several methods that will be very useful for writing a recursive-descent parser:

next() fetches the next token and returns it.
peek() looks at the next token without advancing the token index.
match(string s) looks at the next token to see if it is s. If so, it advances the token index and returns true; otherwise it returns false.

Here is their implementation:

    string next() => tokens[pos++];

    string? peek() => pos < tokens.Count ? tokens[pos] : null;

    bool match(string s) {
        if (peek() == s) {
            pos += 1;
            return true;
        }
        return false;
    }

Now we will write the parser itself. As we mentioned above, each non-terminal in the grammar will become a function. The code will directly mirror the grammar: the function implementing a non-terminal T must explore all possible expansions for T in the grammar, calling functions that represent other non-terminals as necessary. For example, the grammar rule

primary = int | '(' expr ')';

becomes this function:

public Expr parse_primary() {
        string t = next();
        if (t == "(") {
            Expr e = parse_expr();
            assert(match(")"));
            return e;
        }
        return new Const(int.Parse(t));
    }

The grammar rule

factor = primary {('*' | '/' | '%') primary};

becomes this function:

public Expr parse_factor() {
        Expr e = parse_primary();
        while (peek() == "*" || peek() == "/" || peek() == "%")
            e = new BinOp(e, next(), parse_primary());
        return e;
    }

We can translate a grammar into a recursive-descent parser somewhat mechanically. In fact, as we mentioned above there are parser generator tools that can automatically generate parser code when given a grammar as input. However it is an excellent exercise to write a recursive-descent parser yourself, and in fact many real-world programming language implementations have hand-written parsers of this nature.

Here is the entire parser for P2 phase 1; it is about 55 lines long.

evaluating expressions

We must now write code to evaluate an expression represented by an abstract syntax tree. As we saw in Introduction to Algorithms, we can implement this straightforwardly using recursion. The Expr interface will declare an eval() method, which we will implement separately for each node type (Const or BinOp). Here is the code:

interface Expr {
    int eval();
}

record Const(int i) : Expr {
    public int eval() => i;
}

record BinOp(Expr e, string op, Expr f) : Expr {
    public int eval() {
        int x = e.eval(), y = f.eval();
        return op switch {
            "+" => x + y,
            "-" => x - y,
            "*" => x * y,
            "/" => x / y,
            "%" => x % y,
            _ => throw new Exception("bad operation")
        };
    }
}

Here is the entire interpreter for P2 phase 1 including a Main method (about 35 lines in all). It works with the parser code linked to above. Here are some expressions that we can now evaluate:

(4 + 12) * (3 - 4)
2 + 30 * 40 + 50 * 60

phase 2: expressions with integers and booleans

In this development phase we will extend expressions so that they may compute either integer or boolean values, and we will add the boolean comparison operators ==, >, and so on to our language.

In phase 1 the eval() method returned an integer, but now it must be able to return either an integer or boolean. So we will now define an interface Value that represents either of these types, with record types Bool and Int that implement Value. It will also be useful to have a method as_int() that converts a Value to a C# int, or throws an exception if the Value does not represent an integer. We will also write a similar method as_bool(), as well as a to_string() method that we will use for printing values. Here are our definitions:

interface Value {
    string to_string();
    bool as_bool() => throw new Exception("bool expected");
    int as_int() => throw new Exception("int expected");
}

record Bool(bool b) : Value {
    public bool as_bool() => b;
    public string to_string() => b.ToString();
}

record Int(int i) : Value {
    public int as_int() => i;
    public string to_string() => i.ToString();
}

In this phase we will make a couple of modest extensions to the grammar. First, we will allow the Boolean constants true and false to appear literally in a P2 program:

primary = 'false' | 'true' | int | '(' expr ')';

We will also add a new non-terminal that handles comparison operators:

comparison = term [('==' | '!=' | '<' | '<=' | '>' | '>=') term];

expr = comparison;

We must also extend the regular expression that recognizes tokens. Now any sequence of lowercase letters will be a single token, so that the Boolean constants false and true will be tokens. Also, a sequence of punctuation characters such as >= or <= must be a single token:

    public Parser(string input) {
        foreach (Match m in Regex.Matches(input, @"\d+|[a-z]+|[=!<>]+|\S"))
            tokens.Add(m.Value);
    }

In our abstract syntax we will modify the Const type so that it holds a Value (which can be an integer or boolean):

record Const(Value v) : Expr {
    public Value eval() => v;
}

Here is the complete parser for P2 phase 2 (about 65 lines of code).

In the BinOp class we will now enhance the eval() method to handle all of the operators that we now support. Because eval() will now return a Value, our code for each operator must convert each operand from a Value to an appropriate C# type, and must also wrap the result in an Int or Bool as appropriate. Here is the code:

record BinOp(Expr e, string op, Expr f) : Expr {
    public Value eval() {
        Value x = e.eval(), y = f.eval();
        return op switch {
            "+" => new Int(x.as_int() + y.as_int()),
            "-" => new Int(x.as_int() - y.as_int()),
            "*" => new Int(x.as_int() * y.as_int()),
            "/" => new Int(x.as_int() / y.as_int()),
            "%" => new Int(x.as_int() % y.as_int()),
            "==" => new Bool(x.Equals(y)),
            "!=" => new Bool(!x.Equals(y)),
            "<" => new Bool(x.as_int() < y.as_int()),
            "<=" => new Bool(x.as_int() <= y.as_int()),
            ">" => new Bool(x.as_int() > y.as_int()),
            ">=" => new Bool(x.as_int() >= y.as_int()),
            _ => throw new Exception("bad operation")
        };
    }
}

Here is the complete interpreter for P2 phase 2 including a Main method (about 55 lines of code). It works with the phase 2 parser linked to above. Here are some expressions that we can now evaluate:

4 * 5 > 5 * 6
4 < 5 + 1 == true
true == false

phase 3: assignment statements and variables

In this phase we will add statements to the P2 language. We will have two sorts of statements: assignment statements and print statements. Once this phase is complete, we will be able to run a program such as the following:

x = 2 + 3
y = x + 1
z = x * y
print(z)
z = z + 1
print(z + 10)

We will extend our grammar so that expressions can hold variables:

letter = 'a' .. 'z';

id = letter+;

primary = 'false' | 'true' | int | id | '(' expr ')';

In the abstract syntax we will now have a Var type representing a variable:

record Var(string s) : Expr;

We can now write the grammar for statements:

paren_expr = '(' expr ')';

print = 'print' paren_expr ';';

assignment = id '=' expr ';';

statement = print | assignment;

program = {statement};

In the abstract syntax we will now have an interface Stmt for any kind of statement, along with types Assign and Print for print and assignment statements. We will also have a type Block that holds a series of statements; a top-level program will be a Block.

interface Stmt { }

record Assign(string id, Expr e) : Stmt;

record Print(Expr e) : Stmt;

record Block(List<Stmt> stmts) : Stmt;

We can easily enhance our parser to handle statements. For example, we will add a parse_print() function that parses a print statement:

    public Stmt? parse_print() {
        if (match("print")) {
            Print p = new(parse_paren_expr());
            assert(match(";"));
            return p;
        }
        return null;
    }

Notice that this function returns a Stmt object if it can parse one, otherwise null. Our function parse_statement() for parsing any sort of statement looks like this:

public Stmt parse_statement() =>
        parse_print() ?? parse_assignment();

The ?? operator will return its left operand if it is not null, otherwise its right operand. It is handy for chaining together two or more alternatives in a parser.

Here is the complete parser for P2 phase 3 (about 100 lines of code).

We must now enhance our interpreter to handle statements. Because we have variables, we must now evaluate expressions and run statements in the context of an environment that holds the current value of each variable. We will represent an environment using a C# dictionary:

using Env = System.Collections.Generic.Dictionary<string, Value>;

The eval() method will now take an Env as an argument:

interface Expr {
    Value eval(Env env);
}

Evaluating a Var looks up its value in the environment:

record Var(string s) : Expr {
    public Value eval(Env env) => env[s];
}

The Stmt interface has a method run(Env env) that runs a statement in a given environment. Each type of statement has its own implementation of this method. For example, an Assign statement assigns a value to a variable in the environment:

record Assign(string id, Expr e) : Stmt {
    public void run(Env env) {
        env[id] = e.eval(env);
    }
}

A Block is a compound statement, so running a Block runs all the statements it contains:

record Block(List<Stmt> stmts) : Stmt {
    public void run(Env env) {
        foreach (Stmt stmt in stmts)
            stmt.run(env);
    }
}

Here is the complete interpreter for P2 phase 3 including a Main method (about 85 lines of code). It works with the phase 3 parser linked to above.

phase 4: if and while statements

In this phase we will complete our P2 implementation by adding the if and while statements. This will greatly increase the power of the language: with conditionals and loops we can write many sorts of program.

Actually this is the easiest of our four implementation phases! With all of the infrastructure that we built in the preceding 3 phases, our changes in this phase will be straightforward. Here are the grammar enhancements we need:

block = '{' {statement} '}';

if = 'if' paren_expr block;

while = 'while' paren_expr block;

statement = if | while | print | assignment;

We will extend our abstract syntax with types If and While for representing if and while statements:

record If(Expr e, Stmt stmt) : Stmt;

record While(Expr e, Stmt stmt) : Stmt;

We can enhance the recursive-descent parser to handle these statements in a straightforward way. Our parse_statement() function now looks like this:

    public Stmt parse_statement() =>
        parse_if() ?? parse_while() ?? parse_print() ?? parse_assignment();

Here is the complete parser for our final P2 implementation (about 110 lines of code).

Finally, we must implement the run() method for if and while statements. That is not difficult:

record If(Expr e, Stmt stmt) : Stmt {
    public void run(Env env) {
        if (e.eval(env).as_bool())
            stmt.run(env);
    }
}

record While(Expr e, Stmt stmt) : Stmt {
    public void run(Env env) {
        while (e.eval(env).as_bool())
            stmt.run(env);
    }
}

Notice that we implement P2's if statement using a C# if statement, and implement P2's while statement using a C# while statement! That works because these statements exist in both P2 and C# and are essentially identical in the two languages.

Here is the complete interpreter for our final P2 implementation including a Main method (about 100 lines of code). It works with the parser linked to above.

At last our implementation is complete. We can now run the following P2 program to print all prime numbers below 100:

n = 2;
while (n < 100) {
    i = 2;
    prime = true;
    while (i * i <= n) {
        x = n % i;
        if (n % i == 0) {
            prime = false;
        }
        i = i + 1;
    }
    if (prime) {
        print(n);
    }
    n = n + 1;
}

The output is

$ dotnet run primes.p2
2
3
5
7
11
...
83
89
97
$