Week 14: Notes

regular expressions

Regular expressions are a mini-language for matching text. The commonly used regular expression syntax originated in Unix systems in the 1970s, though regular expressions have a theoretical basis in the theory of finite automata, which was studied well before that. They are widely used in text processing, and all major languages including C# have regular expression libraries.

You may have already encountered regular expressions in your UNIX class, since they are supported by command-line utilities such as 'sed'.

The syntax of regular expressions is mostly the same across implementations, though some details vary. The most basic elements of regular expressions are as follows:

.
Match any character except a newline.
[…]
Match any of a set of characters. Characters may be listed individually; for example, [ace] will match the characters 'a', 'c', or 'e'. A set may also contain character ranges; for example, [0-9a-dA-D] will match any decimal digit, the lowercase characters from 'a' to 'd', or the uppercase characters from 'A' to 'D'.
[^…]
Match any character that is not in the given set.
\d
Match any decimal digit.
\s
Match any whitespace character.
\w
Match any alphanumeric character or an underscore.
*
Match 0 or more repetitions of the preceding expression.
+
Match 1 or more repetitions of the preceding expression.
?
Match 0 or 1 repetitions of the preceding expression.
expr1 | expr2
Match either expr1 or expr2.
(…)
Group the enclosed expression so that it is treated as a unit. The text that a group matched can be retrieved afterwards.
^
Anchor to the start of the source string.
$
Anchor to the end of the source string.

In addition, any character that is not listed above will match itself.

For example:
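A minimal sketch of how these elements combine, using .NET's Regex class (introduced in the next section). The helper Whole() is hypothetical, not a library function: it wraps a pattern in ^(…)$ so that the pattern must match the entire input. The patterns and inputs are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

static class SyntaxDemo {
    // Hypothetical helper: true if 'pattern' matches the whole of 'input'.
    // The parentheses keep an alternation such as cat|dog inside the anchors.
    public static bool Whole(string input, string pattern) =>
        Regex.IsMatch(input, "^(" + pattern + ")$");

    static void Main() {
        Console.WriteLine(Whole("cat", @"c.t"));          // True: . matches 'a'
        Console.WriteLine(Whole("ct", @"c.t"));           // False: . needs one character
        Console.WriteLine(Whole("b4", @"[a-d][0-9]"));    // True
        Console.WriteLine(Whole("x", @"[^0-9]"));         // True: 'x' is not a digit
        Console.WriteLine(Whole("2024", @"\d+"));         // True
        Console.WriteLine(Whole("", @"\d*"));             // True: * allows zero repetitions
        Console.WriteLine(Whole("color", @"colou?r"));    // True: the 'u' is optional
        Console.WriteLine(Whole("dog", @"cat|dog"));      // True
        Console.WriteLine(Whole("ababab", @"(ab)+"));     // True: + applies to the group
        Console.WriteLine(Whole("my_var1", @"\w+"));      // True
        Console.WriteLine(Whole("my var", @"\w+"));       // False: space is not matched by \w
    }
}
```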

regular expressions in C#

Our quick library reference lists useful classes and methods in the System.Text.RegularExpressions namespace. The function Regex.IsMatch() returns true if a match for a pattern is found anywhere in an input string:

Regex.IsMatch("one two three 123 four five", @"\d\d\d")  // returns true

Notice that the string containing the regular expression above is preceded by a @ character. That indicates that it is a verbatim string literal, in which backslashes are normal characters. For example, \n in a verbatim literal indicates two characters: a backslash and an n. (In an ordinary C# string literal, this would indicate a newline character). I generally recommend that you write all regular expressions in C# using verbatim literals. In ordinary string literals you would need to write each backslash twice (\\) to prevent it from being interpreted as an escape character, which would make your regular expressions harder to read.
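To make the difference concrete, here is a small sketch comparing a verbatim literal with the equivalent ordinary literal. The class and constant names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class VerbatimDemo {
    public const string Verbatim = @"\d\d\d";     // backslashes are ordinary characters
    public const string Escaped = "\\d\\d\\d";    // each backslash must be doubled

    static void Main() {
        Console.WriteLine(Verbatim == Escaped);                  // True: the same 6 characters
        Console.WriteLine(Verbatim.Length);                      // 6
        Console.WriteLine(@"\n".Length);                         // 2: a backslash and an 'n'
        Console.WriteLine("\n".Length);                          // 1: a single newline character
        Console.WriteLine(Regex.IsMatch("room 101", Verbatim));  // True
    }
}
```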

The function Regex.Match() looks for a match in an input string, and returns a Match object. If the match succeeds, the returned object's Success property will be true, and the Value property will contain the text that matched:

Match m = Regex.Match("one two three 123 four five", @"\d\d\d");
if (m.Success)
    WriteLine(m.Value);

The function Regex.Matches() looks for all matches of a pattern in an input string, and returns a MatchCollection object that lists all the matches. A MatchCollection is iterable, and also allows you to access individual matches by index:

MatchCollection matches =
    Regex.Matches("dog cat pig horse bear", @"[^aeiou ][aeiou][^aeiou ]");
WriteLine(matches[0].Value);    // writes "dog"
WriteLine(matches[1].Value);    // writes "cat"
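Because a MatchCollection is iterable, a foreach loop can visit every match. A minimal sketch using the same pattern as above; the helper name AllMatches is made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class MatchesDemo {
    // Hypothetical helper: collect the text of every match of 'pattern' in 'input'.
    public static List<string> AllMatches(string input, string pattern) {
        List<string> found = new();
        foreach (Match m in Regex.Matches(input, pattern))
            found.Add(m.Value);
        return found;
    }

    static void Main() {
        List<string> found =
            AllMatches("dog cat pig horse bear", @"[^aeiou ][aeiou][^aeiou ]");
        Console.WriteLine(found.Count);                 // 4
        Console.WriteLine(string.Join(", ", found));    // dog, cat, pig, hor
    }
}
```

Note that matches are found left to right and do not overlap: "horse" contributes only "hor", and "bear" contributes nothing because its two vowels are adjacent.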

If a regular expression contains one or more groups, then the text that matched each group is accessible through a Match object's Groups collection:

Match m = Regex.Match("name = fred, age = 30", @"name = (\w+), age = (\w+)");
if (m.Success) {
    WriteLine(m.Groups[1].Value);   // writes "fred"
    WriteLine(m.Groups[2].Value);   // writes "30"
}
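Groups are numbered by the position of their opening parenthesis, and Groups[0] always holds the entire match. When an optional group takes no part in a match, its Success property is false and its Value is the empty string. The sketch below relies on that; the method Minutes() and the duration format it accepts are assumptions made for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class GroupsDemo {
    // Hypothetical helper: convert "2h 22m", "2h" or "45m" to a number of minutes.
    // Returns -1 if the input does not have that shape.
    public static int Minutes(string s) {
        // Group 1 = "2h", group 2 = "2", group 3 = "22m", group 4 = "22".
        Match m = Regex.Match(s, @"^((\d+)h)? ?((\d+)m)?$");
        if (!m.Success || m.Length == 0)    // every part is optional, so reject an empty match
            return -1;

        int total = 0;
        if (m.Groups[2].Success)            // the hours group took part in the match
            total += 60 * int.Parse(m.Groups[2].Value);
        if (m.Groups[4].Success)            // the minutes group took part in the match
            total += int.Parse(m.Groups[4].Value);
        return total;
    }

    static void Main() {
        Console.WriteLine(Minutes("2h 22m"));   // 142
        Console.WriteLine(Minutes("2h"));       // 120
        Console.WriteLine(Minutes("45m"));      // 45
        Console.WriteLine(Minutes("soon"));     // -1
    }
}
```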

records

Sometimes we write classes whose only purpose is to hold data. For example, consider a class that holds a geometric circle with a center and radius. We might define it like this:

class Circle {
    public readonly double x, y;
    public readonly double radius;

    public Circle(double x, double y, double radius) {
        (this.x, this.y) = (x, y);
        this.radius = radius;
    }

    public override bool Equals(object? o) =>
        o is Circle d && (x, y, radius) == (d.x, d.y, d.radius);

    public override int GetHashCode() => (x, y, radius).GetHashCode();

    public override String ToString() =>
        $"Circle: x = {x}, y = {y}, radius = {radius}";
}

(Above, the readonly modifier indicates that a field may be set only in its declaration or in a constructor, and is immutable after that.)

This is a lot of code to type to define a simple data class. Fortunately C# offers a more compact alternative, namely records. Instead of the definition above, we may define our circle class using a record:

record Circle(double x, double y, double radius);

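That was a lot less typing! :) When you define a record, C# synthesizes a class that is similar to the class above. In particular, the synthesized class has a constructor that takes the listed values, exposes those values as immutable members, and defines Equals(), GetHashCode() and ToString() in terms of them.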
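The synthesized members can be observed directly. A small sketch; the class RecordDemo is made up for illustration, and the output format in the last comment is what the compiler-generated ToString() produces for a positional record.

```csharp
using System;

record Circle(double x, double y, double radius);

static class RecordDemo {
    static void Main() {
        Circle a = new(1, 2, 3), b = new(1, 2, 3);

        Console.WriteLine(a == b);                               // True: records compare by value
        Console.WriteLine(a.Equals(b));                          // True
        Console.WriteLine(object.ReferenceEquals(a, b));         // False: two distinct objects
        Console.WriteLine(a.GetHashCode() == b.GetHashCode());   // True
        Console.WriteLine(a);           // Circle { x = 1, y = 2, radius = 3 }

        // a.radius = 5;                // would not compile: the members are immutable
    }
}
```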

A record may have methods or other members such as properties and indexers. For example:

// Assumes "using static System.Math;" so that PI is in scope.
record Circle(double x, double y, double radius) {
    public double circumference() => 2 * PI * radius;

    public double area() => PI * radius * radius;
}

extracting data from text
plotting
least-squares regression

As an extended exercise, in the lecture we wrote C# code to extract data about popular movies from the IMDb web site and plot it using the ScottPlot library.

Specifically, we visited the IMDb Top 250 Movies page in a browser, typed Ctrl+A to select the entire text of the page, typed Ctrl+C to copy the text, then saved the text to a file called top_movies. After that, we wrote the following program that uses regular expressions to extract data from the text, then produces several plots using ScottPlot.

using static System.Console;
using System.Text.RegularExpressions;

using ScottPlot;
using ScottPlot.Statistics;

record Movie(string name, int year, int length, double score);

class Prog {
    static List<Movie> readMovies() {
        using StreamReader sr = new("top_movies");
        List<Movie> movies = [];

        while (sr.ReadLine() is string line) {
            Match m = Regex.Match(line, @"^\d+\. (.*)");
            if (m.Success) {
                string name = m.Groups[1].Value;

                string line2 = sr.ReadLine()!;
                Match m2 = Regex.Match(line2, @"^(\d\d\d\d)((\d)h)? ?((\d\d?)m)?");
                int year = int.Parse(m2.Groups[1].Value);

                int length = 0;
                string hours = m2.Groups[3].Value;
                if (hours != "")
                    length += 60 * int.Parse(hours);
                string minutes = m2.Groups[5].Value;
                if (minutes != "")
                    length += int.Parse(minutes);

                string line3 = sr.ReadLine()!;
                Match m3 = Regex.Match(line3, @"^(\d\.\d)");
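                // Parse with the invariant culture, so that '.' is the decimal
                // separator regardless of the system locale.
                double score = double.Parse(m3.Groups[1].Value,
                    System.Globalization.CultureInfo.InvariantCulture);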

                movies.Add(new Movie(name, year, length, score));
            }
        }
        WriteLine($"parsed {movies.Count} movies");
        return movies;
    }

    static void plot(List<Movie> movies, Func<Movie, double> f, string fName,
                                         Func<Movie, double> g, string gName) {
        var xs = movies.Select(f).ToArray();
        var ys = movies.Select(g).ToArray();

        Plot plot = new();
        plot.Add.ScatterPoints(xs, ys);
        plot.XLabel(fName);
        plot.YLabel(gName);

        LinearRegression reg = new(xs, ys);
        double minX = xs.Min(), maxX = xs.Max();
        Coordinates p1 = new(minX, reg.GetValue(minX)), p2 = new(maxX, reg.GetValue(maxX));
        var line = plot.Add.Line(p1, p2);
        line.LegendText = reg.FormulaWithRSquared;
        plot.ShowLegend();

        plot.SaveSvg($"{fName}_{gName}.svg", 1000, 800);
    }

    static void Main() {
        List<Movie> movies = readMovies();
        plot(movies, m => m.year, "year", m => m.score, "score");
        plot(movies, m => m.year, "year", m => m.length, "length");
        plot(movies, m => m.length, "length", m => m.score, "score");
    }
}
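The regression line drawn by plot() is a least-squares fit. As a rough illustration of what such a fit computes, here is a sketch of the textbook closed-form solution for a line y = slope * x + intercept. This is not ScottPlot's code; the class LeastSquares and its method Fit() are made up for this example.

```csharp
using System;
using System.Linq;

static class LeastSquares {
    // Return the slope and intercept of the line that minimizes the sum of
    // squared vertical distances to the points (xs[i], ys[i]).
    // If all xs are equal, the slope is undefined and the result is not finite.
    public static (double slope, double intercept) Fit(double[] xs, double[] ys) {
        if (xs.Length != ys.Length || xs.Length < 2)
            throw new ArgumentException("need at least two points, with equally many xs and ys");

        double meanX = xs.Average(), meanY = ys.Average();
        double sxy = 0, sxx = 0;
        for (int i = 0; i < xs.Length; ++i) {
            sxy += (xs[i] - meanX) * (ys[i] - meanY);
            sxx += (xs[i] - meanX) * (xs[i] - meanX);
        }

        double slope = sxy / sxx;
        return (slope, meanY - slope * meanX);   // the line passes through (meanX, meanY)
    }

    static void Main() {
        // These points lie exactly on y = 2x + 1.
        var (slope, intercept) =
            Fit(new double[] { 1, 2, 3, 4 }, new double[] { 3, 5, 7, 9 });
        Console.WriteLine($"y = {slope} * x + {intercept}");    // y = 2 * x + 1
    }
}
```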

For each plot, the program uses ScottPlot's LinearRegression class to compute a regression line that fits the data as closely as possible, in the sense that it minimizes the sum of the squared vertical distances between the points and the line. That helps us see the strength of the relationship between the variables on the plot. Here is one of the plots that the program produces, showing that longer films in this dataset tend to be rated more highly: