Regular expressions are a mini-language for matching text. The commonly used regular expression syntax originated in Unix systems in the 1970s, though regular expressions have a theoretical basis in finite automata, which were studied well before that. They are widely used in text processing, and all major languages, including C#, have regular expression libraries.
You may have already encountered regular expressions in your UNIX class, since they are supported by command-line utilities such as 'sed'.
The syntax of regular expressions is mostly the same across implementations, though some details vary. The most basic elements of regular expressions are as follows:
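.
matches any character.
[…]
matches any character in a set of characters. For example, [ace] will match the characters 'a', 'c', or 'e'. A set may also contain character ranges; for example, [0-9a-dA-D] will match any decimal digit, the lowercase characters from 'a' to 'd', or the uppercase characters from 'A' to 'D'.
[^…]
matches any character that is not in the given set.
\d
matches any decimal digit.
\w
matches any word character: a letter, a digit, or an underscore.
expr?
matches expr optionally, i.e. zero or one times.
expr1|expr2
matches either expr1 or expr2.
^
matches at the beginning of the input.
$
matches at the end of the input.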
In addition, any character that is not listed above will match itself.
For example:
\d\d\d-\d\d\d-\d\d\d\d
will match a US telephone number such as 228-334-9292.
\d+(\.\d+)?
will match a decimal number with an optional fractional part, such as 123 or 225.682. Notice that we must write \. to match only a decimal point, since . will normally match any character.
[a-z]+\s+[a-z]+\s+[a-z]+
will match three words, separated by whitespace.
\d\d-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d\d\d\d
will match a date such as 22-May-2021.
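The patterns above can be tried out directly in C#. The following is a minimal sketch, not part of the original lecture code; it relies on Regex.IsMatch() from System.Text.RegularExpressions, which is described in the next paragraph, and the input strings are made up for illustration. It also shows how the anchors ^ and $ change a "matches somewhere" test into a "whole input matches" test.

```csharp
using System.Text.RegularExpressions;
using static System.Console;

// Patterns copied from the examples above.
// In \d+ the + means "one or more"; \s matches a whitespace character.
string phone = @"\d\d\d-\d\d\d-\d\d\d\d";
string number = @"\d+(\.\d+)?";
string words = @"[a-z]+\s+[a-z]+\s+[a-z]+";
string date = @"\d\d-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d\d\d\d";

WriteLine(Regex.IsMatch("call 228-334-9292 today", phone));  // True
WriteLine(Regex.IsMatch("one   two three", words));          // True
WriteLine(Regex.IsMatch("22-May-2021", date));               // True
WriteLine(Regex.IsMatch("22-Foo-2021", date));               // False

// Without anchors, it is enough for some part of the input to match.
WriteLine(Regex.IsMatch("abc 123", number));                 // True ("123" matches)

// With ^ and $ the entire input must be a number.
string wholeNumber = "^" + number + "$";
WriteLine(Regex.IsMatch("abc 123", wholeNumber));            // False
WriteLine(Regex.IsMatch("225.682", wholeNumber));            // True
```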
Our quick library reference lists useful classes and methods in the System.Text.RegularExpressions namespace. The function Regex.IsMatch() returns true if a match for a pattern is found anywhere in an input string:
Regex.IsMatch("one two three 123 four five", @"\d\d\d") // returns true
Notice that the string containing the regular expression above is preceded by a @ character. That indicates that it is a verbatim string literal, in which backslashes are normal characters. For example, \n in a verbatim literal indicates two characters: a backslash and an n. (In an ordinary C# string literal, this would indicate a newline character.) I generally recommend that you write all regular expressions in C# using verbatim literals. In ordinary string literals you would need to write each backslash twice (\\) to prevent it from being interpreted as an escape character, which would make your regular expressions harder to read.
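To see the difference concretely, here is a small sketch (the variable names and the sample pattern are only for illustration) that writes the same pattern both ways and compares the resulting strings:

```csharp
using System.Text.RegularExpressions;
using static System.Console;

string verbatim = @"\d+\.\d+";     // verbatim literal: backslashes are ordinary characters
string ordinary = "\\d+\\.\\d+";   // ordinary literal: each backslash must be doubled

WriteLine(verbatim == ordinary);   // True: both hold the same 8 characters
WriteLine(verbatim.Length);        // 8

WriteLine(@"\n".Length);           // 2: a backslash and an n
WriteLine("\n".Length);            // 1: a single newline character

WriteLine(Regex.IsMatch("pi is 3.14", verbatim));   // True
```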
The function Regex.Match() looks for a match in an input string, and returns a Match object. If the match succeeds, the returned object's Success property will be true, and the Value property will contain the text that matched:
Match m = Regex.Match("one two three 123 four five", @"\d\d\d");
if (m.Success)
    WriteLine(m.Value);    // writes "123"
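When no match is found, Regex.Match() still returns a Match object: its Success property is false and its Value is the empty string. The sketch below illustrates this, along with the Index property, which holds the position where the match begins. FirstNumber() is a hypothetical helper written for this example only.

```csharp
using System.Text.RegularExpressions;
using static System.Console;

// Hypothetical helper: return the first run of digits in s, or null if there is none.
static string? FirstNumber(string s) {
    Match found = Regex.Match(s, @"\d+");
    return found.Success ? found.Value : null;
}

Match hit = Regex.Match("one two three 123 four five", @"\d+");
WriteLine(hit.Success);    // True
WriteLine(hit.Value);      // 123
WriteLine(hit.Index);      // 14: the match begins at character 14

Match miss = Regex.Match("no digits here", @"\d+");
WriteLine(miss.Success);       // False
WriteLine(miss.Value == "");   // True: a failed match has an empty Value

WriteLine(FirstNumber("room 42") ?? "(none)");   // 42
WriteLine(FirstNumber("no room") ?? "(none)");   // (none)
```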
The function Regex.Matches() looks for all matches of a pattern in an input string, and returns a MatchCollection object that lists all the matches. A MatchCollection is iterable, and also allows you to access individual matches by index:
MatchCollection matches =
    Regex.Matches("dog cat pig horse bear", @"[^aeiou ][aeiou][^aeiou ]");
WriteLine(matches[0].Value);    // writes "dog"
WriteLine(matches[1].Value);    // writes "cat"
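Since a MatchCollection is iterable, a foreach loop can visit every match in order. Here is a brief sketch using a simpler pattern than the one above (the pattern [a-z]+ and the output format are chosen only for illustration):

```csharp
using System.Text.RegularExpressions;
using static System.Console;

// Find every run of lowercase letters, i.e. every word in this input.
MatchCollection found = Regex.Matches("dog cat pig horse bear", @"[a-z]+");

WriteLine(found.Count);    // 5

// Matches are visited from left to right.
foreach (Match w in found)
    WriteLine($"{w.Index}: {w.Value}");
// 0: dog
// 4: cat
// 8: pig
// 12: horse
// 18: bear
```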
If a regular expression contains one or more groups, then the text that matched each group is accessible through a Match object's Groups collection:
Match m = Regex.Match("name = fred, age = 30", @"name = (\w+), age = (\w+)");
if (m.Success) {
    WriteLine(m.Groups[1].Value);    // writes "fred"
    WriteLine(m.Groups[2].Value);    // writes "30"
}
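As a further sketch, groups can pull the individual fields out of a date in the format shown earlier. Two details are worth knowing: Groups[0] always holds the entire match, and an optional group that did not participate in the match has Success equal to false and an empty Value. The input strings here are invented for the example.

```csharp
using System.Text.RegularExpressions;
using static System.Console;

string datePattern =
    @"(\d\d)-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-(\d\d\d\d)";

Match d = Regex.Match("released on 22-May-2021", datePattern);
WriteLine(d.Groups[0].Value);    // 22-May-2021 (group 0 is the whole match)

int day = int.Parse(d.Groups[1].Value);
string month = d.Groups[2].Value;
int year = int.Parse(d.Groups[3].Value);
WriteLine($"{day} / {month} / {year}");    // 22 / May / 2021

// The second group below is optional, and "123" has no fractional part.
Match n = Regex.Match("123", @"^(\d+)(\.\d+)?$");
WriteLine(n.Groups[1].Value);            // 123
WriteLine(n.Groups[2].Success);          // False
WriteLine(n.Groups[2].Value == "");      // True
```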
Sometimes we write classes whose only purpose is to hold data. For example, consider a class that holds a geometric circle with a center and radius. We might define it like this:
class Circle {
    public readonly double x, y;
    public readonly double radius;

    public Circle(double x, double y, double radius) {
        (this.x, this.y) = (x, y);
        this.radius = radius;
    }

    public override bool Equals(object? o) =>
        o is Circle d && (x, y, radius) == (d.x, d.y, d.radius);

    public override int GetHashCode() => (x, y, radius).GetHashCode();

    public override string ToString() =>
        $"Circle: x = {x}, y = {y}, radius = {radius}";
}
(Above, the readonly modifier indicates that a field may be set only in a constructor, and is immutable after that.)
This is a lot of code to type to define a simple data class. Fortunately C# offers a more compact alternative, namely records. Instead of the definition above, we may define our circle class using a record:
record Circle(double x, double y, double radius);
That was a lot less typing! :) When you define a record, C# synthesizes a class that is similar to the class above. In particular:
The class is immutable: fields cannot be changed after an instance is initialized.
The class supports comparison by structural equality: two instances are equal if they have the same members (as in the definition of Equals() above).
The class has a GetHashCode() method so that its instances can be used as a hash key.
The class has a ToString() method that builds a string including all field values.
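The following sketch exercises each of these properties on the one-line Circle record. The sample values are arbitrary, and whole numbers are used so that the printed text does not depend on the machine's number formatting.

```csharp
using System.Collections.Generic;
using static System.Console;

Circle a = new(1, 2, 3);
Circle b = new(1, 2, 3);

// Structural equality: a and b are distinct objects with equal members.
WriteLine(a == b);                         // True
WriteLine(a.Equals(b));                    // True
WriteLine(object.ReferenceEquals(a, b));   // False

// Equal records have equal hash codes, so a set sees them as one element.
HashSet<Circle> circles = new() { a, b };
WriteLine(circles.Count);                  // 1

// The synthesized ToString() lists all members.
WriteLine(a);                              // Circle { x = 1, y = 2, radius = 3 }

// Immutability: the following line would not compile.
// a.radius = 5;

record Circle(double x, double y, double radius);
```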
A record may have methods or other members such as properties and indexers. For example:
using static System.Math;

record Circle(double x, double y, double radius) {
    public double circumference() => 2 * PI * radius;
    public double area() => PI * radius * radius;
}
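Here is a short sketch of how such members might be used. It defines its own copy of the record so that it is self-contained, computes the area as PI * radius * radius, and adds a diameter property that is hypothetical and included only to show that properties work the same way as methods.

```csharp
using static System.Console;
using static System.Math;

Circle unit = new(0, 0, 1);

WriteLine(unit.circumference());   // 2π, about 6.28
WriteLine(unit.area());            // π, about 3.14
WriteLine(unit.diameter);          // 2

record Circle(double x, double y, double radius) {
    public double circumference() => 2 * PI * radius;
    public double area() => PI * radius * radius;

    // Hypothetical computed property, added for illustration.
    public double diameter => 2 * radius;
}
```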
As an extended exercise, in the lecture we wrote C# code to extract data about popular movies from the IMDb website and plot it using the ScottPlot library.
Specifically, we visited the IMDb Top 250 Movies page in a browser, typed Ctrl+A to select the entire text of the page, typed Ctrl+C to copy the text, then saved the text to a file called top_movies. After that, we wrote the following program that uses regular expressions to extract data from the text, then produces several plots using ScottPlot.
using static System.Console;
using System.Text.RegularExpressions;
using ScottPlot;
using ScottPlot.Statistics;

record Movie(string name, int year, int length, double score);

class Prog {
    // Read the saved page text and build a list of movies.
    static List<Movie> readMovies() {
        using StreamReader sr = new("top_movies");
        List<Movie> movies = [];

        while (sr.ReadLine() is string line) {
            // A movie entry starts with a line holding a rank and a title.
            Match m = Regex.Match(line, @"^\d+\. (.*)");
            if (m.Success) {
                string name = m.Groups[1].Value;

                // The next line holds the year, then an optional running time.
                string line2 = sr.ReadLine()!;
                Match m2 = Regex.Match(line2, @"^(\d\d\d\d)((\d)h)? ?((\d\d?)m)?");
                int year = int.Parse(m2.Groups[1].Value);

                int length = 0;    // running time in minutes
                string hours = m2.Groups[3].Value;
                if (hours != "")
                    length += 60 * int.Parse(hours);
                string minutes = m2.Groups[5].Value;
                if (minutes != "")
                    length += int.Parse(minutes);

                // The line after that begins with the score.
                string line3 = sr.ReadLine()!;
                Match m3 = Regex.Match(line3, @"^(\d\.\d)");
                double score = double.Parse(m3.Groups[1].Value);

                movies.Add(new Movie(name, year, length, score));
            }
        }

        WriteLine($"parsed {movies.Count} movies");
        return movies;
    }

    // Plot g(movie) against f(movie) for all movies, with a regression line.
    static void plot(List<Movie> movies,
                     Func<Movie, double> f, string fName,
                     Func<Movie, double> g, string gName) {
        var xs = movies.Select(f).ToArray();
        var ys = movies.Select(g).ToArray();

        Plot plot = new();
        plot.Add.ScatterPoints(xs, ys);
        plot.XLabel(fName);
        plot.YLabel(gName);

        // Draw the regression line between the smallest and largest x values.
        LinearRegression reg = new(xs, ys);
        double minX = xs.Min(), maxX = xs.Max();
        Coordinates p1 = new(minX, reg.GetValue(minX)),
                    p2 = new(maxX, reg.GetValue(maxX));
        var line = plot.Add.Line(p1, p2);
        line.LegendText = reg.FormulaWithRSquared;
        plot.ShowLegend();

        plot.SaveSvg($"{fName}_{gName}.svg", 1000, 800);
    }

    static void Main() {
        List<Movie> movies = readMovies();

        plot(movies, m => m.year, "year", m => m.score, "score");
        plot(movies, m => m.year, "year", m => m.length, "length");
        plot(movies, m => m.length, "length", m => m.score, "score");
    }
}
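The plot() method above delegates line fitting to ScottPlot. As a rough illustration of what a least-squares line fit computes, here is a hand-written sketch. It is not ScottPlot's implementation; FitLine() is a hypothetical helper, and the sample points are invented so that the answer is easy to check by hand.

```csharp
using System.Linq;
using static System.Console;

// Hypothetical helper: fit y = slope * x + intercept by ordinary least squares,
// i.e. choose the line that minimizes the sum of squared vertical errors.
static (double slope, double intercept) FitLine(double[] xs, double[] ys) {
    double meanX = xs.Average(), meanY = ys.Average();
    double sxy = 0, sxx = 0;
    for (int i = 0; i < xs.Length; ++i) {
        sxy += (xs[i] - meanX) * (ys[i] - meanY);
        sxx += (xs[i] - meanX) * (xs[i] - meanX);
    }
    double k = sxy / sxx;              // slope of the best-fit line
    return (k, meanY - k * meanX);     // the line passes through (meanX, meanY)
}

// Sample points that lie exactly on the line y = 2x + 1.
double[] sampleX = { 1, 2, 3, 4 };
double[] sampleY = { 3, 5, 7, 9 };

var (slope, intercept) = FitLine(sampleX, sampleY);
WriteLine($"slope = {slope}, intercept = {intercept}");   // slope = 2, intercept = 1
```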
For each plot, the program uses ScottPlot's LinearRegression class to compute a regression line that fits the data as closely as possible, minimizing the sum of the squared errors over all points. That helps us see the strength of the relationship between the variables on the plot. Here is one of the plots that the program produces, showing that longer films in this dataset tend to be rated more highly: