Week 14: Notes

regular expressions

Regular expressions are a mini-language for matching text. The commonly used regular expression syntax originated in Unix systems in the 1970s, though regular expressions have a theoretical basis in finite automata which was studied well before that. They are very commonly used in text processing, and all major languages including C# have regular expression libraries.

You may have already encountered regular expressions in your UNIX class, since they are supported by command-line utilities such as 'sed'.

The syntax of regular expressions is mostly identical in various implementations of regular expressions, though some details can vary from implementation to implementation. The most basic elements of regular expressions are as follows:

.
Match any character except a newline.
[…]
Match any of a set of characters. Characters may be listed individually; for example, [ace] will match the characters 'a', 'c', or 'e'. A set may also contain character ranges; for example, [0-9a-dA-D] will match any decimal digit, the lowercase characters from 'a' to 'd', or the uppercase characters from 'A' to 'D'.
[^…]
Match any character that is not in the given set.
\d
Match any decimal digit.
\s
Match any whitespace character.
\w
Match any alphanumeric character, including underscores.
*
Match 0 or more repetitions of the preceding expression.
+
Match 1 or more repetitions of the preceding expression.
?
Match 0 or 1 repetitions of the preceding expression.
expr1 | expr2
Matches either expr1 or expr2.
(…)
Indicates the start and end of a group.
^
Anchor to the start of the source string.
$
Anchor to the end of the source string.

In addition, any character that is not listed above will match itself.

For example:

regular expressions in C#

Our quick library reference lists useful classes and methods in the System.Text.RegularExpressions namespace. The function Regex.IsMatch() returns true if a match for a pattern is found anywhere in an input string:

Regex.IsMatch("one two three 123 four five", @"\d\d\d")  // returns true

Notice that the string containing the regular expression above is preceded by a @ character. That indicates that it is a verbatim string literal, in which backslashes are normal characters. For example, \n in a verbatim literal indicates two characters: a backslash and an n. (In an ordinary C# string literal, this would indicate a newline character). I generally recommend that you write all regular expressions in C# using verbatim literals. In ordinary string literals you would need to write each backslash twice (\\) to prevent it from being interpreted as an escape character, which would make your regular expressions harder to read.

The function Regex.Match() looks for a match in an input string, and returns a Match object. If the match succeeds, the returned object's Success property will be true, and the Value property will contain the text that matched:

Match m = Regex.Match("one two three 123 four five", @"\d\d\d");
if (m.Success)
    WriteLine(m.Value);

The function Regex.Matches() looks for all matches of a pattern in an input string, and returns a MatchCollection object that lists all the matches. A MatchCollection is iterable, and also allows you to access individual matches by index:

MatchCollection matches =
    Regex.Matches("dog cat pig horse bear", @"[^aeiou ][aeiou][^aeiou ]");
WriteLine(matches[0].Value);    // writes "dog"
WriteLine(matches[1].Value);    // writes "cat"

If a regular expression contains one or more groups, then the text that matched each group is accessible through a Match object's Groups collection:

Match m = Regex.Match("name = fred, age = 30", @"name = (\w+), age = (\w+)");
if (m.Success) {
    WriteLine(m.Groups[1].Value);   // writes "fred"
    WriteLine(m.Groups[2].Value);   // write "30"
}

optimization and profiling

Usually it's best to write a program in the simplest possible way at first, then optimize it later if necessary. If you try to make all parts of your program as fast as possible when you first write it, you may end up doing a lot of unnecessary work and making the program unnecessarily complicated. For this reason, the famous computer scientist Donald Knuth once said that "premature optimization is the root of all evil".

If you do need to optimize the code to make it faster, you should be aware that in most programs a small part of the code is the performance bottleneck. Optimizing other part of the code will probably have no discernable impact on performance. So if you want to make your program faster, you need to find out where that bottleneck is.

One quick and easy technique for finding your program's bottleneck is to press Ctrl+C in a debugger, then examine the call stack. It is likely (though not guaranteed) that at the moment you pressed Ctrl+C, execution was in the region of your code that is executed most frequently, i.e. the bottleneck.

Profiling code means analying its performance to find out which regions of code are using the most time and/or memory. You can manually profile your code by adding timing code that measures the execution time of functions or loops. This usually isn't too hard, and sometimes is the most flexible approach.

Alternatively, various tools exist that can profile your program, showing how much time is spent in each function or method. Some tools are based on sampling: they periodically interrupt your program to see where the call stack is at that moment. Other tools will instrument your code by adding instructions that measure timing.

sysprof is a useful system-wide sampling profiler for Linux systems.