Regular expressions are a mini-language for matching text. The commonly used regular expression syntax originated in Unix systems in the 1970s, though regular expressions have a theoretical basis in finite automata which was studied well before that. They are very commonly used in text processing, and all major languages including C# have regular expression libraries.
You may have already encountered regular expressions in your UNIX class, since they are supported by command-line utilities such as 'sed'.
The syntax of regular expressions is mostly identical in various implementations of regular expressions, though some details can vary from implementation to implementation. The most basic elements of regular expressions are as follows:
.
[…]
[ace]
will
match the characters 'a', 'c', or 'e'. A set may also contain
character ranges; for example, [0-9a-dA-D]
will match any decimal digit, the lowercase characters from 'a' to
'd', or the uppercase characters from 'A' to 'D'.[^…]
\d
\w
?
|
expr2^
$
In addition, any character that is not listed above will match itself.
For example:
\d\d\d-\d\d\d-\d\d\d\d
will match a US telephone number such as 228-334-9292
.
\d+(\.\d+)?
will
match a decimal number with an optional fractional part, such as 123
or 225.682. Notice that we must write \.
to match only a decimal point, since .
will normally match any character.
[a-z]+\s+[a-z]+\s+[a-z]+
will match three words, separated by whitespace.
\d\d-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d\d\d\d
will match a date such as 22-May-2021.
Our quick
library reference lists useful classes and methods in the
System.Text.RegularExpressions
namespace. The function Regex.IsMatch()
returns true if a match for a pattern is found anywhere in an input
string:
Regex.IsMatch("one two three 123 four five", @"\d\d\d") // returns true
Notice that the string containing the regular expression above is
preceded by a @
character. That
indicates that it is a verbatim string literal, in which
backslashes are normal characters. For example, \n
in a verbatim literal indicates two characters: a backslash and an n.
(In an ordinary C# string literal, this would indicate a newline
character). I generally recommend that you write all regular
expressions in C# using verbatim literals. In ordinary string
literals you would need to write each backslash twice (\\
)
to prevent it from being interpreted as an escape character, which
would make your regular expressions harder to read.
The function Regex.Match()
looks for a match in an input string, and returns a Match
object. If the match succeeds, the returned object's Success
property will be true, and the Value
property will contain the text that matched:
Match m = Regex.Match("one two three 123 four five", @"\d\d\d"); if (m.Success) WriteLine(m.Value);
The function Regex.Matches()
looks for
all matches of a pattern in an input string, and returns a
MatchCollection
object that lists all
the matches. A MatchCollection
is
iterable, and also allows you to access individual matches by index:
MatchCollection matches = Regex.Matches("dog cat pig horse bear", @"[^aeiou ][aeiou][^aeiou ]"); WriteLine(matches[0].Value); // writes "dog" WriteLine(matches[1].Value); // writes "cat"
If a regular expression contains one or more groups, then the text
that matched each group is accessible through a Match
object's Groups
collection:
Match m = Regex.Match("name = fred, age = 30", @"name = (\w+), age = (\w+)"); if (m.Success) { WriteLine(m.Groups[1].Value); // writes "fred" WriteLine(m.Groups[2].Value); // write "30" }
Usually it's best to write a program in the simplest possible way at first, then optimize it later if necessary. If you try to make all parts of your program as fast as possible when you first write it, you may end up doing a lot of unnecessary work and making the program unnecessarily complicated. For this reason, the famous computer scientist Donald Knuth once said that "premature optimization is the root of all evil".
If you do need to optimize the code to make it faster, you should be aware that in most programs a small part of the code is the performance bottleneck. Optimizing other part of the code will probably have no discernable impact on performance. So if you want to make your program faster, you need to find out where that bottleneck is.
One quick and easy technique for finding your program's bottleneck is to press Ctrl+C in a debugger, then examine the call stack. It is likely (though not guaranteed) that at the moment you pressed Ctrl+C, execution was in the region of your code that is executed most frequently, i.e. the bottleneck.
Profiling code means analying its performance to find out which regions of code are using the most time and/or memory. You can manually profile your code by adding timing code that measures the execution time of functions or loops. This usually isn't too hard, and sometimes is the most flexible approach.
Alternatively, various tools exist that can profile your program, showing how much time is spent in each function or method. Some tools are based on sampling: they periodically interrupt your program to see where the call stack is at that moment. Other tools will instrument your code by adding instructions that measure timing.
sysprof is a useful system-wide sampling profiler for Linux systems.