Regular expressions are a mini-language for matching text. The commonly used regular expression syntax originated in Unix systems in the 1970s, though regular expressions have a theoretical basis in finite automata which was studied well before that. They are a practical tool that is commonly used in text processing, and all major languages including Python have regular expression libraries.
The syntax of regular expressions is mostly the same from language to language, though some details can vary from implementation to implementation. The most basic elements of regular expressions are as follows:
.  matches any single character.

[…]  matches any character in a set. For example, [ace] will
match the characters 'a', 'c', or 'e'. A set may also contain
character ranges; for example, [0-9a-dA-D]
will match any decimal digit, the lowercase characters from 'a' to
'd', or the uppercase characters from 'A' to 'D'.

[^…]  matches any character not in a set.

?  matches the preceding expression 0 or 1 times.

*  matches the preceding expression 0 or more times.

+  matches the preceding expression 1 or more times.

expr1 | expr2  matches either expr1 or expr2.

^  matches at the beginning of a line.

$  matches at the end of a line.

\b  matches at a word boundary, i.e. at a position where a word (a
sequence of characters that \w will match) starts or ends.

\d  matches any decimal digit.

\w  matches any word character: a letter, a digit, or an underscore.

In addition, any character that is not listed above will match itself.
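A few of these elements can be demonstrated with Python's re module (introduced below); the sample strings here are our own:

```python
import re

# '.' matches any single character (except a newline, by default)
assert re.findall('a.c', 'abc adc a\nc') == ['abc', 'adc']

# a set matches any one of the listed characters
assert re.findall('[ace]', 'abcde') == ['a', 'c', 'e']

# a negated set matches any character NOT in the set
assert re.findall('[^ace]', 'abcde') == ['b', 'd']

# '|' matches either of two expressions
assert re.findall('cat|dog', 'cat dog bird') == ['cat', 'dog']

# '^' anchors a match to the beginning
assert re.findall('^abc', 'abcabc') == ['abc']

# \d matches a digit; \b matches only at a word boundary
assert re.findall(r'\d', 'a1b22') == ['1', '2', '2']
assert re.findall(r'\bcat\b', 'cat catalog') == ['cat']
```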
For example:
a*b+ will match
a sequence of 0 or more a characters,
followed by a sequence of 1 or more b characters. For example, this
regular expression will match aabbb or
bb, but not aacbb
or aa.
[A-Za-z]+ will
match any word consisting of uppercase or lowercase letters.
[a-z]+\s+[a-z]+\s+[a-z]+
will match three lowercase words, separated by whitespace.
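These example patterns can be checked with Python's re module; a small sketch using re.fullmatch(), which succeeds only if the whole string matches:

```python
import re

# a*b+: zero or more a's followed by one or more b's
assert re.fullmatch('a*b+', 'aabbb') is not None
assert re.fullmatch('a*b+', 'bb') is not None
assert re.fullmatch('a*b+', 'aacbb') is None
assert re.fullmatch('a*b+', 'aa') is None

# three lowercase words separated by whitespace
assert re.fullmatch(r'[a-z]+\s+[a-z]+\s+[a-z]+', 'one two three') is not None
assert re.fullmatch(r'[a-z]+\s+[a-z]+\s+[a-z]+', 'one two') is None
```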
Our quick library
reference lists useful classes and methods in the re
module. The function re.findall()
takes a regular expression and a string to search, and returns a list
of matches, each of which is a string:
>>> re.findall('[A-Za-z]+', 'Clear:blue_sky')
['Clear', 'blue', 'sky']
As an example, let's use regular expressions to find all the words in
the file alice1.txt,
which is the first chapter of Alice in
Wonderland. It begins like this:
CHAPTER I [Sidenote: _Down the Rabbit-Hole_] ALICE was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, "and what is the use of a book," thought Alice, "without pictures or conversations?"
We'll put all the words into a set, then print them in sorted order:
import re

words = set()
for line in open('alice1.txt'):
    for s in re.findall('[A-Za-z]+', line):
        words.add(s)

print(f'found {len(words)} words')
for w in sorted(words):
    print(w)

The program prints
found 639 words
ALICE
After
Alice
And
Antipathies
...
written
yes
you
your
Suppose that we want to find all words that begin with a capital letter. Let's modify the regular expression above as follows:
    for s in re.findall('[A-Z][A-Za-z]+', line):

Now we get this output:
found 52 words
ALICE
After
Alice
And
...
White
Why
Would
Zealand
Now let's attempt to match all words that contain only capital letters. We might try this:
    for s in re.findall('[A-Z]+', line):

Unfortunately this does not produce the output we expect:
found 25 words
A
ALICE
B
C
CHAPTER
D
...
"B", "C" and "D" are not words. The problem is that the regular expression above will match a single letter at the beginning of a word. For example:
>>> re.findall('[A-Z]+', 'One Two Three')
['O', 'T', 'T']
To prevent that, we can use \b, which
matches only at a word boundary,
i.e. the beginning or end of a word:
    for s in re.findall(r'\b[A-Z]+\b', line):

Now the program produces

found 8 words
ALICE
CHAPTER
DRINK
EAT
I
MARMALADE
ME
ORANGE
Notice that the string in the call to re.findall()
above is preceded with an "r" character. That indicates
that it is a raw string, in which backslashes are normal
characters. For example, \n in a raw
string indicates two characters: a backslash and an n. (In an
ordinary Python string, this would indicate a newline character.) In
ordinary strings you would need to write each backslash twice (\\)
to prevent it from being interpreted as an escape character:
    for s in re.findall('\\b[A-Z]+\\b', line):

That is also possible, but will make your regular expressions harder to read.
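The difference between raw and ordinary strings is easy to verify; a quick sketch:

```python
# An ordinary string interprets \n as a single newline character;
# a raw string keeps the backslash and the 'n' as two separate characters.
ordinary = '\n'
raw = r'\n'
assert len(ordinary) == 1
assert len(raw) == 2
assert raw == '\\n'   # the same two characters, written with an escaped backslash
```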
The text contains some emphasized words that are
surrounded by underscores, such as "_very_"
in this line:
There was nothing so _very_ remarkable in that; nor did Alice think it
Let's use a regular expression to capture all such words:
    for s in re.findall(r'_([A-Za-z]+)_', line):
Notice the parentheses in the regular expression above, which
surround a group. If a regular expression contains only one
group, then in each match re.findall()
will return the text of that group. Our program now produces
found 10 words
curtseying
never
not
one
poison
that
through
very
was
would
If a regular expression has multiple groups, then re.findall()
will return a list of tuples. Each tuple will contain text from one
group.
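For example, here is a small sketch with two groups (the pattern and sample string are our own, not from the text above):

```python
import re

# With two groups, each match is a (group1, group2) tuple.
pairs = re.findall(r'(\w+)=(\w+)', 'x=1 y=2')
assert pairs == [('x', '1'), ('y', '2')]
```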
A regular expression may contain a backreference such as \1, which matches the text of a preceding group in the expression. \1 matches group 1, \2 matches group 2 and so on. For example, let's modify our expression to find all repeated letters:
    for s in re.findall(r'([A-Za-z])\1', line):
Now the program prints
found 14 words
b
c
d
e
f
g
l
m
n
o
p
r
s
t
Those are the 14 letters that are repeated anywhere in the text. It would be more interesting to find all words that contain repeated letters. Here is a regular expression we can use:
[A-Za-z]*([A-Za-z])\1[A-Za-z]*
This will match a (possibly empty) sequence of initial letters, then
a letter that is repeated, and finally a (possibly empty) sequence of
letters at the end. However, if we use re.findall()
with this expression, we will once again produce only individual
letters, since that is what our group contains. Since we want the
entire words that match, we can use re.finditer()
instead. This function will produce an iteration of match objects.
If m is a match object, m[0]
will return the entire string that was matched. m[1]
returns the contents of group 1, m[2]
returns group 2 and so on. We will modify our program as follows:
import re

words = set()
for line in open('alice1.txt'):
    for m in re.finditer(r'[A-Za-z]*([A-Za-z])\1[A-Za-z]*', line):
        words.add(m[0])

print(f'found {len(words)} words')
for w in sorted(words):
    print(w)

Now it produces a list of words containing double letters:
found 107 words
Illustration
Rabbit
Soon
Suddenly
Well
across
actually
...
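To see exactly what finditer() returns, here is a standalone demonstration (the sample string is our own):

```python
import re

# finditer() yields match objects.  For a match object m, m[0] is the
# entire matched text, m[1] is the text of group 1, and m.start()/m.end()
# give the match's position within the string.
for m in re.finditer(r'[A-Za-z]*([A-Za-z])\1[A-Za-z]*', 'a green apple'):
    print(m[0], m[1], m.start(), m.end())
# prints:
#   green e 2 7
#   apple p 8 13
```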
In this lecture we have seen only a small taste of what regular expressions can do. They have many more features, which you may explore in other classes or in your own reading online.
Suppose we have code that builds a dictionary holding the number of occurrences of each character in a string:
count = {}
for c in s:
    if c in count:
        count[c] += 1
    else:
        count[c] = 1
In this code it's a bother that we have to check whether each key c
is already in the dictionary. As an easier alternative, we can use
the defaultdict class that's built into
Python's standard library. When we create a defaultdict,
we provide it with a default value function. When we look up a key in
a defaultdict and it's not found, Python
will call this function, passing no arguments. The function will
return a default value, and Python will then automatically add a
mapping from the key to that value. For example:
>>> from collections import defaultdict
>>> count = defaultdict(lambda: 0)
>>> count['red']
0
>>> count['blue'] += 1
>>> count['blue']
1
Note that instead of lambda: 0 we could
just write int, since the built-in int()
function, when called with no arguments, returns 0:
>>> int()
0
Using a defaultdict, we can rewrite the
character-counting code above more easily:
from collections import defaultdict

count = defaultdict(int)
for c in s:
    count[c] += 1

In the lecture we wrote a program that used a regular expression to find all words in the entire text of Alice in Wonderland, and counted the frequency (= number of occurrences) of each word in the text. We ordered the words by rank, where the most common word ("the") has rank 1, the second most common word has rank 2, and so on. Then we generated a plot with rank on the X-axis, and inverse frequency (i.e. 1 / frequency) on the Y-axis. Here is the code we wrote:
from collections import defaultdict
import re
from matplotlib import pyplot as plt

freq = defaultdict(int)
for line in open('alice.txt'):
    for s in re.findall(r"[A-Za-z']+", line):
        freq[s] += 1

freqs = list(freq.values())
freqs.sort(reverse=True)
print(freqs)

plt.plot([x + 1 for x in range(len(freqs))],
         [1 / f for f in freqs])
plt.xlabel('rank')
plt.ylabel('1 / frequency')
plt.show()

The code produces this plot:
We see that the curve at the lower left looks nearly like a straight line. This illustrates a principle from linguistics known as Zipf's law, which says that in a large text the frequency (i.e. number of occurrences) of a word tends to be inversely correlated with its rank, i.e.
word frequency ~ (1 / word rank)
where the symbol ~ means "is proportional to". You may learn more about Zipf's law and related principles in a linguistics class.
Notice that in the chart above a large number of words (approximately half of those in the text) occur only once: they have an inverse frequency of 1.0, so their actual frequency (number of occurrences) is 1. A word that occurs only once in a text is known as a hapax legomenon. (This term comes from Greek and literally means "spoken once".) Hapax legomena are surprisingly common. You might think that a larger corpus (such as all the works of Shakespeare, or all novels written in English in the 20th century) would contain a smaller proportion of hapax legomena, but even in such huge corpora they are common. This reflects the fact that languages such as English have an enormous number of words, and it's even difficult to say where a language begins and ends, because many people (perhaps even you!) invent new words for new situations.
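We can check the claim about hapax legomena directly. Here is a sketch of a function that returns the proportion of distinct words occurring exactly once; it uses the same word-splitting regular expression as the plotting code above, and applying it to alice.txt (as earlier) is shown in a comment:

```python
from collections import defaultdict
import re

def hapax_proportion(text):
    # Count each word's frequency, then return the fraction of
    # distinct words that occur exactly once.
    freq = defaultdict(int)
    for s in re.findall(r"[A-Za-z']+", text):
        freq[s] += 1
    once = sum(1 for f in freq.values() if f == 1)
    return once / len(freq)

# e.g. hapax_proportion(open('alice.txt').read())
```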
We now know enough Python to write larger programs. Especially when writing a larger program, we want to write code in a way that is structured, clear, and maintainable. Functions and classes are essential tools for this purpose.
Here are three general rules to follow for writing good code:
Beginning programmers often write code that looks like this:

if x > 0:
    … 20 lines of code …
else:
    … the same 20 lines of code, with a few small changes …

Or like this:
if key == 'w': # up
    … 12 lines of code …
elif key == 'x': # down
    … 12 very similar lines of code …
elif key == 'a': # left
    … 12 very similar lines of code …
elif key == 'd': # right
    … 12 very similar lines of code …

This code is poor. It is hard to read: the differences between the parallel blocks of code may be important, but hard to spot. And it is hard to maintain. Every time you change one of the parallel blocks of code, you must change the other(s) in the same way. That is a chore, and it's also easy to forget to do that (or to do it incorrectly).
In some cases, you can eliminate this redundancy by writing code in a more general way, e.g. by looping over four possible direction vectors rather than having four blocks of code for the cases up, left, right, and down. Another useful tool is functions. You can often factor out parallel blocks of code into a function with parameters representing the (possibly small) differences in behavior between the parallel blocks. Then in place of the first code example above you can write
if x > 0:
    my_fun(… arguments …)
else:
    my_fun(… different arguments …)

Broadly speaking, a local variable is better than an attribute, and an attribute is better than a global variable.
The scope of a variable is the portion of a program's text in which the variable may be read or written. A local variable's scope is the function in which it appears. An attribute's scope is the class in which the attribute appears, at least if you only set the attribute in code in that class (which many languages enforce, though Python does not). A global variable's scope is the entire program. Another way of phrasing this rule is that a variable should have as small a scope as possible. That makes a program easier to understand, since there are fewer places where a variable's value might change.
This rule does not apply to constants, i.e. read-only data. A global constant is unproblematic; it will never change, so its meaning is completely clear.
I recommend limiting lines to about 100 characters. Lines that are longer than this are hard to read, especially since they may wrap in an editor window. I also recommend that you configure your editor to display a vertical gray line at the 100th column position, so that you can see if you've written a line that is too long. In Visual Studio Code, you can do that via the "editor.rulers" setting. Specifically, you can add this line to your settings.json file:
"editor.rulers": [ 100 ],
In my opinion, functions should generally be limited to about 50 lines. Functions that are much longer than this quickly become hard to understand, and are hard to read since you may have to scroll up and down to see them.