Week 13: Notes

regular expressions

Regular expressions are a mini-language for matching text. The commonly used regular expression syntax originated in Unix systems in the 1970s, though regular expressions have a theoretical basis in finite automata which was studied well before that. They are a practical tool that is commonly used in text processing, and all major languages including Python have regular expression libraries.

The syntax of regular expressions is mostly identical in various languages, though some details can vary from implementation to implementation. The most basic elements of regular expressions are as follows:

.
Match any character except a newline.
[…]
Match any of a set of characters. Characters may be listed individually; for example, [ace] will match the characters 'a', 'c', or 'e'. A set may also contain character ranges; for example, [0-9a-dA-D] will match any decimal digit, the lowercase characters from 'a' to 'd', or the uppercase characters from 'A' to 'D'.
[^…]
Match any character that is not in the given set.
*
Match 0 or more repetitions of the preceding expression.
+
Match 1 or more repetitions of the preceding expression.
?
Match 0 or 1 repetitions of the preceding expression.
expr1 | expr2
Match either expr1 or expr2.
(…)
Indicates the start and end of a group.
^
Anchor to the start of the source string.
$
Anchor to the end of the source string.
\b
Anchor to a word boundary, i.e. a position where a word (a sequence of characters that \w will match) starts or ends.
\d
Match any decimal digit.
\s
Match any whitespace character.
\w
Match any alphanumeric character, including underscores.

In addition, any character that is not listed above will match itself.

For example:
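Here are a few of these elements in action, using Python's re module (introduced in the next section) on sample strings of my own:

```python
import re

# . matches any character except a newline, so 'c\nt' is not matched
print(re.findall('c.t', 'cat cot c\nt'))         # ['cat', 'cot']

# [...] matches any one character from a set
print(re.findall('gr[ae]y', 'gray or grey'))     # ['gray', 'grey']

# + matches 1 or more repetitions of the preceding expression
print(re.findall('[0-9]+', 'room 12, floor 3'))  # ['12', '3']

# ^ anchors the match to the start of the string
print(re.findall('^[A-Z][a-z]*', 'Hello world')) # ['Hello']
```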

regular expressions in Python

Our quick library reference lists useful classes and methods in the re module. The function re.findall() takes a regular expression and a string to search, and returns a list of matches, each of which is a string:

>>> import re
>>> re.findall('[A-Za-z]+', 'Clear:blue_sky')
['Clear', 'blue', 'sky']

As an example, let's use regular expressions to find all the words in the file alice1.txt, which is the first chapter of Alice in Wonderland. It begins like this:

CHAPTER I

[Sidenote: _Down the Rabbit-Hole_]

ALICE was beginning to get very tired of sitting by her
sister on the bank, and of having nothing to do: once or twice she had
peeped into the book her sister was reading, but it had no pictures or
conversations in it, "and what is the use of a book," thought Alice,
"without pictures or conversations?"

We'll put all the words into a set, then print them in sorted order:

import re

words = set()

for line in open('alice1.txt'):
    for s in re.findall('[A-Za-z]+', line):
        words.add(s)

print(f'found {len(words)} words')
for w in sorted(words):
    print(w)

The program prints

found 639 words
ALICE
After
Alice
And
Antipathies
...
written
yes
you
your

Suppose that we want to find all words that begin with a capital letter. Let's modify the regular expression above as follows:

for s in re.findall('[A-Z][A-Za-z]+', line):

Now we get this output:

found 52 words
ALICE
After
Alice
And
...
White
Why
Would
Zealand

Now let's attempt to match all words that contain only capital letters. We might try this:

for s in re.findall('[A-Z]+', line):

Unfortunately this does not produce the output we expect:

found 25 words
A
ALICE
B
C
CHAPTER
D
...

"B", "C" and "D" are not words. The problem is that the regular expression above will match a single letter at the beginning of a word. For example:

>>> re.findall('[A-Z]+', 'One Two Three')
['O', 'T', 'T']

To prevent that, we can use \b, which matches only at a word boundary, i.e. the beginning or end of a word:

for s in re.findall(r'\b[A-Z]+\b', line):

Now the program produces

found 8 words
ALICE
CHAPTER
DRINK
EAT
I
MARMALADE
ME
ORANGE

Notice that the string in the call to re.findall() above is preceded with an "r" character. That indicates that it is a raw string, in which backslashes are normal characters. For example, \n in a raw string indicates two characters: a backslash and an n. (In an ordinary Python string, this would indicate a newline character.) In ordinary strings you would need to write each backslash twice (\\) to prevent it from being interpreted as an escape character:

for s in re.findall('\\b[A-Z]+\\b', line):

That is also possible, but will make your regular expressions harder to read.
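You can see the difference directly by checking string lengths:

```python
print(len('\n'))       # 1: a single newline character
print(len(r'\n'))      # 2: a backslash followed by 'n'
print('\\b' == r'\b')  # True: both contain a backslash and 'b'
```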

The text contains some emphasized words that are surrounded by underscores, such as "_very_" in this line:

There was nothing so _very_ remarkable in that; nor did Alice think it

Let's use a regular expression to capture all such words:

for s in re.findall(r'_([A-Za-z]+)_', line):

Notice the parentheses in the regular expression above, which surround a group. If a regular expression contains only one group, then in each match re.findall() will return the text of that group. Our program now produces

found 10 words
curtseying
never
not
one
poison
that
through
very
was
would

If a regular expression has multiple groups, then re.findall() will return a list of tuples. Each tuple will contain text from one group.
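A quick sketch of this behavior, with a pattern and string of my own:

```python
import re

# two groups: the word before a colon and the digits after it
pairs = re.findall(r'([a-z]+):([0-9]+)', 'alpha:1 beta:22')
print(pairs)   # [('alpha', '1'), ('beta', '22')]
```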

A regular expression may contain a backreference such as \1, which matches the text of a preceding group in the expression. \1 matches group 1, \2 matches group 2 and so on. For example, let's modify our expression to find all repeated letters:

for s in re.findall(r'([A-Za-z])\1', line):

Now the program prints

found 14 words
b
c
d
e
f
g
l
m
n
o
p
r
s
t

Those are the 14 letters that are repeated anywhere in the text. It would be more interesting to find all words that contain repeated letters. Here is a regular expression we can use:

[A-Za-z]*([A-Za-z])\1[A-Za-z]*

This will match a (possibly empty) sequence of initial letters, then a letter that is repeated, and finally a (possibly empty) sequence of letters at the end. However, if we use re.findall() with this expression, we will once again produce only individual letters, since that is what our group contains. Since we want the entire words that match, we can use re.finditer() instead. This function will produce an iteration of match objects. If m is a match object, m[0] will return the entire string that was matched. m[1] returns the contents of group 1, m[2] returns group 2 and so on. We will modify our program as follows:
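Before the full program, here is a minimal look at a match object (the sample string is my own):

```python
import re

# search for the first word containing a doubled letter
m = re.search(r'[A-Za-z]*([A-Za-z])\1[A-Za-z]*', 'a ballet class')
print(m[0])   # 'ballet': the entire matched string
print(m[1])   # 'l': the contents of group 1
```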

import re

words = set()

for line in open('alice1.txt'):
    for m in re.finditer(r'[A-Za-z]*([A-Za-z])\1[A-Za-z]*', line):
        words.add(m[0])

print(f'found {len(words)} words')
for w in sorted(words):
    print(w)

Now it produces a list of words containing double letters:

found 107 words
Illustration
Rabbit
Soon
Suddenly
Well
across
actually
...

In this lecture we have seen only a small taste of what regular expressions can do. They have many more features, which you can explore in other classes or in your own reading online.

dictionaries with defaults

Suppose we have code that builds a dictionary holding the number of occurrences of each character in a string:

    count = {}
    for c in s:
        if c in count:
            count[c] += 1
        else:
            count[c] = 1

In this code it's a bother that we have to check whether each key c is already in the dictionary. As an easier alternative, we can use the defaultdict class from Python's collections module in the standard library. When we create a defaultdict, we provide it with a default factory function. When we look up a key that is not present, Python calls this function with no arguments to produce a default value, then automatically adds a mapping from the key to that value. For example:

>>> from collections import defaultdict
>>> count = defaultdict(lambda: 0)
>>> count['red']
0
>>> count['blue'] += 1
>>> count['blue']
1

Note that instead of lambda: 0 we could just write int, since the built-in int() function returns 0 when called with no arguments:

>>> int()
0

Using a defaultdict, we can rewrite the character-counting code above more easily:

from collections import defaultdict

count = defaultdict(int)
for c in s:
    count[c] += 1
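As a runnable sketch, with a sample string of my own choosing:

```python
from collections import defaultdict

s = 'banana'
count = defaultdict(int)
for c in s:
    count[c] += 1       # missing keys default to 0, so no check is needed

print(dict(count))      # {'b': 1, 'a': 3, 'n': 2}
```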

word frequency

In the lecture we wrote a program that used a regular expression to find all words in the entire text of Alice in Wonderland, and counted the frequency (i.e. number of occurrences) of each word. We ordered the words by rank, where the most common word ("the") has rank 1, the second most common word has rank 2 and so on. Then we generated a plot with rank on the X-axis and inverse frequency (i.e. 1 / frequency) on the Y-axis. Here is the code we wrote:

from collections import defaultdict
import re
from matplotlib import pyplot as plt

freq = defaultdict(int)

for line in open('alice.txt'):
    for s in re.findall(r"[A-Za-z']+", line):
        freq[s] += 1

freqs = list(freq.values())
freqs.sort(reverse = True)
print(freqs)

plt.plot([x + 1 for x in range(len(freqs))],
         [1 / f for f in freqs])
plt.xlabel('rank')
plt.ylabel('1 / frequency')
plt.show()

The code produces this plot:

We see that the curve at the lower left looks nearly like a straight line. This illustrates a principle from linguistics known as Zipf's law, which says that in a large text the frequency of a word tends to be inversely proportional to its rank, i.e.

word frequency ~ (1 / word rank)

where the symbol ~ means "is proportional to". You may learn more about Zipf's law and related principles in a linguistics class.
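One way to see the proportionality is that rank × frequency should be roughly constant. A sketch with hypothetical Zipf-like counts (not the actual Alice data):

```python
# hypothetical word counts that follow Zipf's law exactly
freqs = [1200, 600, 400, 300, 240]

for rank, f in enumerate(freqs, start=1):
    print(rank, f, rank * f)   # rank * f is constant (1200) for every word
```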

Notice that in the chart above a large number of words (approximately half of those in the text) occur only once: they have an inverse frequency of 1.0, so their actual frequency is 1. A word that occurs only once in a text is known as a hapax legomenon. (This term comes from Greek and literally means "spoken once".) Hapax legomena are surprisingly common. You might think that a larger corpus (such as all the works of Shakespeare, or all novels written in English in the 20th century) would contain a smaller proportion of hapax legomena, but even in such huge corpora they are common. This reflects the fact that languages such as English have an enormous number of words; it's even difficult to say where a language begins and ends, because many people (perhaps even you!) invent new words for new situations.
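Finding the hapax legomena from a frequency dictionary is straightforward; here is a sketch with a toy word list of my own:

```python
from collections import defaultdict

words = 'the cat sat on the mat the end'.split()
freq = defaultdict(int)
for w in words:
    freq[w] += 1

# hapax legomena are exactly the words with frequency 1
hapaxes = [w for w, n in freq.items() if n == 1]
print(hapaxes)   # ['cat', 'sat', 'on', 'mat', 'end']
```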

writing good code

We now know enough Python to write larger programs. Especially when writing a larger program, we want to write code in a way that is structured, clear, and maintainable. Functions and classes are essential tools for this purpose.

Here are three general rules to follow for writing good code:

1. Don't repeat yourself.

Beginning programmers often write code that looks like this:

if x > 0:
     20 lines of code 
else:
     the same 20 lines of code, with a few small changes 

Or like this:

if key == 'w':   # up
     12 lines of code 
elif key == 'x':  # down
     12 very similar lines of code 
elif key == 'a':  # left
     12 very similar lines of code 
elif key == 'd':  # right
     12 very similar lines of code 

This code is poor. It is hard to read: the differences between the parallel blocks of code may be important, but hard to spot. And it is hard to maintain. Every time you change one of the parallel blocks of code, you must change the other(s) in the same way. That is a chore, and it's also easy to forget to do that (or to do it incorrectly).

In some cases, you can eliminate this redundancy by writing code in a more general way, e.g. by looping over four possible direction vectors rather than having four blocks of code for the cases up, left, right, and down. Another useful tool is functions. You can often factor out parallel blocks of code into a function with parameters representing the (possibly small) differences in behavior between the parallel blocks. Then in place of the first code example above you can write

if x > 0:
    my_fun( arguments )
else:
    my_fun( different arguments )
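The direction-vector idea mentioned above can be sketched like this (the key bindings and move function are my own illustration):

```python
# one table of direction vectors replaces four parallel branches
DIRS = {'w': (0, -1), 'x': (0, 1), 'a': (-1, 0), 'd': (1, 0)}

def move(pos, key):
    dx, dy = DIRS[key]
    return (pos[0] + dx, pos[1] + dy)

print(move((5, 5), 'w'))   # (5, 4)
print(move((5, 5), 'd'))   # (6, 5)
```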

2. Make variables as local as possible.

Broadly speaking, a local variable is better than an attribute, and an attribute is better than a global variable.

The scope of a variable is the portion of a program's text in which the variable may be read or written. A local variable's scope is the function in which it appears. An attribute's scope is the class in which the attribute appears, at least if you only set the attribute in code in that class (which many languages enforce, though Python does not). A global variable's scope is the entire program. Another way of phrasing this rule is that a variable should have as small a scope as possible. That makes a program easier to understand, since there are fewer places where the variable's value might change.

This rule does not apply to constants, i.e. read-only data. A global constant is unproblematic; it will never change, so its meaning is completely clear.

3. Every line and function should fit on a single screen.

I recommend limiting lines to about 100 characters. Lines that are longer than this are hard to read, especially since they may wrap in an editor window. I also recommend that you configure your editor to display a vertical gray line at the 100th column position, so that you can see if you've written a line that is too long. In Visual Studio Code, you can do that via the "editor.rulers" setting. Specifically, you can add this line to your settings.json file:

"editor.rulers": [ 100 ],

In my opinion, functions should generally be limited to about 50 lines. Functions that are much longer than this quickly become hard to understand, and are hard to read since you may have to scroll up and down to see them.