
Key ideas in depth: regex


Regex (or "Regular Expressions", to give it its full name) is one of those things beloved by geeks, but confusing enough to be massively off-putting. Regex is essentially a way of writing patterns to search for, for example in text or filenames. It isn't specific to Python, or indeed any language; rather, most languages implement it to (more or less) a common standard. Sooner or later you'll come across it.

In Python, it is implemented in the re library. We can broadly use regex in two forms: uncompiled as a search string, or compiled into an object. The latter is more nuanced and, when a pattern is reused many times, more efficient.
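As a rough sketch of the two forms side by side, the snippet below runs the same pattern once through the module-level re.findall function and once through a compiled pattern object. The sample sentence is made up for illustration and is not part of the course files.

```python
import re

# Hypothetical sample text, made up for this illustration.
text = "The Tiger ate Toast"

# Uncompiled: pass the pattern string straight to a module-level function.
uncompiled = re.findall(r"T\w*", text)

# Compiled: build a pattern object once, then call its methods as often as needed.
regex = re.compile(r"T\w*")
compiled = regex.findall(text)

print(uncompiled)
print(compiled)
```

Both calls should give the same list of matches; the compiled form mainly pays off when the same pattern is applied repeatedly.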

Here's an example that returns every stretch of text starting with a capital "T", which in most text means the words that begin with "T" (regextest.py; exampletext.py):

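import re

# Read the whole file into a single string; "with" closes the file for us.
with open("text.txt") as f:
    text = f.read()

# A "T" followed by zero or more word characters.
pattern = r"T\w*"
regex = re.compile(pattern)      # compile once into a reusable pattern object
result = regex.findall(text)     # list of every non-overlapping match

print(result)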

The complication with regex comes in building up the pattern itself. Here r"T\w*" breaks down as:
r: take any backslashes as backslashes (a "raw" Python string) rather than escapes
T: a literal "T" (and this doesn't mean the start of words, just anywhere with a "T")
\w: followed by a word character, that is, a letter, digit, or underscore (and not anything else like a space or comma)
*: the last thing repeated zero or more times
Overall, then, this searches for a "T" followed by zero or more word characters, ending when it hits anything else. A "T" on its own therefore still matches.
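
A quick way to see how the pattern behaves is to run it on a short string. The sketch below uses a made-up sentence, not the course's text file. The second pattern adds \b (a word boundary), which is not part of the example above; it is shown here only as one possible way to restrict matches to a "T" at the start of a word.

```python
import re

# Hypothetical sample sentence, made up for this illustration.
text = "The ATM at Tate T-junction"

# A "T" anywhere, followed by zero or more word characters.
anywhere = re.findall(r"T\w*", text)

# Assumption for illustration: \b anchors the "T" to a word boundary,
# so a "T" in the middle of a word no longer starts a match.
word_start = re.findall(r"\bT\w*", text)

print(anywhere)    # ['The', 'TM', 'Tate', 'T']
print(word_start)  # ['The', 'Tate', 'T']
```

The first result picks up "TM" from inside "ATM" and a lone "T" before the hyphen, which shows both the "anywhere" behaviour and the fact that * allows zero repeats.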

As you can see, regex isn't entirely intuitive. A good starting point for learning it is the Python regex howto, and the re library documentation is a useful reference. The Wikipedia entry is worth reading too, but note that not all languages implement regex in exactly the same way.