Regular Expressions in Python

This blog post is another entry in my series on regular expressions. In this post I will show you how to use regular expressions in Python. If you want to learn about regular expressions from the ground up, please read my introductory posts on them. This post mirrors my post on regular expressions in JavaScript. Therefore I will show you how to perform the following three tasks in Python:

  • test if a string matches a regular expression
  • find all matches in a string
  • find & replace in a string

Syntax

First I will introduce the syntax of regular expressions in Python. In contrast to JavaScript, Python does not come with a built-in literal for regular expressions. However, handling regular expressions is fairly easy in Python, too. You just need to import the module re from the standard library to work with the following functions:

  • compile: creates a regular expression object
  • match: match the regular expression at the beginning of the string
  • search: search for matches in the whole string
  • findall: returns all non-overlapping matches
  • finditer: returns an iterator to iterate through matches
  • sub: can be used to replace a match in a string

Let's start with compiling our first regular expression object.

regex = re.compile('[12]34(56)+')

Just like in JavaScript you can add flags to your regular expression object. The six flags listed below change the matching behaviour of the regular expression. You can add flags by passing them as the second parameter to the compile function. Separate them with the bitwise OR (the | operator).

  • re.IGNORECASE: perform a case-insensitive matching
  • re.LOCALE: make \w, \W, \b, \B, \s and \S dependent on the current locale (since Python 3.6 this flag can only be used with bytes patterns)
  • re.MULTILINE: perform a multi-line matching which means that ^ and $ do not stand for the beginning and the end of the whole string but for the beginning and the end of a single line
  • re.DOTALL: make the . match any character including a newline
  • re.UNICODE: make \w, \W, \b, \B, \s and \S dependent on the Unicode character properties database (in Python 3 this is already the default for string patterns)
  • re.VERBOSE: allows you to write more readable regular expressions. Whitespace in the pattern is ignored and you can add comments (does not change the matching behaviour)
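To make two of these flags more concrete, here is a minimal sketch of re.DOTALL and re.VERBOSE. The sample text and the name verbose_regex are made up for illustration; the pattern is the same one used elsewhere in this post.

```python
import re

# re.DOTALL: the dot also matches the newline character.
text = "first\nsecond"  # made-up sample text
without_dotall = re.search("first.second", text)
with_dotall = re.search("first.second", text, re.DOTALL)
print(without_dotall)           # None, because '.' does not match '\n'
print(with_dotall is not None)  # True

# re.VERBOSE: whitespace and comments inside the pattern are ignored.
verbose_regex = re.compile(r"""
    [12]    # a single 1 or 2
    34      # followed by the literal 34
    (56)+   # followed by one or more repetitions of 56
""", re.VERBOSE)
print(verbose_regex.match("13456").group())  # 13456
```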

If you want your regular expression to ignore case and match across multiple lines, you can define it like this:

regex_im = re.compile('[12]34(56)+', re.MULTILINE | re.IGNORECASE)
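The pattern above contains neither letters nor the anchors ^ and $, so the two flags do not visibly change what it matches. To see their effect, here is a small sketch with a hypothetical pattern and a made-up sample text (neither comes from the original examples):

```python
import re

text = "Java\nPYTHON\nRust"  # made-up sample text with three lines

# Hypothetical pattern: a line that consists only of the word "python".
plain = re.compile(r"^python$")
flagged = re.compile(r"^python$", re.MULTILINE | re.IGNORECASE)

print(plain.search(text))            # None
print(flagged.search(text).group())  # PYTHON
```

Without the flags, ^ and $ only anchor at the start and end of the whole string and the match is case-sensitive, so nothing is found. With both flags, the second line matches.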

Test if a pattern matches

There are several ways to test if a string matches your pattern, but the easiest one is to call the method match() on your regular expression object. This method returns a match object if the string matches the pattern at the beginning, and None otherwise. Here is an example:

match() finds the pattern at the beginning of the string
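A minimal sketch of such a match() call, assuming the regular expression compiled earlier and a made-up sample string:

```python
import re

regex = re.compile('[12]34(56)+')

# The sample string is made up; it starts with text that fits the pattern.
result = regex.match('13456 is at the start')

print(result is not None)  # True
print(result.start())      # 0
print(result.group())      # 13456
```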

This will return a match object with the position 0. Note that match() will not find matches that are not at the beginning of the string. For these cases you can use the method search():

Notice that the search() method finds the match while match() does not. The reason is that match() only looks at the beginning of the string.
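The difference can be sketched like this, again with the regular expression from above and a made-up sample string in which the number does not appear at the beginning:

```python
import re

regex = re.compile('[12]34(56)+')
text = 'the number 23456 appears later'  # made-up sample string

at_start = regex.match(text)   # only looks at the beginning of the string
anywhere = regex.search(text)  # scans the whole string

print(at_start)                            # None
print(anywhere.start(), anywhere.group())  # 11 23456
```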

Find all matches in a string

Now you know how to test if a string matches a pattern (using match()) and how to find the first match in a string (using search()). But what if you want to find every match in a string? For example, you may want to find every hyperlink in an HTML document. To get every match you can use the method finditer(), which returns an iterator that yields a match object for each match. The following snippet shows a piece of code that prints the starting position of each match.

finditer() allows you to iterate through the matches
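Here is a minimal sketch of such a loop, using the regular expression from above and a made-up sample string. It also shows findall() for comparison, because its result can be surprising when the pattern contains a group:

```python
import re

regex = re.compile('[12]34(56)+')
text = '13456 and 2345656 and 999'  # made-up sample string

# Each iteration yields one match object.
for m in regex.finditer(text):
    print(m.start(), m.group())
# 0 13456
# 10 2345656

starts = [m.start() for m in regex.finditer(text)]

# findall() returns the content of the group (56) instead of the whole
# match, because the pattern contains exactly one capturing group.
groups = regex.findall(text)
print(groups)  # ['56', '56']
```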

Find and replace in a string

Now that you have found every match in a string you may want to manipulate it. You can use the method sub() to do that. It takes two mandatory arguments: repl and string. repl can either be a string or a function. If it is a string, every match will be replaced by this string. If it is a function, this function will be called for each match with the match object as its argument. The function then has to return a string that will be inserted instead of the match. Look at the following snippet for an example:

The first sub() call replaces with a string, the second call replaces using a function.
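A minimal sketch of both calls, assuming the regular expression from above and a made-up sample string. The body of replace_by_sum is my reconstruction of a function that sums the digits of a match:

```python
import re

regex = re.compile('[12]34(56)+')
text = 'abc 13456 def 2345656'  # made-up sample string

def replace_by_sum(match):
    # Add up the digits of the matched text and wrap the sum in parentheses.
    total = sum(int(ch) for ch in match.group())
    return '(' + str(total) + ')'

replaced_by_string = regex.sub('MATCH', text)
replaced_by_function = regex.sub(replace_by_sum, text)

print(replaced_by_string)    # abc MATCH def MATCH
print(replaced_by_function)  # abc (19) def (31)
```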

In the second example the repl parameter is a function called replace_by_sum. This function loops through the characters of the match (in this case all digits) and adds them up. It then returns this sum inside parentheses. The first call just replaces the match with the string 'MATCH'.

Summary

To get started with regular expressions in Python you need to know the following things:

  1. You can create regular expression objects using re.compile('regular expression', FLAG | FLAG | ...)
  2. You can check if there is a match by calling match(string) or search(string) on the regular expression object
  3. finditer() will return an iterator that yields the match objects
  4. Using sub() you can search and replace inside a string

Further Reading

The official documentation on the python.org website is great. You can find it here for Python 3.6. Pythex.org is a great testing environment for regular expressions in Python.

If you have feedback or found an error please comment below or tweet at me. I am happy to update this blog post in the future. In the upcoming weeks I will release more blog posts about regular expressions in other programming languages.

Gist with example

The following Gist contains the examples that were shown in this post.