Complex Regular Expressions

Welcome to the second part of my tutorial series on regular expressions (or RegEx). In the first part I introduced the basic concepts of regular expressions like 

  • matching arbitrary characters
  • groups
  • sets and alternations
  • repetitions and length

In this blog post I will introduce more complex regular expressions that will help you to solve tricky tasks that would need lots of lines of code if you solved them traditionally using if statements and loops. I will talk about greedy expressions, references, lookaheads and lookbehinds. You will also find a 'cheat sheet' that summarizes the content of the first two parts of the blog series.

Greedy Expressions

In the first part of this series I introduced the dot . as a placeholder that matches an arbitrary character. Placing a star * behind the dot gives you /.*/, a RegEx that matches everything. You have to be careful with such expressions because they can easily match too much. Look at the following example: assume that you want to read the properties of an HTML tag into variables. The first group should match the value of the first property and any additional properties should be matched by group 2. This is a common task when parsing websites. At first glance it looks easy and you could create an expression like this:

/<htmltag property1="(.*)" (.*)>/

This looks promising at first since the closing angle bracket finishes the expression. But look at the following image I created with the regular expression tool RegExr.com:

The regular expression is too greedy and matches too much.

As you can see, the second group matches too much. This can be very dangerous if you use regular expressions to parse user input, because an adversary could inject malicious code into your logic. To prevent this behaviour you should restrict the greedy expression. You can do that by placing a ? behind the star. This makes the expression lazy: it matches as few characters as possible and stops at the first position where the rest of the RegEx can match. This could look like this:

A question mark behind the star makes the expression lazy, so it stops at the first possible match.
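To make the difference concrete, here is a minimal Python sketch using the standard re module. The sample tag and its property values are made up for illustration; only the two expressions come from the text above.

```python
import re

# Hypothetical sample input, invented for this illustration.
html = '<htmltag property1="a" property2="b">text</htmltag>'

# Greedy version: the second group runs up to the LAST closing bracket.
greedy = re.search(r'<htmltag property1="(.*)" (.*)>', html)

# Lazy version: the ? makes each group stop as early as possible.
lazy = re.search(r'<htmltag property1="(.*?)" (.*?)>', html)

print(greedy.group(2))  # property2="b">text</htmltag
print(lazy.group(2))    # property2="b"
```

The greedy second group swallows the tag content and most of the closing tag, while the lazy one ends at the first closing angle bracket.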

In the next part you will learn about forbidden characters, another way of restricting greedy expressions.

Forbidden Characters

Often you need to forbid some characters in the input, and listing all allowed characters instead would be very impractical. Remember that you can create a list of allowed characters using square brackets (see part 1). Placing a ^ directly after the opening square bracket negates the list, so [^abc] matches any character except a, b or c. With this new tool we can solve the problem of greedy expressions in a different way: simply match everything that is not a closing angle bracket, i.e. [^>]*. This will terminate the greedy expression at the correct spot.

Using forbidden characters to restrict a greedy expression.
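As a rough Python sketch of this idea (the sample tag is again invented for illustration), the groups can be limited with negated sets instead of lazy quantifiers:

```python
import re

# Hypothetical sample input, invented for this illustration.
html = '<htmltag property1="a" property2="b">text</htmltag>'

# [^"]* cannot run past the closing quote, [^>]* cannot run past the bracket.
match = re.search(r'<htmltag property1="([^"]*)" ([^>]*)>', html)

print(match.groups())  # ('a', 'property2="b"')
```

The quantifiers are still greedy here, but they simply run out of allowed characters at the right spot.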

References

Up to this point you have learned a lot about regular expressions, but can you match every word that ends with the same letter as it starts with? To do something like this you have to know the first letter beforehand. You can solve such problems using references. You can reference group number x with \x. The following RegEx matches strings that start and end with the same letter (I only consider lower case letters here):

/([a-z])[a-z]*\1/

As you can see the reference points to the first group which matches exactly one letter from the alphabet. Look at the following picture to see how this RegEx performs:
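If you want to try this yourself, the following Python sketch applies the RegEx to a short sample sentence. The sentence is made up for illustration. Note that I use finditer and group(0) here, because findall would only return the content of the first group.

```python
import re

pattern = re.compile(r'([a-z])[a-z]*\1')

# Hypothetical sample text, invented for this illustration.
text = 'level banana test'

# group(0) is the whole match, group(1) would only be the first letter.
matches = [m.group(0) for m in pattern.finditer(text)]

print(matches)  # ['level', 'anana', 'test']
```

The word banana does not start and end with the same letter, but the part anana does, so it is matched as well.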

It works well already, but we want only whole words and not parts of them. To do this we could insert spaces at the beginning and the end. However, these spaces would be matched, too, and you would have to account for that in your code. Another way is to use a special symbol. I will introduce some of the most common special symbols next.

Special Symbols and Escaping

In the last chapter we wanted to find words with a certain property (starting and ending with the same letter). There is a special symbol to identify the borders of words: just place \b in front of and behind your RegEx. Unlike a space, \b does not become part of the match. See the differences in the images:
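A small Python sketch of the same idea, with a sample sentence that is made up for illustration:

```python
import re

# \b on both sides only allows matches that cover a whole word.
pattern = re.compile(r'\b([a-z])[a-z]*\1\b')

# Hypothetical sample text, invented for this illustration.
text = 'level banana test'

whole_words = [m.group(0) for m in pattern.finditer(text)]

print(whole_words)  # ['level', 'test']
```

Parts of words like anana are no longer matched, and the results contain no surrounding spaces.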

The following table shows some of the most common special symbols:

Special symbol   Description
\w               Matches a letter, digit or underscore. Equivalent to [a-zA-Z0-9_].
\W               Matches everything except a letter, digit or underscore. Equivalent to [^a-zA-Z0-9_].
\d               Matches a digit. Equivalent to [0-9].
\D               Matches everything except digits. Equivalent to [^0-9].
\b               Stands for the border of a word.
\B               Stands for everything except the border of a word.
\s               Stands for white space like spaces, tabs and returns.
\S               Stands for everything except white space.
\ in general     The backslash can be used to escape characters that would otherwise be a special symbol. Examples: \. to match a dot, \* to match a star or \( to match an opening bracket.
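Here is a short Python sketch that tries out a few of these symbols. All sample strings are made up for illustration.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = 'Order 42 costs 3.50 (approx.)'

# \d+ matches runs of digits, so the price is split at the dot.
digits = re.findall(r'\d+', text)

# Escaping the dot with a backslash matches a literal dot.
prices = re.findall(r'\d+\.\d+', text)

# \w also matches the underscore.
words = re.findall(r'\w+', 'snake_case rocks')

# \s+ matches spaces, tabs and line breaks alike.
parts = re.split(r'\s+', 'a b\tc\nd')

print(digits)  # ['42', '3', '50']
print(prices)  # ['3.50']
print(words)   # ['snake_case', 'rocks']
print(parts)   # ['a', 'b', 'c', 'd']
```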

Lookaheads and Lookbehinds

In the last chapter you learned how to find the borders of a word without matching the whitespace around it. But what if you want to do the same for another symbol? For example, you want to get every word that is surrounded by commas. So you want to get a matching like this:

 

these,are,some,words

Note that the words 'these' and 'words' should not get matched in this example. Another example would be finding certain digit patterns in a very long digit string. You could try to match only numbers of length 7 which are surrounded by 3-digit strings that contain only 1, 2 or 3. An example would be this:

00000003126785339221000000000000000

This task could come up if you want to parse data that was sent by an IoT device. To solve these problems you have to use lookaheads and lookbehinds. With these techniques you can require something to exist directly in front of the match (lookbehind, because from the start of the match we look behind us) or directly after it (lookahead). The required text is only checked and does not become part of the match.

To solve the first problem we can create the following RegEx:

/(?<=,)([\w]+)(?=,)/

The lookbehind is the first bracket and the lookahead is the last one. As you can see, the syntax is (?<=.) for a lookbehind and (?=.) for a lookahead. Note that you have to replace the dot with the characters you want to match. Many RegEx engines do not support lookbehinds of variable length, so something like (?<=[123]+) will not work there. Lookaheads do not have this restriction.
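The following Python sketch runs the comma example. For comparison I added a second expression that matches the commas directly; that variant is my own illustration and not part of the solution above.

```python
import re

text = 'these,are,some,words'

# Lookbehind and lookahead only check for the commas, they do not consume them.
with_lookaround = re.findall(r'(?<=,)(\w+)(?=,)', text)

# Comparison (illustration only): the commas become part of the match,
# so the comma after 'are' is no longer available for 'some'.
with_commas = re.findall(r',(\w+),', text)

print(with_lookaround)  # ['are', 'some']
print(with_commas)      # ['are']
```

Because the lookarounds leave the commas untouched, two neighbouring words can share the same comma.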

The solution for the second problem looks like this:

/(?<=[123]{3})([\d]{7})(?=[123]{3})/
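A quick check in Python, using the digit string from above. I wrote \d instead of [\d], which is equivalent here. The lookbehind has a fixed length of three characters, so it should also work in engines that reject lookbehinds of variable length.

```python
import re

data = '00000003126785339221000000000000000'

# Seven digits that are preceded and followed by three digits from 1, 2 or 3.
found = re.findall(r'(?<=[123]{3})(\d{7})(?=[123]{3})', data)

print(found)  # ['6785339']
```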

Negative Lookaheads and Lookbehinds

Instead of positively looking ahead and behind you can forbid certain symbols, too. This would be a negative lookahead or a negative lookbehind. The syntax is similar to positive lookaheads and lookbehinds:

  • (?<!.) for a negative lookbehind
  • (?!.) for a negative lookahead

A negative lookahead could be useful to parse email addresses. Assume that you have a blacklist of top-level domains you do not want to process emails from. Then you change an email matching RegEx from

/^[^@\s]+@.+\.[^.]+$/

to

/^[^@\s]+@.+\.(?!(abc$))[^.]+$/

This RegEx will now match only email addresses that do not have the top-level domain '.abc'.
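Here is a minimal Python sketch of the blacklist check. The helper name is_allowed and the third address are made up for illustration.

```python
import re

# The negative lookahead rejects addresses whose last part is exactly 'abc'.
pattern = re.compile(r'^[^@\s]+@.+\.(?!(abc$))[^.]+$')

def is_allowed(address: str) -> bool:
    """Hypothetical helper: True if the address matches the RegEx."""
    return pattern.match(address) is not None

print(is_allowed('hhllcks@gmail.com'))   # True
print(is_allowed('hhllcks@mymail.abc'))  # False
print(is_allowed('user@mail.abcd'))      # True
```

The $ inside the lookahead matters: without it, every top-level domain that merely starts with abc would be rejected as well.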

Examples

Feel free to test your complex regular expressions in an online editor like regexr.com. However, keep in mind that lookbehinds were only added to JavaScript with ECMAScript 2018 and won't work in older browsers and engines. More about this in the next part of this blog series. The following table shows some examples for the regular expressions in this post.

String:  Just match everything. Greedy as your are ... ="§)$/"§=($
RegEx:   /.*/
Comment: Greedy expressions match everything.

String:  Just match everything. Greedy as your are ... ="§)$/"§=($
RegEx:   /[^.]*/
Comment: Forbidden characters help to restrict greedy expressions.

String:  00000123Hey, What's up?12300
RegEx:   /(123).*\1/
Comment: Reference the content of a group with a backslash and the group number.

String:  00000123Hey, What's up?12300
RegEx:   /(?<=123).*(?=123)/
Comment: Lookaheads and lookbehinds can be used to check if a match exists in front of or behind the RegEx.

Strings: hhllcks@gmail.com, hhllcks@mymail.abc
RegEx:   /^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.(?!abc$)[a-zA-Z0-9-.]+$/
Comment: A handy RegEx for e-mail addresses that do not have the top-level domain .abc.
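If you prefer to check such examples in code, here is a small Python sketch that compares the reference with the lookaround variant on the same sample string:

```python
import re

sample = "00000123Hey, What's up?12300"

# With a reference, both occurrences of 123 are part of the match.
with_reference = re.search(r'(123).*\1', sample).group(0)

# With lookbehind and lookahead, only the text in between is matched.
with_lookaround = re.search(r'(?<=123).*(?=123)', sample).group(0)

print(with_reference)   # 123Hey, What's up?123
print(with_lookaround)  # Hey, What's up?
```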

Summary

This was the second part of my series on regular expressions. The following table summarizes the contents of this post.

Character        Description
.*               Greedy expression
^                Negates sets of allowed characters to forbid them.
\%GroupNumber%   Use a backslash and the group number to reference the characters that were matched by the group.
\w               Matches a letter, digit or underscore. Equivalent to [a-zA-Z0-9_].
\W               Matches everything except a letter, digit or underscore. Equivalent to [^a-zA-Z0-9_].
\d               Matches a digit. Equivalent to [0-9].
\D               Matches everything except digits. Equivalent to [^0-9].
\b               Stands for the border of a word.
\B               Stands for everything except the border of a word.
\s               Stands for white space like spaces, tabs and returns.
\S               Stands for everything except white space.
\ in general     The backslash can be used to escape characters that would otherwise be a special symbol. Examples: \. to match a dot, \* to match a star or \( to match an opening bracket.
(?<=%)           Positive lookbehind for %
(?<!%)           Negative lookbehind for %
(?=%)            Positive lookahead for %
(?!%)            Negative lookahead for %

In the next parts of this blog series I will show you how to use regular expressions in practice. They will explain search & replace operations in large text files and regular expressions in several programming languages like JavaScript, Python and ABAP.

If you have any comments on how to improve this tutorial just comment below or tweet at me.