Basic Regular Expressions

This is the first part of a three-part series on regular expressions. It introduces the concept of regular expressions and shows you the most common usages. With regular expressions you can greatly reduce the lines of code needed for things like string search, validation or replacement.

Introducing Regular Expressions

Regular expressions (or 'RegEx' for short) are a language for defining string patterns. These patterns can be used in programming to replace many lines of code with a single line. The following three usages of regular expressions are the most common:

  • String Search: Have you ever tried to manually code some logic that searches for a specific string inside another string? Most programming languages offer helper functions for that, but often there are requirements that a simple search cannot solve. An example would be: you want to find a credit card number inside a string but you do not know the number.
  • String Validation: A related problem is string validation. Your program gets a string from a certain source (input by the user or another program) and has to validate it. Maybe you created a form and want to check the phone number, email address or credit card number. You could get more technical, too, and try to validate requests to your API.
  • Replacing inside a string: Imagine you have a really large file and you have to place quotes around the numbers in this file. The numbers can have an arbitrary length.

Think about coding a program that solves these problems. Maybe you would loop through the characters of the string and fire up some logic if you see certain characters. In any case: the lines of code would go through the roof. Regular expressions are here to help you with these problems.
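To give a taste of what this looks like in practice, here is a minimal sketch of the third task (quotes around numbers) in Python with its standard re module. The choice of Python and the sample text are my assumptions for illustration only; the pattern syntax itself is explained step by step in the rest of this part.

```python
import re

# Hypothetical sample text; imagine the content of a really large file instead.
text = "Order 66 contains 12 items and 3 boxes."

# [0-9]+ matches one or more digits. In the replacement, \g<0> stands for
# the whole match, so every number is wrapped in double quotes.
quoted = re.sub(r"[0-9]+", r'"\g<0>"', text)

print(quoted)
```

A hand-written loop over the characters would need to track where each number starts and ends; here that logic lives in the pattern.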

Regular expressions originate from theoretical computer science, and most computer scientists learn about them during their studies. However, a lot of people do not transfer this knowledge into their day-to-day work. Today a lot of programming languages have integrated regular expressions into their libraries or even their syntax.

Notation

I will write strings in single quotes (like 'this') and regular expressions in slashes (like /this/).

A first example

First I have to warn you: regular expressions can look very frightening and complex, to the point of looking like someone fell asleep during a lecture and dropped their head on the keyboard. But don't be afraid: we will start slow and build up our knowledge of regular expressions. We will start with a simple task. If you want to match a string you just write the string. Let's say you want to check if the string from the input contains the word 'dog'. Then you can use the regular expression /dog/. Assume your input is:

The dog jumped over the fence.

The regular expression /dog/ will match the word 'dog' inside this input:

The dog jumped over the fence.

Next we introduce the characters that mark the beginning and the end of the string, because maybe you want to make sure that your string is indeed just 'dog'. You can fulfil this requirement with the circumflex accent (^) and the dollar sign ($): ^ matches at the beginning of the string and $ at the end. So the regular expression /^dog$/ will only match one string: 'dog'.
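The difference between /dog/ and /^dog$/ can be tried out with a few lines of Python. This is only a sketch: Python and its re module are my choice for the examples, and the patterns are written as raw strings instead of between slashes.

```python
import re

sentence = "The dog jumped over the fence."

# /dog/ finds the word anywhere inside the input.
found = re.search(r"dog", sentence)
print(found.group(), found.start(), found.end())  # matched text and its position

# /^dog$/ is anchored at both ends, so the whole input has to be 'dog'.
sentence_is_dog = re.search(r"^dog$", sentence) is not None
word_is_dog = re.search(r"^dog$", "dog") is not None
print(sentence_is_dog, word_is_dog)
```

The unanchored pattern reports a match in the middle of the sentence, while the anchored one rejects the sentence and accepts only the bare word.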

Arbitrary Characters

To match any character you can use the dot. So if you want to make sure that your string is of length 3 and ends in 'og' you can use the following RegEx: /^.og$/. Without the anchors, /.og/ would also match inside longer strings such as 'frog'. If you want to match the dot itself, you have to escape it with a backslash like this: /^\.og$/. This RegEx matches only the string '.og'.
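Here is a small Python sketch of the dot and the escaped dot. The candidate strings are made up for illustration, and the patterns are anchored with ^ and $ so that the length of the string is actually checked.

```python
import re

# Hypothetical test strings.
candidates = ["dog", "fog", ".og", "og", "frog"]

# Anchored dot: exactly one arbitrary character followed by 'og'.
any_char = [c for c in candidates if re.search(r"^.og$", c)]

# Escaped dot: only a literal '.' is accepted in front of 'og'.
literal_dot = [c for c in candidates if re.search(r"^\.og$", c)]

# Without anchors the pattern may also match inside a longer string.
unanchored = [c for c in candidates if re.search(r".og", c)]

print(any_char)
print(literal_dot)
print(unanchored)
```

Only the unanchored variant accepts 'frog', because it is satisfied by the 'rog' inside it.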

Groups

Now that you have learned how to match a string we can add groups. Groups are very useful in more complicated regular expressions and they are created with parentheses. Groups themselves do not change what a RegEx matches. So the regular expressions

  • /^(dog)$/
  • /^(do)(g)$/
  • /^(d)(o)(g)$/
  • /^(d)og$/

all match the same string: 'dog'. Notice how weird the RegEx are starting to look now? Don't worry, it will get even weirder soon.
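If you want to convince yourself that the four variants behave identically, a quick check like the following works. It is a sketch in Python, and the second test string 'dogs' is my own addition to show a non-match.

```python
import re

patterns = [r"^(dog)$", r"^(do)(g)$", r"^(d)(o)(g)$", r"^(d)og$"]

# For each pattern, check one string that should match and one that should not.
results = [
    (bool(re.search(p, "dog")), bool(re.search(p, "dogs")))
    for p in patterns
]

for pattern, (matches_dog, matches_dogs) in zip(patterns, results):
    print(pattern, matches_dog, matches_dogs)
```

Every pattern accepts 'dog' and rejects 'dogs', no matter where the parentheses are placed.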

Sets and alternations

Sets are a very useful tool if you want to find a certain structure in your string. Let us assume that you want your input to start with a number. With our current knowledge we couldn't build a RegEx for that. A set replaces a single character in your RegEx with a set of possible characters. So the RegEx /^[df]og$/ matches the following two strings: 'dog' and 'fog'. If you want to check for a 3-digit number you can use the following RegEx: /^[0123456789][0123456789][0123456789]$/. You may have noticed that this notation gets a bit long if the set of possible characters is large. You can shorten the definition of your sets with the following syntax:

  • a-z stands for the letters from a to z
  • A-Z stands for the letters from A to Z
  • 0-9 stands for the digits from 0 to 9
  • you can of course use different start and end characters, like 4-8 for the digits from 4 to 8

So if you want to match a 3-digit number you can just write /^[0-9][0-9][0-9]$/. You can concatenate different ranges inside one set like this: /^[1-36-8]$/ accepts any single digit from 1 to 3 or from 6 to 8. We have excluded 0, 4, 5 and 9. You can of course mix numbers and letters. A simple check if a filename starts with a letter or a number could look like this: /^[a-zA-Z0-9]/. Notice that I dropped the dollar sign to just look at the start of the string?
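The three set examples from this section can be tested with the following Python sketch. The helper function matches and the sample file names are hypothetical and only there for illustration.

```python
import re

def matches(pattern, text):
    """Return True if the pattern is found in the text."""
    return re.search(pattern, text) is not None

three_digits = r"^[0-9][0-9][0-9]$"
digit_1_to_3_or_6_to_8 = r"^[1-36-8]$"
starts_with_letter_or_digit = r"^[a-zA-Z0-9]"

print(matches(three_digits, "042"), matches(three_digits, "42a"))

# Check every single digit against the concatenated ranges.
allowed = [d for d in "0123456789" if matches(digit_1_to_3_or_6_to_8, d)]
print(allowed)

# Hypothetical file names: the second one starts with an underscore.
print(matches(starts_with_letter_or_digit, "report.txt"),
      matches(starts_with_letter_or_digit, "_draft.txt"))
```

Because the last pattern has no dollar sign, it only inspects the first character and ignores the rest of the file name.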

An alternation is represented by the pipe symbol '|'. The RegEx matches either the expression to the left or the expression to the right of the pipe; inside a group, the choice is limited to that group. A simple example would be /^(d|f)og$/. This RegEx matches the strings 'dog' and 'fog', too. You see: with regular expressions there are often multiple ways to get to the same goal.
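The following Python sketch compares the alternation with the equivalent set. The last pattern, /^(cat|dog)s$/, is my own example and not part of the text above: it shows that the alternatives of an alternation can be longer than a single character, which a set cannot express.

```python
import re

words = ["dog", "fog", "log"]

# Alternation and set accept exactly the same words here.
with_alternation = [w for w in words if re.search(r"^(d|f)og$", w)]
with_set = [w for w in words if re.search(r"^[df]og$", w)]
print(with_alternation, with_set)

# Hypothetical example: alternatives that are whole words.
pets = [w for w in ["cats", "dogs", "cows"] if re.search(r"^(cat|dog)s$", w)]
print(pets)
```

For single characters a set is usually the shorter notation; the alternation becomes necessary once the alternatives are longer tokens.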

Repetitions and length

Having to write [0-9] three times in our RegEx still seems a bit long. RegEx therefore offer the possibility to repeat the previous token, such as a character, a group or a set. There are four major techniques or special characters for that:

  • curly braces accept two integers separated by a comma. These are the min and the max number of repetitions of the previous token. You can leave the max value blank to represent infinity, or enter a single number if the number of repetitions is fixed.
  • ? is a shorthand for {0,1} - the token either exists or not
  • + is a shorthand for {1,} - the token can be repeated from 1 to infinity times
  • * is a shorthand for {0,} - the token can appear between zero and infinity times

So in our 3-digit example we can reduce the RegEx to /^[0-9]{3}$/.
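To see the repetition operators side by side, here is a minimal Python sketch. The helper accepted and the candidate strings are hypothetical; the pattern /^dogs?$/ is my own small example for the question mark.

```python
import re

def accepted(pattern, candidates):
    """Return the candidates in which the pattern is found."""
    return [c for c in candidates if re.search(pattern, c)]

# Hypothetical inputs, including the empty string.
numbers = ["", "7", "42", "123", "2024", "12345"]

exactly_three = accepted(r"^[0-9]{3}$", numbers)   # fixed number of repetitions
two_to_four = accepted(r"^[0-9]{2,4}$", numbers)   # min and max
one_or_more = accepted(r"^[0-9]+$", numbers)       # same as {1,}
zero_or_more = accepted(r"^[0-9]*$", numbers)      # same as {0,}
optional_s = accepted(r"^dogs?$", ["dog", "dogs", "dogss"])  # same as s{0,1}

for result in (exactly_three, two_to_four, one_or_more, zero_or_more, optional_s):
    print(result)
```

Note that only the * variant accepts the empty string, because it is the only one of the number patterns that allows zero repetitions.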

Examples (Groups, Sets, Repetitions)

In this chapter I will present some strings and the corresponding regular expressions that match these strings (or parts of them). Please make sure that you understand how the regular expression works in each case before you proceed.

  • String: 'My name is Peter. I was always named Peter because something like that rarely changes.'
    RegEx: /Peter/
    Comment: Find your name in a string. Both occurrences of 'Peter' match.
  • String: 'My name is John.'
    RegEx: /[Mm]y name is /
    Comment: Find the places in a string where someone introduces himself.
  • String: 'RstartsWithRandEndsWithG'
    RegEx: /^R.*G$/
    Comment: Make sure that a string starts and ends with a certain character.
  • String: '12345 abcde ABCDE 123aa aa123 123AA AA123 aaaAA AAaaa'
    RegEx: /([0-9]{5}|[a-z]{5})|([A-Z]{5})/
    Comment: Find strings of length 5 that contain only digits, only lowercase letters or only uppercase letters. Here '12345', 'abcde' and 'ABCDE' match.
  • String: 'hhllcks@gmail.com'
    RegEx: /^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$/
    Comment: A handy RegEx for e-mail addresses.
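The examples above can be replayed with a short Python sketch. Python and the re module are my choice for illustration, the counter-example 'not-an-email' is made up, and the introduction pattern is written with a lowercase 'name' so that it matches the sample string.

```python
import re

text = ("My name is Peter. I was always named Peter "
        "because something like that rarely changes.")
names = re.findall(r"Peter", text)

introduction = re.search(r"[Mm]y name is ", "My name is John.")

starts_and_ends = re.search(r"^R.*G$", "RstartsWithRandEndsWithG")

tokens = "12345 abcde ABCDE 123aa aa123 123AA AA123 aaaAA AAaaa"
# finditer is used instead of findall because the pattern contains groups
# and findall would return one tuple per match instead of the matched text.
five_of_a_kind = [
    m.group()
    for m in re.finditer(r"([0-9]{5}|[a-z]{5})|([A-Z]{5})", tokens)
]

email = re.compile(r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$")
valid = email.search("hhllcks@gmail.com") is not None
invalid = email.search("not-an-email") is not None  # hypothetical counter-example

print(names)
print(introduction.group())
print(starts_and_ends is not None)
print(five_of_a_kind)
print(valid, invalid)
```

The mixed tokens such as '123aa' are skipped because none of the three alternatives finds five characters of the same kind in a row.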

Summary

In this first part of my series on regular expressions you learned how to create simple expressions to match patterns like phone numbers or emails. Here is a list of symbols that I introduced with a short description.

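  • any character that is not a special character matches itself (special characters have to be escaped with a backslash, like \.)
  • the dot matches any character
  • ^ and $ mark the beginning and the end of the string
  • groups are created with parentheses
  • sets are created with brackets
  • an alternation is created with the pipe symbol '|'
  • repetitions are created with curly braces or the shorthands ?, + and *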

Character                                       Description
any character that is not a special character   Matches this character
.                                               Matches any character
^                                               Beginning of the string
$                                               End of the string
( )                                             Group
[ ]                                             Set
|                                               Alternation
{x,y}                                           Repeat previous token at least x times and at most y times
*                                               Repeat previous token at least 0 times, with no upper limit
+                                               Repeat previous token at least 1 time, with no upper limit
?                                               Repeat previous token at least 0 times and at most 1 time

In the next part of this blog series I will introduce more complex features of regular expressions. In the meantime you can test your new RegEx skills with a RegEx tester like www.regexr.com.

If you have any comments on how to improve this tutorial just comment below or tweet at me.