Regexes: The Introductory Lesson

Ever heard of the term regex? What about grep? Both come back to the same basic idea: Regular Expressions. (Side note: grep is a Unix tool that searches text using Regular Expressions.)

A Regular Expression (or regex) is a sequence of characters representing a text pattern. At its simplest, abc is a regex. When applied to the text

abcdef

it matches the first three characters, abc, just like a normal search would in your text editor of choice. That’s because we’re only dealing with literal, or non-special, characters. The special characters, however, are what make regexes so powerful.
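If you’d rather try this in code than in an online tester, here is a minimal sketch using Python’s built-in re module. The choice of Python and the variable names are mine, not part of the original walkthrough.

```python
import re

# A regex made only of literal characters behaves like a plain substring search.
match = re.search(r"abc", "abcdef")

matched_text = match.group()  # the text that matched
matched_span = match.span()   # (start, end) offsets of the match

print(matched_text)  # abc
print(matched_span)  # (0, 3)
```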

Special characters hold meaning outside of just the character itself. For example, the dot character represents a wildcard instead of a period, and the ? means the character before it is optional. There are a bunch of special characters, and they dramatically expand your ability to search through text. Let’s look at some examples…

The Optional (?)

The question mark ? is our first special character (or metacharacter), which is a character that has special meaning and doesn’t match the character it normally would. More specifically, it is a repetition operator, which comes after a literal character or another (non-repetitive) special character, and represents the number of times that character should exist at that point in the pattern.

The ? means that the character it is applied to (that is, the character it follows) is optional. Let’s look at an example (follow along on RegExr here).

Say we want to find all the usages of the word “color”, but we know it can also be spelled “colour”. Rather than looking for each version explicitly, we can use the ? to specify that the “u” can be in the matching text, but doesn’t have to be. Here’s our resulting regex, followed by the text we’ll run it against:

colou?r

color

Color

colour

colors

colouring

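The regex matches “color” and “colour” wherever they appear, including at the start of “colors” and “colouring”. Notice that we didn’t match on “Color”, because regexes are case-sensitive by default. We also didn’t match on any text that followed the words “color” or “colour” (the “s” and the “ing”), as they weren’t part of the regex. The regex evaluation only matches on the text defined by the pattern.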
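Here is the same experiment as a short Python sketch. The re.IGNORECASE flag in the second call is my addition, just to show one way around the case-sensitive default; other tools expose that option differently.

```python
import re

sample = "color\nColor\ncolour\ncolors\ncolouring"

# "u?" makes the "u" optional, so both spellings match.
case_sensitive = re.findall(r"colou?r", sample)

# With re.IGNORECASE, the capitalized "Color" is picked up as well.
case_insensitive = re.findall(r"colou?r", sample, flags=re.IGNORECASE)

print(case_sensitive)    # ['color', 'colour', 'color', 'colour']
print(case_insensitive)  # ['color', 'Color', 'colour', 'color', 'colour']
```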

The At Least Once (+)

The + repetition operator is a lot like the ? operator described above. It specifies that the character before it has to appear at least once, but can repeat any number of times.

Suppose we love emphasis, but we can’t remember how many times we repeated the letters in the word “really”. We can use the following regex (RegExr link) to find our many instances of our emphatic personality.

r+e+a+l+l+y+

I rrrreaaaalllyyyy don’t like the word really, but mainly because it’s used reallllyyyy obnoxiously. It reallyyyyyyy shouldn’t be used. For real, it’s a pretty boring word. Really.

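Again, notice that the regex is case-sensitive by default, so it doesn’t match on “Really”. It also skips “real”, since the pattern requires at least two l’s followed by a y. Aside from that, it found all of our elongated instances of “really” (“rrrreaaaalllyyyy”, “reallllyyyy”, and “reallyyyyyyy”), along with the plain “really”.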
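To check this outside of RegExr, the sketch below runs the same pattern over the same sentence with Python’s re.findall. The only liberty taken is using plain ASCII apostrophes in the sample text.

```python
import re

text = (
    "I rrrreaaaalllyyyy don't like the word really, but mainly because "
    "it's used reallllyyyy obnoxiously. It reallyyyyyyy shouldn't be used. "
    "For real, it's a pretty boring word. Really."
)

# Each "+" requires its letter at least once, so "real" (one l, no y) is skipped.
matches = re.findall(r"r+e+a+l+l+y+", text)

print(matches)
# ['rrrreaaaalllyyyy', 'really', 'reallllyyyy', 'reallyyyyyyy']
```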

The Wildcard (.)

The dot is another special character, except this one isn’t a repetition operator; it doesn’t modify a previous character. Instead, it just represents any single character (in most engines, anything except a line break).

Let’s look at an example (RegExr link):

a.

The dot operator allows you to match on any character. When combined with the literal “a”, you can find any two-character combination that starts with the letter a.

It even matched on the literal “.” character (since in the text, it acts just like any other character literal).
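A quick sketch of the wildcard in Python, using a made-up sample sentence of my own. The second call shows one default worth knowing about in Python’s re module: the dot does not match a newline unless you ask for it.

```python
import re

# Hypothetical sample text, chosen so that "a" is followed by a letter,
# a space, and a literal period.
pairs = re.findall(r"a.", "an apple, a cat, a.")
print(pairs)  # ['an', 'ap', 'a ', 'at', 'a.']

# By default, "." in Python's re module does not match a newline.
across_newline = re.findall(r"a.", "a\nb")
print(across_newline)  # []
```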

The I Really Don’t Care (*)

The * is the most lenient repetition operator. It specifies that the preceding character can happen any number of times, including zero.

Let’s say we’re looking for a specific house number in a set of addresses. We can remember the first and last digit, but can’t remember what (or how many digits) were in-between. And we’ve of course forgotten the street, city, state, and any other useful information, because that’s just the kind of day we’re having.

We’ll create a regex (RegExr link) using both * and . to find our forgotten house number, and then run it against our list of addresses:

1.*5

890 Fifth Ave, New York, NY 10065

1 Columbus Circle, New York, NY 10019

175 Fifth Ave, New York, NY 10010

157 Seventh Ave, New York, NY 10015

Pretend there are a lot more addresses, which would make it a lot harder to find the address in question. We matched on a zip code (10065) that fits the same pattern, but we expected to sift through some false positives. 175 Fifth Ave happened to be the address we were looking for, and once we saw that match the recognition sparked and our problem was solved. We also matched on an entire address (the 157 Seventh Ave line), since our . allowed for any kind of character and the * stretched the match as far as it could. We’ll explore how to limit this a little more in our next example.
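Here is a small Python sketch of that search, run line by line over the four addresses above. The loop and variable names are just one way to set it up.

```python
import re

addresses = [
    "890 Fifth Ave, New York, NY 10065",
    "1 Columbus Circle, New York, NY 10019",
    "175 Fifth Ave, New York, NY 10010",
    "157 Seventh Ave, New York, NY 10015",
]

pattern = re.compile(r"1.*5")

results = []
for address in addresses:
    match = pattern.search(address)
    # Store the matched text, or None when the line has no match.
    results.append(match.group() if match else None)

for found in results:
    print(found)
# 10065
# None
# 175
# 157 Seventh Ave, New York, NY 10015
```

The last line is the greedy behavior in action: the match starts at the first 1 and runs all the way to the last 5 on the line.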

The Digit Wildcard (\d)

The \d is the digit-specific version of a wildcard; it only matches on the digits 0-9. Note that the \d is still considered a single special character. That’s because, unlike the previous special characters, “d” tends to be used more as a literal character than a special one. The “\” lets the regex engine know to treat the character that follows in a special, abnormal way.

Anyway, the \d makes the previous example match on a little less, which makes it easier to find what we’re looking for (RegExr link).

1\d*5

890 Fifth Ave, New York, NY 10065

1 Columbus Circle, New York, NY 10019

175 Fifth Ave, New York, NY 10010

157 Seventh Ave, New York, NY 10015

This time around, we didn’t match on a whole address (which was obviously not what we were looking for). Instead, our matches are limited to just number sequences starting with 1 and ending with 5, like we originally planned.
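The same check in Python, this time collecting every match on each line with re.findall. I’ve assumed the same four addresses as in the previous example.

```python
import re

addresses = [
    "890 Fifth Ave, New York, NY 10065",
    "1 Columbus Circle, New York, NY 10019",
    "175 Fifth Ave, New York, NY 10010",
    "157 Seventh Ave, New York, NY 10015",
]

# "\d*" only consumes digits, so a match can no longer run across words.
digit_matches = [re.findall(r"1\d*5", address) for address in addresses]

for found in digit_matches:
    print(found)
# ['10065']
# []
# ['175']
# ['15', '10015']
```

One thing this makes visible: the last address yields “15” from the front of “157”, because a regex is happy to match part of a longer number unless the pattern says otherwise.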

Continue Your Learning

So that’s your introduction to regexes. They can be extremely powerful text searching tools, but as the old adage goes, “With great power comes great responsibility.” Using a regex as part of a program can lead to some unexpected results.

Just like code, the more complicated a regex is, the harder it is to test. Take for example the regex found here. It’s an attempt at a regex to validate phone numbers. It obviously has some kinks to work out, but it makes you question whether you’ll ever cover every edge case.

There’s also the matter of performance. Certain special characters can combine in ways that force the regex engine to try hundreds of thousands of possible matches, if not more. Make sure you understand the implications of any regexes you write in production-level code, because you can definitely get burned if you’re not careful.
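To make the performance warning a bit more concrete, here is a rough sketch. The pattern (a+)+b and the helper name time_failed_match are my own illustrations and do not come from the regex linked above. The idea is that nested repetition, followed by a character that never shows up, leaves a backtracking engine like Python’s re with a huge number of ways to split up the input before it can give up.

```python
import re
import time

# Hypothetical pattern for illustration: a repetition inside a repetition,
# followed by a "b" that the input never contains.
pattern = re.compile(r"(a+)+b")


def time_failed_match(length):
    """Try the pattern against a run of "a"s and report how long failing took."""
    text = "a" * length
    start = time.perf_counter()
    result = pattern.match(text)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Small lengths only; each extra character can noticeably increase the work.
for length in (10, 14, 18):
    result, elapsed = time_failed_match(length)
    print(length, result, f"{elapsed:.4f}s")
```

The exact timings depend on your machine and Python version, so treat the numbers as a trend rather than a benchmark.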

Don’t let any of my cautions scare you off, though. Regexes can be extremely useful, and I’ve only just scratched the surface. If you’d like to learn more, see the Additional Resources below.

Happy regexing!

Additional Resources

Regular-Expressions.info – All the information you could ever want about Regular Expressions (including a quickstart page)

RegExr – Online tester for regexes

Regex Tuesday Challenges – For pushing your regex skills to their limits

 
