How Do You Actually Use Regex?

Regex, short for regular expression, is often used in programming languages to match patterns in strings, find and replace text, validate input, and reformat text. Learning to use Regex correctly can make working with text much easier.

Syntax of regex, explained

Regex has a reputation for horrible syntax, but it is much easier to write than to read. For example, here is a general Regex for an RFC 5322 compliant email validator:

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

If it looks like someone smashed their face against the keyboard, you are not alone in thinking so. But under the hood, all this mess is actually programming a finite state machine. The machine runs over the input one character at a time, advancing and matching according to the rules you have defined. Many online tools will render railroad diagrams that show how your Regex machine works. Here is this same Regex in visual form:

It’s still very confusing, but it’s much more understandable. It’s a machine with moving parts that have rules defining how everything fits together. You can see how someone put it together; it’s not just a big glob of text.

First: use a Regex debugger

Before you start, unless your Regex is particularly short or you are particularly proficient, you should use an online debugger when writing and testing it. This makes the syntax much easier to understand. We recommend Regex101 and RegExr, both of which offer testing and a built-in syntax reference.

How does Regex work?

For now, let’s focus on something much simpler. This is a diagram from Regulex for a very short (and definitely not RFC 5322 compliant) email Regex:

The Regex engine starts on the left and travels along the lines, matching characters as it goes. Group #1 matches any character except a line break, and it keeps matching characters until the next block finds a match. In this case, it stops when it reaches an @ symbol, which means that group #1 captures the name part of the email address and everything after the @ matches the domain.

The Regex that defines group #1 in our email example is:

(.+)

The parentheses define a capture group, which instructs the Regex engine to include the content of that group’s match in a special variable. When you execute a regular expression on a string, the default return is the entire match (in this case, the entire email). But it also returns each capture group, which makes this Regex useful for extracting names from emails.

The period is the symbol for “any character except a newline”. It matches anything on a single line, so if you passed this email Regex an address like:

%$#^&%*#%$#^@gmail.com

It would match %$#^&%*#%$#^ as the name, even though that is ridiculous.

The plus symbol (+) is a control structure which means “match the previous character or group one or more times”. It ensures that the whole name matches, not just the first character. This is what creates the loop found on the railroad diagram.

The rest of the Regex is fairly simple to decipher:

(.+)@(.+\..+)

The first group stops when it reaches the @ symbol. The next group then starts, which again matches one or more characters until it reaches a period character.

Since characters such as periods, parentheses, and slashes are used as part of the syntax in Regex, whenever you want to match these characters literally, you must escape them with a backslash. In this example, to match the period we write \. and the parser treats it as a single symbol meaning “match a period”.
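As a rough illustration of how the whole match and the capture groups come back from the engine, here is a minimal sketch using Python’s built-in re module. The choice of Python and the names EMAIL and split_email are assumptions made for this example; the pattern itself is the short email Regex discussed above.

```python
import re

# The short (and deliberately naive) email pattern from the text.
EMAIL = re.compile(r"(.+)@(.+\..+)")

def split_email(address):
    """Return (whole match, name, domain), or None if there is no match."""
    match = EMAIL.fullmatch(address)
    if match is None:
        return None
    # group(0) is the entire match; group(1) and group(2) are the capture groups.
    return match.group(0), match.group(1), match.group(2)

print(split_email("jane.doe@example.com"))
print(split_email("%$#^&%*#%$#^@gmail.com"))
print(split_email("no-at-sign.example.com"))
```

Leaving the backslash off the period would turn it back into “any character”, so a domain with no period at all (such as a@bcom) would then be accepted.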

Character matching

If you have unescaped, ordinary characters in your Regex, the Regex engine treats them as literal text that forms a matching block. For example, the Regex:

he+llo

matches the word “hello” with one or more e’s. Any character that has a special meaning in Regex must be escaped if you want to match it literally.
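To make the difference between literal characters and special characters concrete, here is a small sketch in Python’s re module. The helper name matches and the sample strings are made up for illustration.

```python
import re

def matches(pattern, text):
    """True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# Ordinary characters match themselves; + repeats the "e" before it.
print(matches(r"he+llo", "hello"))
print(matches(r"he+llo", "heeeello"))
print(matches(r"he+llo", "hllo"))       # no "e" at all, so no match

# A special character such as + must be escaped to be matched literally.
print(matches(r"1+1=2", "1+1=2"))       # unescaped: + acts as a quantifier
print(matches(r"1\+1=2", "1+1=2"))      # escaped: + is a literal plus sign

# re.escape() adds the necessary backslashes for you.
print(matches(re.escape("1+1=2"), "1+1=2"))
```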

Regex also has character classes, which act as a shortcut for a set of characters. These may vary depending on the implementation of Regex, but these are standard:

. – matches any character except a newline.
\w – matches any “word” character, including digits and underscores.
\d – matches digits.
\s – matches whitespace characters (i.e., space, tab, newline).

The last three all have uppercase counterparts that invert their meaning. For example, \D matches anything that is not a digit.
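The shorthand classes are easiest to understand by trying them on a few strings. The following is a sketch using Python’s re module; the sample strings and variable names are invented for this example.

```python
import re

text = "Order 66 shipped on 2024-05-01"

# \d+ finds runs of digits.
digits = re.findall(r"\d+", text)

# \w+ finds runs of letters, digits, and underscores.
words = re.findall(r"\w+", "var_2 = x!")

# \s+ matches any run of whitespace, so it works as a field separator.
fields = re.split(r"\s+", "a b\tc\nd")

# \D is the inverse of \d: strip everything that is not a digit.
only_digits = re.sub(r"\D", "", "(555) 010-9999")

print(digits, words, fields, only_digits)
```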

Regex also has a character set match. For example:

[abc]

This matches a, b, or c. The set acts as a single block, and the square brackets are just control structures. You can also specify a range of characters:

[a-c]

Or negate the set, which will match any character not found in the set:

[^a-c]
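The three forms of character set can be compared side by side. This is a sketch in Python’s re module, with a made-up sample string.

```python
import re

sample = "abcabc-xyz-cab"

# [abc] matches a single a, b, or c.
any_of = re.findall(r"[abc]", "a1b2c3d4")

# [a-c]+ matches runs of characters in the range a through c.
in_range = re.findall(r"[a-c]+", sample)

# [^a-c]+ matches runs of characters that are NOT in that range.
negated = re.findall(r"[^a-c]+", sample)

print(any_of, in_range, negated)
```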

Quantifiers

Quantifiers are an important part of Regex. They let you match strings when you don’t know the exact format but have a pretty good idea of it.

The + operator in the email example is a quantifier, specifically the “one or more” quantifier. If we don’t know how long a certain string is, but we know that it is made up of alphanumeric characters (and is not empty), we can write:

\w+

In addition to +, there are also:

The * operator, which means “zero or more”. It is essentially identical to +, except that it also succeeds when there is nothing to match.
The ? operator, which means “zero or one”. This has the effect of making a character optional; either it is there or it is not, and it will not match more than once.
Numeric quantifiers. These can be a single number like {3}, which means “exactly 3 times”, or a range like {3,6}. You can leave out the second number to make the range unbounded. For example, {3,} means “3 or more times”. Oddly enough, many engines don’t let you omit the first number, so if you want “3 times or fewer”, you’ll have to write the range out in full, as in {0,3}.
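The quantifiers in this list can be checked against a handful of inputs. The sketch below uses Python’s re module; the helper matches and the test strings are assumptions made for the example.

```python
import re

def matches(pattern, text):
    """True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# + needs at least one "b"; * is happy with none.
plus_needs_one = matches(r"ab+", "a")
star_allows_zero = matches(r"ab*", "a")

# ? makes the "u" optional.
optional_u = matches(r"colou?r", "color") and matches(r"colou?r", "colour")

# {3} means exactly three digits.
exactly_three = matches(r"\d{3}", "123") and not matches(r"\d{3}", "1234")

# {3,6} accepts lengths 3 through 6; {3,} has no upper bound; {0,3} caps at 3.
three_to_six = [n for n in range(1, 9) if matches(r"\d{3,6}", "7" * n)]
three_or_more = matches(r"\d{3,}", "7" * 10)
at_most_three = [n for n in range(0, 6) if matches(r"\d{0,3}", "7" * n)]

print(three_to_six, at_most_three)
```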

Greedy and lazy quantifiers

Under the hood, the * and + operators are greedy. They match as much as possible and then give back only what is needed for the next block to match. This can be a huge problem.

Here’s an example: let’s say you’re trying to match HTML, or something else with closing brackets. Your input text is:

<div>Hello World</div>

And you want to match everything between the angle brackets. You might write something like:

<.*>

It’s the right idea, but it fails for one crucial reason: the Regex engine matches “div>Hello World</div” for the sequence .* and then backtracks until the next block matches, in this case a closing bracket (>). You would expect it to backtrack far enough to match only “div”, and then repeat to match the closing div. But the backtracker runs from the end of the string and stops at the final bracket, so the match ends up covering everything from the first bracket to the last.

The solution is to make our quantifier lazy, which means it will match as few characters as possible. Under the hood, it starts with the smallest possible match and then expands one character at a time until the next block matches, which can make it more efficient in large Regex operations.

Making a quantifier lazy is done by adding a question mark directly after the quantifier. It’s a little confusing because ? is already a quantifier (and is actually greedy by default). For our HTML example, the Regex is corrected with this simple addition:

<.*?>

The lazy operator can be tacked onto any quantifier, including +?, {0,3}?, and even ??. The last one is rarely needed: it matches zero or one character either way, and merely prefers zero when the rest of the pattern allows it.
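The difference between greedy and lazy matching shows up clearly when both versions are run on the same input. This is a sketch using Python’s re module on a small HTML-like string chosen for illustration.

```python
import re

html = "<div>Hello World</div>"

# Greedy: .* runs to the end of the line, then backs up to the LAST ">".
greedy = re.findall(r"<.*>", html)

# Lazy: .*? grows one character at a time and stops at the FIRST ">".
lazy = re.findall(r"<.*?>", html)

# ? is greedy by default (takes the "b" if it can); ?? prefers to take nothing.
greedy_optional = re.match(r"ab?", "ab").group()
lazy_optional = re.match(r"ab??", "ab").group()

print(greedy)
print(lazy)
print(greedy_optional, lazy_optional)
```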

Grouping and lookarounds

Groups in Regex serve many purposes. At a basic level, they combine several tokens into one block. For example, you can create a group and then apply a quantifier to the whole group:

ba(na)+

This groups the repeated “na” so the expression matches banana, bananana, and so on. Without the group, the quantifier would apply only to the final character.

This type of group, written with plain parentheses, is called a capture group, and its match is included in the output.

If you want to avoid this, and simply group the tokens for the sake of the pattern, you can use a non-capturing group:

ba(?:na)

The question mark (a reserved character) marks a non-standard group, and the character after it defines what type of group it is. Starting these groups with a question mark is ideal, because otherwise, if you wanted to match colons in a group, you would have to escape them for no good reason. Question marks, on the other hand, always have to be escaped in Regex anyway.
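Here is a short sketch, in Python’s re module, of how capturing and non-capturing groups differ in what they report back. The variable names are made up for this example, and a + quantifier is added to the non-capturing group so that both patterns match the same word.

```python
import re

# Capturing group: the quantifier repeats "na", and the group is reported.
capturing = re.fullmatch(r"ba(na)+", "banana")

# Non-capturing group: same matching behaviour, nothing extra in the output.
non_capturing = re.fullmatch(r"ba(?:na)+", "banana")

# Without any group, + applies only to the final "a".
no_group_banana = re.fullmatch(r"bana+", "banana")
no_group_banaaa = re.fullmatch(r"bana+", "banaaa")

print(capturing.group(0), capturing.groups())
print(non_capturing.group(0), non_capturing.groups())
```

Note that a repeated capture group keeps only the text from its last repetition.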

You can also name your groups, for convenience, when working with the output:

(?'group')

You can reference groups within your Regex, which makes them similar to variables. You can reference unnamed groups with the \1 token, but numbered references are limited in some engines and quickly become hard to keep track of, at which point you will want to start naming the groups. The syntax for referencing named groups is:

\k{group}

This refers to the result of the named group, which can be dynamic. Essentially, it checks whether the text the group matched occurs again, wherever that is. For example, this can be used to match all text between three identical words.
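Backreferences are easier to follow with a concrete run. The sketch below uses Python’s re module, which is an assumption of this example and spells things differently from the PCRE forms above: named groups are written (?P<name>...) and referenced with (?P=name). The sample sentences are invented.

```python
import re

# \1 refers back to whatever the first (unnamed) group matched.
doubled_letter = re.search(r"(\w)\1", "hello").group()

# Named version: find a word that is immediately repeated.
repeat = re.search(r"\b(?P<word>\w+)\s+(?P=word)\b", "this is is a test")
repeated_word = repeat.group("word")

# Everything from the first to the third occurrence of the same word.
three = re.search(
    r"\b(?P<w>\w+)\b.*?\b(?P=w)\b.*?\b(?P=w)\b",
    "go left, go right, go home now",
)

print(doubled_letter, repeated_word, three.group())
```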

Group syntax is where you will find most of Regex’s control structures, including lookaheads. A lookahead asserts that an expression must match but does not include it in the result. In a way, it is similar to an if statement: the overall match fails if the lookahead returns false.

The syntax for a positive lookahead is (?=). For example, a lookahead for the @ symbol can be used to match the name part of an email address cleanly, stopping at the dividing @. Lookaheads consume no characters. Therefore, if you want to keep matching after a successful lookahead, you can still match the characters used in the lookahead.

In addition to positive lookaheads, there are also:

(?!) – Negative lookaheads, which ensure that an expression does not match.
(?<=) – Positive lookbehinds, which are not supported everywhere due to certain technical constraints. These are placed before the expression you want to match, and in many engines they must have a fixed width (that is, no quantifiers except {number}). In the email example, you could use (?<=@)\w+\.\w+ to match the domain part of the email.
(?<!) – Negative lookbehinds, which are the same as positive lookbehinds, but negated.
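All four lookarounds can be tried out in a few lines. This is a sketch in Python’s re module; the pattern .+(?=@) and the sample strings are chosen here for illustration and are not taken from the original examples, while (?<=@)\w+\.\w+ is the lookbehind from the list above.

```python
import re

address = "alice@example.com"

# Positive lookahead: match the name, require (but do not include) the @.
name = re.search(r".+(?=@)", address).group()

# The lookahead consumed nothing, so the @ is still there to be matched.
name_and_at = re.search(r".+(?=@)@", address).group()

# Positive lookbehind: match the domain, require an @ just before it.
domain = re.search(r"(?<=@)\w+\.\w+", address).group()

# Negative lookahead: "foo" only when it is NOT followed by "bar".
foo_position = re.search(r"foo(?!bar)", "foobar foobaz").start()

# Negative lookbehind: whole numbers that are NOT preceded by a dollar sign.
plain_numbers = re.findall(r"(?<!\$)\b\d+", "cost $40 for 3 items")

print(name, name_and_at, domain, foo_position, plain_numbers)
```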

Differences between Regex engines

Not all Regex engines are created equal. Most do not follow any specific standard, and some change things a bit to suit their language. Features that work in one language may not work in another.

For example, the versions of sed compiled for macOS and FreeBSD do not support using \t to represent a tab character. You have to manually copy a tab character and paste it into the terminal to use a tab in a sed command.

Most of this tutorial is compatible with PCRE, the default Regex engine used by PHP. But JavaScript’s Regex engine is different: it doesn’t support named capture groups written with quotes (it wants angle brackets) and doesn’t support recursion, among other things. Even PCRE is not fully compatible across its own versions, and it has a lot of differences from Perl Regex.

There are too many minor differences to list here, so it is worth consulting a reference table that compares multiple Regex engines. In addition, Regex debuggers like Regex101 let you switch the Regex engine, so be sure to debug using the correct one.

How to run Regex

We have been discussing the matching portion of regular expressions, which makes up the bulk of a Regex. But when you actually want to run your Regex, you need to form it into a full regular expression.

This usually takes the format:

/match/g

Everything inside the slashes is our match. The g is a mode modifier. In this case, it tells the engine not to stop after finding the first match. For find and replace, you will often need to format it as follows:

/find/replace/g

This replaces every match throughout the file. You can use capture group references in the replacement, which makes Regex very good at reformatting text. For example, this regular expression will match all HTML tags and replace the standard angle brackets with square brackets:

/<(.+?)>/[\1]/g

When it runs, the engine will match <div> and </div>, allowing you to replace this text (and this text only). The inner HTML is not affected.
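The same kind of replacement can be sketched in Python’s re module, where re.sub replaces every match by default (similar to the g modifier) and \1 in the replacement refers to the first capture group. The input string and the tag-matching pattern with a lazy capture group are assumptions chosen for illustration.

```python
import re

html = "<div>Hello World</div>"

# Replace the angle brackets around every tag with square brackets.
all_tags = re.sub(r"<(.+?)>", r"[\1]", html)

# count=1 stops after the first match, like leaving off the g modifier.
first_only = re.sub(r"<(.+?)>", r"[\1]", html, count=1)

# Searching works the same way: one match versus every match.
first_match = re.search(r"<.+?>", html).group()
every_match = re.findall(r"<.+?>", html)

print(all_tags)
print(first_only)
```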

This makes Regex very useful for finding and replacing text. The command line utility for doing this is sed, which uses the basic format of:

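sed 's/find/replace/g' file > newfile

This runs on a file and writes the result to STDOUT. To actually replace the file on disk, redirect the output to a new file (as shown here) and then move it over the original; redirecting straight back onto the input file would empty it before sed had a chance to read it. Many versions of sed also offer an -i option that edits the file in place.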

Regex is also supported in many text editors and can really speed up your workflow during batch operations. Vim, Atom, and VS Code all have Regex find and replace built in.

Of course, Regex can also be used programmatically and is generally integrated into many languages. The exact implementation will depend on the language, so you will need to consult the documentation for your language.

For example, in JavaScript, a regular expression can be written as a literal or created dynamically using the global RegExp object:

var re = new RegExp('abc')

This can be used directly by calling the .exec() method of the newly created Regex object, or by using the .replace(), .match(), and .matchAll() methods on strings.
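Other languages follow a similar shape. As a point of comparison, here is a rough Python sketch of the same workflow using the standard re module: compile a pattern once, then search, replace, and iterate over matches. The pattern and the input strings are placeholders.

```python
import re

# Compile once and reuse, roughly like creating a RegExp object.
pattern = re.compile("abc")

# search() returns a match object (or None); span() gives its position.
found = pattern.search("xxabcxx")
span = found.span()

# sub() replaces every match; findall() and finditer() return all matches.
replaced = pattern.sub("-", "abc abc")
all_hits = pattern.findall("abcabc ab")
positions = [m.start() for m in pattern.finditer("abcabc ab")]

print(span, replaced, all_hits, positions)
```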