How to Use Regular Expressions (regexes) on Linux

Wondering what those strange strings of symbols do on Linux? They give you command-line magic! We will teach you how to cast regular expression spells and improve your command-line skills.

What are regular expressions?

Regular expressions (regexes) are a way to find matching character sequences. They use letters and symbols to define a pattern that is searched for in a file or stream. There are several different flavors of regex. We will look at the version used in common Linux utilities and commands, such as grep, the command that prints lines that match a search pattern.

Whole books have been written on regular expressions, so this tutorial is just an introduction. There are basic and extended regular expressions, and we will use the extended ones here.

To use extended regular expressions with grep, you must use the -E (extended) option. Because this gets tedious very quickly, the egrep command was created. The egrep command is the same as the grep -E combination; you just don’t have to type the -E option every time.

If you find it more convenient to use egrep, you can. However, be aware that it is officially obsolete. It is still present in all of the distributions we checked, but it may disappear in the future.

Of course, you can always create your own aliases, so your favorite options are always included for you.

Small beginnings

For our examples, we will use a text file containing a list of Geeks. Remember that you can use regular expressions with many Linux commands. We just use grep as a practical way to demonstrate them.

We type the following to look at the contents of the file:

less geeks.txt

The first part of the file is displayed.

Let’s start with a simple search pattern and look for occurrences of the letter “o” in the file. Again, since we use the -E (extended regex) option in all of our examples, we type the following:

grep -E 'o' geeks.txt

Each line containing the search pattern is displayed, and the matching letter is highlighted. We performed a simple search, without any constraints. It doesn’t matter whether the letter appears more than once, at the end of the string, twice in the same word, or even next to itself.

A couple of names contain a double “o”; we type the following to list only those:

grep -E 'oo' geeks.txt

As expected, our result set is much smaller and our search term is interpreted literally. It means nothing more than what we typed: double “o” characters.
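The two searches above can be tried on any small file. The following is a minimal sketch that assumes a made-up four-line sample file (the names are hypothetical and are not the contents of geeks.txt) and stores the results in shell variables so they are easy to inspect:

```shell
# Hypothetical sample data standing in for geeks.txt.
sample=$(mktemp)
printf '%s\n' 'Tom Cook' 'Jason Boyd' 'Amanda Lane' 'Glenn Hill' > "$sample"

# Count the lines that contain at least one "o".
single_o=$(grep -E -c 'o' "$sample")

# List the lines that contain two adjacent "o" characters.
double_o=$(grep -E 'oo' "$sample")

echo "lines with o: $single_o"   # lines with o: 2
echo "$double_o"                 # Tom Cook

rm -f "$sample"
```

Only “Cook” contains the literal sequence “oo”, so the second search returns a single line.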

We will see more functionality in our search patterns as we move forward.

Line numbers and other grep tips

If you want grep to list the line number of the matching entries, you can use the -n (line number) option. This is a grep feature, not part of the regex functionality. However, sometimes you may want to know where in a file the matching entries are located.

We type the following:

grep -E -n 'o' geeks.txt

Another handy grep option is -o (only matching). It displays only the matching character sequence, not the surrounding text. This can be useful if you need to quickly scan a list for duplicate matches on any of the lines.

To do this, we type the following:

grep -E -n -o 'o' geeks.txt

If you want to minimize the output, you can use the -c (count) option.

We type the following to see the number of lines in the file that contain matches:

grep -E -c 'o' geeks.txt
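To see how the three options change the output, here is a small sketch that again assumes a made-up sample file rather than the real geeks.txt:

```shell
# Hypothetical sample data; the names are invented for illustration.
sample=$(mktemp)
printf '%s\n' 'Tom Cook' 'Amanda Lane' 'Jason Boyd' > "$sample"

numbered=$(grep -E -n 'o' "$sample")    # prefix each matching line with its line number
only=$(grep -E -n -o 'oo' "$sample")    # print just the matched text, not the whole line
count=$(grep -E -c 'o' "$sample")       # count matching lines, not individual matches

echo "$numbered"   # 1:Tom Cook and 3:Jason Boyd, one per line
echo "$only"       # 1:oo
echo "$count"      # 2

rm -f "$sample"
```

Note that -c reports two lines even though the letter “o” occurs more often than that, because it counts lines rather than matches.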

The alternation operator

If you want to search for occurrences of both a double “l” and a double “o”, you can use the pipe character (|), which is the alternation operator. It looks for matches for either the search pattern to its left or the one to its right.

We type the following:

grep -E -n -o 'll|oo' geeks.txt

Any line containing a double “l”, a double “o”, or both appears in the results.
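As a quick check of the alternation operator, this sketch runs the same kind of search against a made-up three-line file (an assumption for illustration, not the article's data):

```shell
# Hypothetical sample data.
sample=$(mktemp)
printf '%s\n' 'Bill Moore' 'Tom Cook' 'Amanda Lane' > "$sample"

# -n adds the line number, -o prints each match on its own output line.
matches=$(grep -E -n -o 'll|oo' "$sample")

echo "$matches"
# 1:ll
# 1:oo
# 2:oo

rm -f "$sample"
```

The first line contains both sequences, so it is reported twice. Keep the pattern free of stray spaces: a space inside the quotes is part of the pattern and must be matched literally.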

Case sensitivity

You can also use the alternation operator to create search patterns, like this:

am|Am

This will match both “am” and “Am”. For anything other than trivial examples, this quickly leads to cumbersome search patterns. A simple way around this is to use the -i (ignore case) option with grep.

To do this, we type the following:

grep -E 'am' geeks.txt
grep -E -i 'am' geeks.txt

The first command produces three results with three highlighted matches. The second command produces four results because the “Am” in “Amanda” is also a match.
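The difference between the case-sensitive search, the case-insensitive search, and the alternation workaround can be sketched as follows. The sample names are made up, so the counts differ from those quoted for geeks.txt:

```shell
# Hypothetical sample data.
sample=$(mktemp)
printf '%s\n' 'Amanda Lane' 'Pamela Hart' 'Tom Cook' > "$sample"

sensitive=$(grep -E -c 'am' "$sample")        # only lowercase "am" (Pamela)
insensitive=$(grep -E -i -c 'am' "$sample")   # "am" or "Am" (Pamela, Amanda)
either=$(grep -E -c 'am|Am' "$sample")        # same lines, spelled out with alternation

echo "$sensitive $insensitive $either"        # 1 2 2

rm -f "$sample"
```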

Anchoring

We can also match the “Am” sequence in other ways. For example, we can search for that pattern specifically, or we can ignore case and specify that the sequence must appear at the start of a line.

When you match sequences that appear at a specific part of a line of characters or a word, it is called anchoring. You use the caret symbol (^) to indicate that the search pattern should only consider a character sequence a match if it appears at the start of a line.

We type the following (note that the caret is inside the single quotes):

grep -E 'Am' geeks.txt

grep -E -i '^am' geeks.txt

Both of these commands match “Am”.

Now, let’s look for lines that contain a double “n” at the end of a line.

We type the following, using a dollar sign ($) to represent the end of the line:

grep -E -i 'nn' geeks.txt
grep -E -i 'nn$' geeks.txt
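Both anchors can be tried with the following sketch, which assumes a made-up sample file in place of geeks.txt:

```shell
# Hypothetical sample data.
sample=$(mktemp)
printf '%s\n' 'Amanda Lane' 'Pamela Hart' 'Glenn Hill' 'Steve Flynn' > "$sample"

start=$(grep -E -i '^am' "$sample")        # "am" in any case, only at the start of a line
anywhere=$(grep -E -i -c 'nn' "$sample")   # "nn" anywhere in the line
end=$(grep -E -i 'nn$' "$sample")          # "nn" only at the end of a line

echo "$start"      # Amanda Lane
echo "$anywhere"   # 2
echo "$end"        # Steve Flynn

rm -f "$sample"
```

“Pamela” contains “am” but not at the start of the line, and “Glenn” contains “nn” but not at the end, so the anchored searches skip both.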

Wildcards

You can use a period (.) to represent any single character.

We type the following to find patterns that start with “T”, end with “m” and have only one character between them:

grep -E 'T.m' geeks.txt

The search pattern matched the “Tim” and “Tom” sequences. You can also repeat the periods to indicate a certain number of characters.

We type the following to indicate that we don’t care about the middle three characters:

grep -E 'J...n' geeks.txt

The line containing “Jason” is matched and displayed.

Use the asterisk (*) to match zero or more occurrences of the preceding character. In this example, the character that precedes the asterisk is the period (.), which (again) means any character.

This means that the asterisk (*) will match any number (including zero) of occurrences of any character.

The asterisk is sometimes confusing for regex newcomers. Perhaps this is because they usually use it as a wildcard that means “anything”.

In regular expressions, however, “c*t” does not match “cat”, “cot”, “coot”, etc. Rather, it translates to “match zero or more “c” characters, followed by a “t””. Thus, it matches “t”, “ct”, “cct”, “ccct”, or any number of “c” characters followed by a “t”.
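One way to check this behavior is the sketch below. It feeds a few made-up words to grep and uses the -x option, which only accepts a line when the whole line matches the pattern. Without -x, a line such as “cat” would still be printed, because it contains a “t” preceded by zero “c” characters.

```shell
# Made-up test words, one per line.
words=$(printf '%s\n' 't' 'ct' 'ccct' 'cat' 'coot')

# Zero or more "c" characters followed by "t", matched against the whole line.
star=$(printf '%s\n' "$words" | grep -E -x 'c*t')

# "c", then any number of any characters, then "t".
dot_star=$(printf '%s\n' "$words" | grep -E -x -c 'c.*t')

echo "$star"       # t, ct and ccct, one per line
echo "$dot_star"   # 4
```

The second pattern accepts “ct”, “ccct”, “cat” and “coot”, which is the behavior people usually expect from the asterisk alone.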

Because we know the format of the contents of our file, we can add a space as the last character in the search pattern. A space appears in our file only between the first name and the last name.

Thus, we type the following to force the search to include only the first names of the file:

grep -E 'J.*n' geeks.txt
grep -E 'J.*n ' geeks.txt

At first glance, the results of the first command seem to include some odd matches. However, they all follow the rules of the search pattern we used.

The sequence must begin with an uppercase “J”, followed by any number of characters, and then an “n”. Still, although all of the matches start with “J” and end with “n”, some of them are not what you might expect.

Because we added the space in the second search pattern, we got what we wanted: all first names that start with “J” and end with “n”.
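The effect of the trailing space is easier to see on a small file. This sketch assumes a made-up list of “first name, space, last name” lines, mirroring the layout described above:

```shell
# Hypothetical sample data in "first last" format.
sample=$(mktemp)
printf '%s\n' 'Tim Jones' 'Tom Cook' 'Jason Boyd' 'Julian Reed' 'Jill Martin' > "$sample"

one_char=$(grep -E -c 'T.m' "$sample")   # "T", any one character, "m"
three=$(grep -E 'J...n' "$sample")       # "J", any three characters, "n"
loose=$(grep -E -c 'J.*n' "$sample")     # "J", anything, "n" somewhere later in the line
strict=$(grep -E -c 'J.*n ' "$sample")   # as above, but the "n" must be followed by a space

echo "$one_char"   # 2
echo "$three"      # Jason Boyd
echo "$loose"      # 4
echo "$strict"     # 2

rm -f "$sample"
```

Without the space, “Tim Jones” and “Jill Martin” also match, because a “J” is followed by an “n” somewhere later in the line. With the space, only “Jason” and “Julian” remain.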

Character classes

Suppose we want to find all lines starting with an uppercase “N” or “W”.

If we use the following command, it matches any line with a sequence that begins with either an uppercase “N” or “W”, no matter where it appears in the line:

grep -E 'N|W' geeks.txt

This is not what we want. If we apply the line start anchor (^) to the start of the search pattern, as shown below, we get the same set of results, but for a different reason:

grep -E '^N|W' geeks.txt

The search matches lines that contain a capital “W” anywhere in the line. It also matches the line that starts with a capital “N”. The line start anchor (^) is applied only to the capital “N”.

We could also add a line start anchor to the capital “W”, but that would soon become inefficient in a search pattern more complicated than our simple example.

The solution is to enclose part of our search pattern in brackets ([]) and apply the anchor operator to the group. The brackets ([]) mean “any character from this list”. This means that we can omit the alternation operator (|) because we do not need it.

We can apply the line start anchor to all the elements of the list in square brackets ([]). (Note that the line start anchor is outside the square brackets).

We type the following to find a line starting with an uppercase “N” or “W”:

grep -E '^[NW]' geeks.txt

We will also use these concepts in the next set of commands.

We type the following to find a person named Tom or Tim:

grep -E 'T[oi]m' geeks.txt

If the caret (^) is the first character inside the brackets ([]), the search pattern looks for any character that does not appear in the list.

For example, we type the following to find any name that begins with “T”, ends with “m” and in which the middle letter is not “o”:

grep -E 'T[^o]m' geeks.txt

We can include any number of characters in the list. We type the following to find names that start with “T”, end with “m” and contain a vowel in the middle:

grep -E 'T[aeiou]m' geeks.txt
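The character class searches in this section can be compared side by side with the following sketch, which assumes a made-up sample file:

```shell
# Hypothetical sample data.
sample=$(mktemp)
printf '%s\n' 'Nadia Wells' 'Wendy Park' 'Tim Jones' 'Tom Cook' 'Tam Noble' > "$sample"

anywhere=$(grep -E -c 'N|W' "$sample")      # "N" or "W" anywhere in the line
leading=$(grep -E -c '^[NW]' "$sample")     # line starts with "N" or "W"
tom_tim=$(grep -E -c 'T[oi]m' "$sample")    # Tom or Tim
not_o=$(grep -E 'T[^o]m' "$sample")         # middle character is anything except "o"
vowel=$(grep -E -c 'T[aeiou]m' "$sample")   # middle character is a lowercase vowel

echo "$anywhere $leading $tom_tim $vowel"   # 3 2 2 3
echo "$not_o"                               # Tim Jones and Tam Noble, one per line

rm -f "$sample"
```

“Tam Noble” is counted by the unanchored search because of the “N” in the last name, but not by the anchored character class.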

Interval expressions

You can use interval expressions to specify the number of times the previous character or group is to be found in the corresponding string. You put the number in braces ({}).

A number on its own means exactly that number, but if you follow it with a comma (,), it means that number or more. If you separate two numbers with a comma (1,2), it means the range of numbers from the smallest to the largest.

We want to search for names that start with “T”, are followed by at least one, but not more than two, consecutive vowels and end with “m”.

So we type this command:

grep -E 'T[aeiou]{1,2}m' geeks.txt

This matches “Tim”, “Tom” and “Team”.

If we want to search for the sequence “el”, we type this:

grep -E 'el' geeks.txt

We add a second “l” to the search pattern to include only sequences that contain a double “l”:

grep -E 'ell' geeks.txt

This is equivalent to this command:

grep -E 'el{2}' geeks.txt

If we provide a range of “at least one and no more than two” occurrences of “l”, this will match both the “el” and “ell” sequences.

This is subtly different from the results of the first of these four commands, in which all the matches were for “el” sequences, including those inside “ell” sequences (where only one “l” is highlighted).

We type the following:

grep -E 'el{1,2}' geeks.txt

To find all sequences of two or more vowels, we type this command:

grep -E '[aeiou]{2,}' geeks.txt
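The subtle difference between 'el' and 'el{1,2}' shows up most clearly with the -o option, which prints only the matched text. This is a sketch on made-up data (the names, including the deliberately odd “Tooom”, are assumptions for illustration):

```shell
# Hypothetical sample data.
sample=$(mktemp)
printf '%s\n' 'Tim Jones' 'Team Lead' 'Tooom Test' 'Kelly Bell' 'Helen Reed' > "$sample"

one_or_two=$(grep -E -c 'T[aeiou]{1,2}m' "$sample")   # one or two vowels between "T" and "m"
exactly_two=$(grep -E 'el{2}' "$sample")              # same as 'ell'
plain=$(grep -E -o 'el' "$sample")                    # always matches just "el"
ranged=$(grep -E -o 'el{1,2}' "$sample")              # matches "ell" where it can, else "el"
vowel_runs=$(grep -E -c '[aeiou]{2,}' "$sample")      # lines with two or more vowels in a row

echo "$one_or_two"    # 2 ("Tooom" has three vowels, so it is skipped)
echo "$exactly_two"   # Kelly Bell
echo "$plain"         # el, el, el (one per line)
echo "$ranged"        # ell, ell, el (one per line)
echo "$vowel_runs"    # 3

rm -f "$sample"
```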

Escaping characters

Let’s say we want to find lines in which a period (.) is the last character. We know that the dollar sign ($) is the end of line anchor, so we might type this:

grep -E '.$' geeks.txt

However, we do not get what we expected.

As we saw earlier, the period (.) matches any single character. Since every line ends with a character, every line was returned in the results.

So how do you prevent a special character from performing its regular expression function when you just want to search for that actual character? To do this, you use a backslash (\) to escape the character.

One of the reasons we use the -E (extended) option is that it requires a lot less escaping than basic regular expressions do.

We type the following:

grep -E '\.$' geeks.txt

This matches the actual period character (.) at the end of a line.
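The difference between the unescaped and the escaped period can be checked with this sketch, which assumes a made-up three-line file in which only one line ends with a period:

```shell
# Hypothetical sample data.
sample=$(mktemp)
printf '%s\n' 'Tom Cook' 'Ends with a period.' 'Amanda Lane' > "$sample"

any_last=$(grep -E -c '.$' "$sample")    # any character before the end of the line
literal=$(grep -E '\.$' "$sample")       # a real period before the end of the line
bracketed=$(grep -E -c '[.]$' "$sample") # inside brackets the period is also literal

echo "$any_last"    # 3
echo "$literal"     # Ends with a period.
echo "$bracketed"   # 1

rm -f "$sample"
```

Putting the period in a character class is an alternative to the backslash that some people find easier to read.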

Anchoring and words

We have covered the start (^) and end of line ($) anchors above. However, you can use other anchors to operate on word boundaries.

In this context, a word is a sequence of characters bounded by whitespace (or by the start or end of a line). So “psy66oh” would count as a word, even though you will not find it in a dictionary. (Strictly speaking, GNU grep treats letters, digits, and the underscore as word characters, so punctuation also ends a word.)

The start of word anchor is (\<); notice that it points to the left, toward the start of the word. Suppose a name was typed in all lowercase by mistake. We can use the grep -i option to perform a case-insensitive search for names that start with “h”.

We type the following:

grep -E -i 'h' geeks.txt

This finds all occurrences of “h”, not just those at the start of words.

grep -E -i '\<h' geeks.txt

This finds only those at the start of words.

Let’s do something similar with the letter “y”; we only want to see the cases in which it is at the end of a word. We type the following:

grep -E 'y' geeks.txt

This finds all occurrences of “y”, wherever it appears in words.

Now we type the following, using the end of word anchor (\>) (which points to the right, toward the end of the word):

grep -E 'y\>' geeks.txt

The second command produces the desired result.

To create a search pattern that searches for an entire word, you can use the boundary operator (\b). We will use the non-boundary operator (\B) at both ends of the search pattern to find a sequence of characters that must be inside a larger word:

grep -E '\bGlenn\b' geeks.txt
grep -E '\Bway\B' geeks.txt
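The word anchors can be exercised with the sketch below. It assumes GNU grep, because \<, \>, \b and \B are extensions that other grep implementations may not support, and it uses a made-up sample file:

```shell
# Hypothetical sample data.
sample=$(mktemp)
printf '%s\n' 'Harry Stone' 'Stephen Ray' 'Glenn Hill' \
    'Glennis Moore' 'Stowaways Inc' 'Conway Lane' > "$sample"

any_h=$(grep -E -i -c 'h' "$sample")        # "h" anywhere
start_h=$(grep -E -i -c '\<h' "$sample")    # "h" only at the start of a word
any_y=$(grep -E -c 'y' "$sample")           # "y" anywhere
end_y=$(grep -E -c 'y\>' "$sample")         # "y" only at the end of a word
whole=$(grep -E '\bGlenn\b' "$sample")      # "Glenn" as a complete word
inside=$(grep -E '\Bway\B' "$sample")       # "way" with word characters on both sides

echo "$any_h $start_h"   # 3 2
echo "$any_y $end_y"     # 4 3
echo "$whole"            # Glenn Hill
echo "$inside"           # Stowaways Inc

rm -f "$sample"
```

“Glennis” is rejected by \bGlenn\b because the name continues after “Glenn”, and “Conway” is rejected by \Bway\B because “way” sits at the end of the word.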

More character classes

You can use shortcuts to specify lists in character classes. These range indicators save you from having to type each member of a list into the search pattern.

You can use all of the following:

A-Z: All capital letters from “A” to “Z”.
a-z: All lowercase letters from “a” to “z”.
0-9: All digits from zero to nine.
d-p: All lowercase letters from “d” to “p”. This free-format style allows you to define your own range.
2-7: All digits from two to seven.

You can also use as many character classes as you want in a search pattern. The following search pattern matches sequences that start with “J”, followed by an “o” or “s”, and then an “e”, “h”, “l” or “s”:

grep -E 'J[os][ehls]' geeks.txt

In our next command, we will use the range specifier a-z.

Our search command breaks down this way:

H: The sequence must start with “H.”
[a-z]: The next character can be any lowercase letter in this range.
*: The asterisk here means zero or more lowercase letters.
man: The sequence must end with “man”.

We put it all together in the following command:

grep -E 'H[a-z]*man' geeks.txt
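Both patterns from this section can be tested with the following sketch. The sample names are made up, and LC_ALL=C is set as a precaution, on the assumption that some locales sort uppercase and lowercase letters together and could make a range such as a-z behave unexpectedly:

```shell
# Hypothetical sample data; "HMan" is deliberately miscapitalized.
sample=$(mktemp)
printf '%s\n' 'Joe Hoffman' 'Josh Hallman' 'Jill Herman' 'Jane HMan' > "$sample"

two_classes=$(LC_ALL=C grep -E -c 'J[os][ehls]' "$sample")   # Joe, Josh
h_man=$(LC_ALL=C grep -E -o 'H[a-z]*man' "$sample")          # print only the matched names

echo "$two_classes"   # 2
echo "$h_man"         # Hoffman, Hallman and Herman, one per line

rm -f "$sample"
```

“HMan” is not matched because the pattern requires a lowercase “man” and the character class only allows lowercase letters after the “H”.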

Nothing is impenetrable

Some regexes can quickly become difficult to analyze visually. When people write complicated regular expressions, they usually start with small ones and add more and more sections until it works. They tend to increase their sophistication over time.

When you try to go back from the final version to see what it does, it’s a whole different challenge.

For example, look at this command:

grep -E '^([0-9]{4}[- ]){3}[0-9]{4}|[0-9]{16}' geeks.txt

Where would you start to unravel this? We will start at the beginning and take it one piece at a time:

^: The start of line anchor. So our sequence must be the first thing on a line.
([0-9]{4}[- ]): The parentheses gather the search pattern elements into a group. Other operations can be applied to this group as a whole (more on that later). The first element is a character class that contains a range of digits from zero to nine [0-9]. Our first character, therefore, is a digit from zero to nine. Next, we have an interval expression that contains the number four {4}. This applies to our first character, which we know will be a digit. Therefore, the first part of the search pattern is now four digits. It can be followed by either a space or a hyphen ([- ]) from another character class.
{3}: An interval specifier containing the number three immediately follows the group. It is applied to the whole group. As a result, our search pattern is now four digits, followed by a space or a hyphen, repeated three times.
[0-9]: Next, we have another character class that contains a range of digits from zero to nine [0-9]. This adds another character to the search pattern, and it can be any digit from zero to nine.
{4}: Another interval expression that contains the number four is applied to the preceding character. This means that the character becomes four characters, all of which can be any digit from zero to nine.
|: The alternation operator tells us that everything to its left is a complete search pattern, and everything to its right is a new search pattern. So this command actually searches for either of two search patterns. The first is three groups of four digits, each followed by a space or a hyphen, with four more digits tacked on.
[0-9]: The second search pattern starts with any digit from zero to nine.
{16}: An interval operator is applied to the first character and converts it to 16 characters, all of which are digits.

Thus, our search pattern will look for either of the following:

Four groups of four digits, each group being separated by a space or a dash (-).
A group of sixteen digits.

This search pattern looks for common ways of writing credit card numbers. It is also versatile enough to find different styles with a single command.
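To try the pattern without the original file, the sketch below runs it against a few obviously fake numbers (all of them made up for illustration):

```shell
# Hypothetical sample data; none of these are real card numbers.
sample=$(mktemp)
printf '%s\n' \
    '1234 5678 9012 3456' \
    '1234-5678-9012-3456' \
    '1234567890123456' \
    'card: 1234567890123456' \
    '1234 5678 9012' \
    'card: 1234 5678 9012 3456' > "$sample"

pattern='^([0-9]{4}[- ]){3}[0-9]{4}|[0-9]{16}'
count=$(grep -E -c "$pattern" "$sample")

echo "$count"   # 4

rm -f "$sample"
```

The first four lines match. The fifth has only three groups of digits. The sixth is rejected because the line start anchor (^) belongs only to the first alternative, just as it did in the '^N|W' example earlier, while the 16-digit alternative can match anywhere in a line.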

Go slowly

The complexity is usually just a lot of simplicity bolted together. Once you understand the basics, you can create effective and powerful utilities and develop valuable new skills.
