How to Use the join command on Linux

A terminal prompt ready for command on a Linux system.Fatmawati Achmad Zaenuri / Shutterstock

If you want to merge data from two text files by matching a common field, you can use the Linux join command. It adds a dash of dynamism to your static data files. We will show you how to use it.

Data match between files

Data is king. Businesses, businesses and households run it. But the data stored in different files and gathered by different people is a pain. In addition to knowing which files to open to find the desired information, the layout and format of the files may be different.

You also need to manage administrative headaches whose files need to be updated, which need to be backed up, which are inherited and which can be archived.

In addition, if you need to consolidate your data or perform analysis on a complete data set, you have an additional problem. How do you streamline data between different files before you can do with what you need to do? How do you approach the data preparation phase?

The good news is that if the files share at least one common data item, the Linux join command can get you out of the quagmire.

Data files

All the data we will use to demonstrate the use of the join command is fictitious, starting with the following two files:

cat-1.txt file
chat file-2.txt

Here is the content of file-1.txt:

1 Adore Varian avarian0@newyorker.com Woman 192.57.150.231
2 Nancee Merrell nmerrell1@ted.com Woman 22.198.121.181
3 Herta Friett hfriett2@dagondesign.com Woman 33.167.32.89
4 Torie Venmore tvenmore3@gmpg.org Woman 251.9.204.115
5 Deni Sealeaf dsealeaf4@nps.gov Woman 210.53.81.212
6 Fidel Bezley fbezley5@lulu.com Male 72,173,218.75
7 Ulrikaumeko Standen ustanden6@geocities.jp Female 4.204.0.237
8 Odell Jursch ojursch7@utexas.edu Male 1,138.85,117

We have a set of numbered lines, and each line contains all of the following information:

A number
A first name
Last name
An e-mail address
Person’s gender
An IP address

Here is the content of the file-2.txt:

1 Varian avarian0@newyorker.com Female Western New York $ 535,304.73
2 Merrell nmerrell1@ted.com Female Finger Lakes $ 309,033.10
3 Friett hfriett2@dagondesign.com Female Southern Tier $ 461,664.44
4 Venmore tvenmore3@gmpg.org Woman Central New York $ 175,818.02
5 Sealeaf dsealeaf4@nps.gov Woman North Country $ 126,690.15
6 Bezley fbezley5@lulu.com Male Mohawk Valley $ 366,733.78
7 Standen ustanden6@geocities.jp Women’s Capital District $ 674,634.93
8 Jursch ojursch7@utexas.edu Male Hudson Valley $ 663,821.09

Each line of the 2.txt file contains the following information:

A number
Last name
An e-mail address
Person’s gender
A region of New York
A dollar value

The join command works with “fields”, which means, in this context, a section of text surrounded by spaces, the beginning of a line or the end of a line. In order for the join to match the lines between the two files, each line must contain a common field.

Therefore, we can only match a field if it appears in both files. The IP address only appears in one file, so it’s not good. The first name only appears in one file, so we cannot use it either. The last name appears in both files, but it would be a bad choice, because different people have the same last name.

You also cannot link the data with male and female entries because it is too vague. New York regions and dollar values ​​only appear in one file as well.

However, we can use the email address as it is present in both files and each is unique to an individual. A quick glance in the files also confirms that the lines of each correspond to the same person. So we can use the line numbers as the corresponding field (we will use another field later).

Note that there are a different number of fields in the two files, which is fine – we can tell join which field to use from each file.

However, pay attention to areas like New York areas; in a file separated by spaces, each word in the name of a region looks like a field. Since some regions have two or three word names, you actually have a different number of fields in the same file. That’s okay, as long as you match the fields that appear in the row before the New York areas.

The join command

First, the field you are going to match must be sorted. We have increasing numbers in both files, so we meet these criteria. By default, join uses the first field of a file, that’s what we want. Another reasonable default is that join expects field separators to be spaces. Again, we have that, so we can go ahead and start the join.

As we use all the default settings, our command is simple:

attach file-1.txt file-2.txt

join considers files as “file one” and “file two” in the order in which they are listed on the command line.

The output is as follows:

1 Adore Varian avarian0@newyorker.com Woman 192.57.150.231 Varian avarian0@newyorker.com Woman Western New York $ 535,304.73
2 Nancee Merrell nmerrell1@ted.com Female 22.198.121.181 Merrell nmerrell1@ted.com Female Finger Lakes $ 309,033.10
3 Herta Friett hfriett2@dagondesign.com Female 33,167.32.89 Friett hfriett2@dagondesign.com Female Southern Tier $ 461,664.44
4 Torie Venmore tvenmore3@gmpg.org Woman 251.9.204.115 Venmore tvenmore3@gmpg.org Woman Central New York 175 818.02 $
5 Deni Sealeaf dsealeaf4@nps.gov Woman 210.53.81.212 Sealeaf dsealeaf4@nps.gov Woman North Country $ 126,690.15
6 Fidel Bezley fbezley5@lulu.com Male 72,173,218.75 Bezley fbezley5@lulu.com Male Mohawk Valley $ 366,733.78
7 Ulrikaumeko Standen ustanden6@geocities.jp Female 4.204.0.237 Standen ustanden6@geocities.jp Women’s Capital District $ 674,634.93
8 Odell Jursch ojursch7@utexas.edu Male 1,138.85,117 Jursch ojursch7@utexas.edu Male Hudson Valley $ 663,821.09

The output is formatted as follows: The field on which the lines have been mapped is printed first, followed by the other fields in file one, then the fields in file two without the correspondence field.

Unsorted fields

Let’s try something we know will not work. We will put the lines of a file out of order so that the join cannot process the file correctly. The contents of file-3.txt are the same as file-2.txt, but line eight is between lines five and six.

Here is the content of file-3.txt:

1 Varian avarian0@newyorker.com Female Western New York $ 535,304.73
2 Merrell nmerrell1@ted.com Female Finger Lakes $ 309,033.10
3 Friett hfriett2@dagondesign.com Female Southern Tier $ 461,664.44
4 Venmore tvenmore3@gmpg.org Woman Central New York $ 175,818.02
5 Sealeaf dsealeaf4@nps.gov Woman North Country $ 126,690.15
8 Jursch ojursch7@utexas.edu Male Hudson Valley $ 663,821.09
6 Bezley fbezley5@lulu.com Male Mohawk Valley $ 366,733.78
7 Standen ustanden6@geocities.jp Women’s Capital District $ 674,634.93

We type the following command to try to join file-3.txtto file-1.txt:

attach file-1.txt file-3.txt

join reports that the seventh line of file-3.txt is broken, so it is not processed. Line seven is the one that starts with the number six, which should precede eight in a correctly sorted list. The sixth line of the file (which begins with “8 Odell”) was the last processed, so we see the corresponding output.

You can use the –check-order option if you want to see if the join is satisfied with the sort order of the files – no merge will be attempted.

To do this, we type the following:

join –check-order file-1.txt file-3.txt

join tells you in advance that there will be a problem with line seven of the file file-3.txt.

Files with missing lines

In the 4.txt file, the last line has been deleted, so there is no line eight. The content is as follows:

1 Varian avarian0@newyorker.com Female Western New York $ 535,304.73
2 Merrell nmerrell1@ted.com Female Finger Lakes $ 309,033.10
3 Friett hfriett2@dagondesign.com Female Southern Tier $ 461,664.44
4 Venmore tvenmore3@gmpg.org Woman Central New York $ 175,818.02
5 Sealeaf dsealeaf4@nps.gov Woman North Country $ 126,690.15
6 Bezley fbezley5@lulu.com Male Mohawk Valley $ 366,733.78
7 Standen ustanden6@geocities.jp Women’s Capital District $ 674,634.93

We type the following and, surprisingly, join does not complain and deals with all the lines it can:

join file-1.txt file-4.txt

The output lists seven merged lines.

The -a (print unpairable) option tells join to also print lines that cannot be matched.

Here we type the following command to tell join to print lines from file one that cannot be matched to lines from file two:

join -a 1 file-1.txt file-4.txt

Seven lines are matched and line eight of file one is printed, unmatched. There is no merged information because file-4.txt did not contain line eight to which it could be mapped. However, at least it still appears in the output so you know that it has no correspondence in the 4.txt file.

We type the following command -v (delete attached lines) to reveal all the lines that do not match:

join -v file-1.txt file-4.txt

We see that line eight is the only one that has no match in file two.

Match other fields

Associate two new files on a field which is not the default one (field one). Here is the content of file-7.txt:

avarian0@newyorker.com Woman 192.57.150.231
dsealeaf4@nps.gov Female 210.53.81.212
fbezley5@lulu.com Male 72.173.218.75
hfriett2@dagondesign.com Female 33.167.32.89
nmerrell1@ted.com Woman 22.198.121.181
ojursch7@utexas.edu Male 1,138.85,117
tvenmore3@gmpg.org Woman 251.9.204.115
ustanden6@geocities.jp Female 4.204.0.237

And the following is the contents of the file-8.txt:

Woman avarian0@newyorker.com Western New York $ 535,304.73
Woman dsealeaf4@nps.gov North Country $ 126,690.15
Male fbezley5@lulu.com Mohawk Valley $ 366,733.78
Woman hfriett2@dagondesign.com South level $ 461,664.44
Woman nmerrell1@ted.com Finger Lakes $ 309,033.10
Male ojursch7@utexas.edu Hudson Valley $ 663,821.09
Woman tvenmore3@gmpg.org Central New York $ 175,818.02
Woman ustanden6@geocities.jp Capital District $ 674,634.93

The email address, which is field one in the first file and field two in the second, is the only reasonable field to use for joining. To answer this, we can use the options -1 (one field file) and -2 (two field file). We will follow them with a number that indicates which field of each file should be used to join.

We type the following to tell join to use the first field in file one and the second in file two:

attach -1 1 -2 2 file-7.txt file-8.txt

The files are attached to the email address, which appears as the first field in each line in the output.

Use of different field separators

What if you have files whose fields are separated by something other than spaces?

The following two files are separated by commas – the only space is between multi-word place names:

cat-5.txt file
cat-6.txt file

We can use the -t (separator character) to tell join which character to use as a field separator. In this case, it’s the comma, so we type the following command:

join -t, file-5.txt file-6.txt

All lines are matched and spaces are preserved in place names.

Ignore the case of letters

Another file, file-9.txt, is almost identical to file-8.txt. The only difference is that some email addresses have a capital letter, as shown below:

Woman avarian0@newyorker.com Western New York $ 535,304.73
Woman dsealeaf4@nps.gov North Country $ 126,690.15
Male Fbezley5@lulu.com Mohawk Valley $ 366,733.78
Woman hfriett2@dagondesign.com South level $ 461,664.44
Woman nmerrell1@ted.com Finger Lakes $ 309,033.10
Male Ojursch7@utexas.edu Hudson Valley $ 663,821.09
Woman tvenmore3@gmpg.org Central New York $ 175,818.02
Woman ustanden6@geocities.jp Capital District $ 674,634.93

When we joined file-7.txt and file-8.txt, it worked perfectly. Let’s see what happens with file-7.txt and file-9.txt.

We type the following command:

attach -1 1 -2 2 file-7.txt file-9.txt

We only matched six lines. The differences in uppercase and lowercase letters prevented the other two email addresses from being reached.

However, we can use the -i option (ignore case) to force the join to ignore these differences and to match fields that contain the same text, regardless of case.

We type the following command:

attach -1 1 -2 2 -i file-7.txt file-9.txt

The eight lines are successfully paired and joined.

Mix and match

In joining, you have a powerful ally when you struggle with awkward data preparation. You may need to analyze the data, or perhaps you are trying to model it to import to another system.

Whatever the situation, you will be happy to have joined your corner!

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.