How to Correctly Parse File Names in Bash


Bash file naming conventions are very rich, and it is easy to write a script or one-liner that parses file names incorrectly. Learn how to parse file names properly and make sure your scripts work as expected!

The problem of parsing file names correctly in Bash

If you have been using Bash for a while and have written scripts in it, you have likely encountered problems parsing filenames. Let’s take a look at a simple example of what can go wrong:

touch 'a
> b'

Creating a file with a CR character in the file name

Here we have created a file with a real CR (carriage return) in its name, entered by pressing Enter after the a. (Strictly speaking, on Linux the Enter key inserts a newline, or line feed, character; it is called CR here.) The naming conventions for Bash files are very rich, and while it is interesting that we can use special characters like this in a filename, let’s see how this file behaves when we try to perform some actions on it:

ls | xargs rm

The problem when trying to handle a filename that includes CR

It did not work. xargs will take input from ls (via pipe |), and pass it to rm, but something went wrong in the process!

What’s wrong is that xargs takes the output of ls literally, and sees the Enter (CR) in the filename as an actual item terminator, not as a character to pass on to rm as part of the name.
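The split can be reproduced safely in a scratch directory. The following is only a minimal sketch: the temporary directory and the variable names (workdir, nl, arg_count) are made up for illustration, and it assumes an ls that prints the raw newline when its output goes to a pipe, as GNU ls does.

```shell
#!/usr/bin/env bash
# Sketch: count the arguments xargs builds from ls output.
workdir=$(mktemp -d)          # throwaway directory; no real files are touched
cd "$workdir" || exit 1

nl='
'                             # a variable holding one literal newline
touch "a${nl}b"               # one file, named: a, newline, b

# sh -c 'echo "$#"' prints how many arguments xargs handed over.
file_count=1
arg_count=$(ls | xargs sh -c 'echo "$#"' sh)

echo "files created: $file_count, arguments seen by xargs: $arg_count"

cd / && rm -rf "$workdir"
```

One file goes in, but xargs hands over two arguments, which is why rm ends up looking for two files that do not exist.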

Let’s illustrate this in another way:

ls | xargs -I{} echo '{}|'

Showing how xargs will see the CR character as a newline and split the data on it

It’s clear: xargs treats the entry as two individual lines, splitting the original filename in two! Even if we worked around this issue with some fancy parsing in sed, we would soon run into other issues as soon as we used other special characters such as spaces, backslashes, quotes and more!

touch 'a
b'
touch 'a b'
touch 'a\b'
touch "a\"b"
touch "a'b"
ls

All kinds of special characters in file names

Even if you are a seasoned Bash developer, you may shiver when seeing filenames like these, as it would be very complex for most common Bash tools to parse them properly. You would have to do all kinds of string manipulation to make this work. That is, unless you have the secret recipe.

Before we dive into that, there is one more thing, a must-know, that you may come across when parsing the output of ls. If you are using color coding for directory listings, which is enabled by default on Ubuntu, it is easy to run into another set of ls parsing issues.

These don’t really have to do with how files are named, but rather with how files are presented in the ls output. The ls output will contain escape codes (ANSI color sequences) that tell your terminal which color to use.

To avoid encountering them, just use --color=never as an option for ls:
ls --color=never
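To see what those color codes look like to a script, the sketch below captures a listing with color forced on and with color turned off. It assumes GNU ls; the directory name sub, the variable names, and the explicit LS_COLORS value are illustrative choices made only so the result does not depend on your own settings.

```shell
#!/usr/bin/env bash
# Sketch: compare ls output with colors forced on and with colors off.
workdir=$(mktemp -d)          # throwaway directory
cd "$workdir" || exit 1
mkdir sub                     # directories are colored by default

esc=$(printf '\033')          # the ESC byte that starts each color sequence

# LS_COLORS is set explicitly here (an assumption for a predictable demo).
colored=$(LS_COLORS='di=01;34' ls --color=always)
plain=$(ls --color=never)

case $colored in *"$esc"*) colored_has_esc=yes ;; *) colored_has_esc=no ;; esac
case $plain   in *"$esc"*) plain_has_esc=yes   ;; *) plain_has_esc=no   ;; esac

echo "--color=always contains ESC: $colored_has_esc"
echo "--color=never contains ESC:  $plain_has_esc"

cd / && rm -rf "$workdir"
```

With color on, the captured text is no longer just the name sub, so any comparison or file operation based on it would fail.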

In Mint 20 (a great operating system derived from Ubuntu), this issue appears to be resolved, although it may still be present in many other or older versions of Ubuntu and similar systems. I saw this issue as recently as mid-August 2020 on Ubuntu.

Even if you do not use color coding for your directory listings, your script may run on other systems that you do not own or manage. In that case, you will also want to use this option to prevent users of such machines from running into the problem described.

Returning to our secret recipe, let’s see how we can make sure that we won’t have any issues with special characters in Bash filenames. The provided solution avoids any use of ls, which one would do well to avoid in general, so color coding issues are also not applicable.

There are still times when parsing ls is quick and convenient, but it will always be tricky and probably ‘dirty’ as soon as special characters are introduced, not to mention insecure (special characters can be used to introduce all kinds of problems).

The secret recipe: NULL termination

The developers of Bash tools recognized this same problem many years ago, and provided us with NULL termination!

What is NULL termination, you ask? Consider how, in the examples above, CR (literally Enter) was the main terminating character.

We have also seen how special characters such as quotes, white space, and backslashes can be used in file names, even though they have special functions in other Bash text parsing and modification tools like sed. Now compare that with the -0 option to xargs, from man xargs:

-0, --null Input items are terminated by a null character instead of by whitespace, and the quotes and backslash are not special (every character is taken literally). Disables the end of file string, which is treated like any other argument. Useful when input items might contain white space, quote marks, or backslashes. The GNU find -print0 option produces input suitable for this mode.
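The effect of -0 can be tried without creating any files. In this sketch the three sample items and the variable names are made up for illustration, and printf '%s\0' stands in for a program that writes NULL-terminated output.

```shell
#!/usr/bin/env bash
# Three sample items: one with a space, one with a double quote, one with a backslash.
# printf '%s\0' writes each item followed by a NUL byte.
nul_count=$(printf '%s\0' 'a b' 'c"d' 'e\f' | xargs -0 sh -c 'echo "$#"' sh)
echo "items received with -0: $nul_count"

# With -0, every character arrives literally; each item is printed in brackets.
printf '%s\0' 'a b' 'c"d' 'e\f' | xargs -0 printf '[%s]\n'

# Without -0, the lone double quote is treated as the start of a quoted string,
# so xargs reports an error instead of running the command.
if printf '%s\n' 'c"d' | xargs echo >/dev/null 2>&1; then
  plain_ok=yes
else
  plain_ok=no
fi
echo "plain xargs accepted the quote: $plain_ok"
```

The same three items pass through unchanged with -0, while a single stray quote is enough to make the plain invocation fail.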

And the -print0 option to find, from man find:

-print0 True; print the full file name on the standard output, followed by a null character (instead of the newline character that -print uses). This allows file names that contain newlines or other types of white space to be correctly interpreted by programs that process the find output. This option corresponds to the -0 option of xargs.

The True; here means that this action always evaluates to true. Also of interest are the two clear warnings given elsewhere in the same man page:

If you are piping the output of find into another program and there is the faintest possibility that the files which you are searching for might contain a newline, then you should seriously consider using the -print0 option instead of -print. See the UNUSUAL FILENAMES section for information about how unusual characters in filenames are handled.
If you are using find in a script or in a situation where the matched files might have arbitrary names, you should consider the use of -print0 instead of -print.

These clear warnings remind us that parsing filenames in Bash can be, and remains, a tricky business. However, with the right options to find (namely -print0) and xargs (namely -0), all of our filenames containing special characters can be parsed correctly:

ls
find . -name 'a*' -print0
find . -name 'a*' -print0 | xargs -0 ls
find . -name 'a*' -print0 | xargs -0 rm

The solution: find -print0 and xargs -0

We first check our directory listing. All of our filenames containing special characters are there. We then run a simple find ... -print0 to see the output. We note that the strings are NULL-terminated (the NULL character, also written \0, is not visible).

We also notice that there is a single CR in the output, which matches the single CR we entered in the first filename, consisting of a followed by enter followed by b.

Finally, the output does not end with a newline before the $ terminal prompt returns, because the strings are NULL-terminated, not CR-terminated. We hit Enter at the $ terminal prompt to make things a bit clearer.
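Because the NULL characters are invisible on a terminal, it can help to count them or to swap them for a visible character. This is only a sketch: the scratch directory, the two sample file names, and the choice of | as the stand-in character are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: make the NUL terminators in find -print0 output visible.
workdir=$(mktemp -d)          # throwaway directory
cd "$workdir" || exit 1

nl='
'                             # a variable holding one literal newline
touch "a${nl}b" 'a b'         # two files: one with a newline, one with a space

# Count the NUL bytes: tr -cd '\0' deletes everything except NUL.
nul_bytes=$(find . -name 'a*' -print0 | tr -cd '\0' | wc -c)
nul_bytes=$((nul_bytes))      # strip padding some wc implementations add

# Translate each NUL to '|' so the end of every name can be seen.
visible=$(find . -name 'a*' -print0 | tr '\0' '|')

echo "NUL terminators: $nul_bytes"
printf '%s\n' "$visible"

cd / && rm -rf "$workdir"
```

There is exactly one terminator per file, and the newline inside the first name is left alone, since it is part of the name rather than a separator.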

Next, we add xargs with the -0 option, which allows xargs to handle the NULL-terminated input correctly. We see that the input passed to ls, and the output received from it, looks clean, with no text transformation going on.

Finally, we try our rm command again, and this time for all the files including the original one containing the CR that we had problems with. The rm works perfectly and no errors or parsing problems are observed. Awesome!
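The whole sequence can be replayed in a scratch directory, so that nothing of value is at risk. This is a minimal sketch rather than a definitive script: the temporary directory and the variable names are made up, and the five file names are chosen in the spirit of the examples above.

```shell
#!/usr/bin/env bash
# Sketch: find -print0 | xargs -0 on names with awkward characters.
workdir=$(mktemp -d)          # throwaway directory
cd "$workdir" || exit 1

nl='
'                             # a variable holding one literal newline
# Five awkward names: newline, space, backslash, double quote, single quote.
touch "a${nl}b" 'a b' 'a\b' 'a"b' "a'b"

# How many arguments reach the command with NUL termination?
safe_args=$(find . -name 'a*' -print0 | xargs -0 sh -c 'echo "$#"' sh)

# Remove the files the same way, then count what is left.
find . -name 'a*' -print0 | xargs -0 rm
rm_status=$?
remaining=$(find . -name 'a*' -print0 | tr -cd '\0' | wc -c)
remaining=$((remaining))      # strip padding some wc implementations add

echo "arguments passed: $safe_args, rm exit status: $rm_status, files left: $remaining"

cd / && rm -rf "$workdir"
```

Five files produce five arguments, and rm removes them all without complaint.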

Wrapping up

We have seen how important it is to properly parse and handle filenames in Bash. While learning how to use find properly is a bit more difficult than simply using ls, the benefits pay off in the end: increased security and no problems with special characters.

If you enjoyed this article, you may also like How to Bulk Rename Files to Digital Filenames in Linux, which shows an interesting and somewhat complex find -print0 | xargs -0 command. Enjoy!
