How to Peek Inside Binary Files From the Linux Command Line

A stylized Linux terminal with lines of green text on a laptop. Image credit: fatmawati achmad zaenuri / Shutterstock

Have a mystery file? The Linux file command will quickly tell you what type of file it is. If it is a binary file, there is even more you can find out about it. The file command has a whole set of stablemates that will help you analyze it. We will show you how to use some of these tools.

Identifying file types

Files generally have characteristics that allow software packages to identify which type of file they are, as well as what the data inside them represents. It would not make sense to try to open a PNG file in an MP3 music player, so it is both useful and pragmatic that a file carries with it some form of identification.

This might be a few signature bytes at the very beginning of the file. This allows a file to be explicit about its format and content. Sometimes the file type is inferred from a distinctive aspect of the internal organization of the data itself, known as the file architecture.
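To make the idea of signature bytes concrete, here is a minimal sketch that dumps the first few bytes of a file as hex. The helper name magic_bytes is hypothetical (it is not a standard utility); it relies only on od and tr.

```shell
# Hypothetical helper: print the first N bytes of a file as lowercase hex.
# $1 = path, $2 = number of bytes to read (defaults to 4).
magic_bytes() {
    # -An drops the offset column, -v prevents od from collapsing repeats,
    # -tx1 prints one byte per hex pair, -N limits how much is read.
    od -An -v -tx1 -N "${2:-4}" "$1" | tr -d ' \n'
    echo
}
```

For a PNG image, magic_bytes screenshot.png 8 should print 89504e470d0a1a0a, which is the fixed eight-byte PNG signature. A PDF begins with the ASCII characters %PDF, so its first four bytes are 25504446.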

Some operating systems, such as Windows, are guided entirely by a file's extension. You can call it gullible or trusting, but Windows assumes that any file with the DOCX extension really is a DOCX word processing file. Linux is not like that, as you will soon see. It wants proof and looks inside the file to find it.

The tools described here were already installed on the Manjaro 20, Fedora 21 and Ubuntu 20.04 distributions that we used to research this article. Let’s start our investigation using the file command.

Using the file command

We have a collection of different types of files in our current directory. They are a mixture of documents, source code, executables and text files.

The ls command will show us what’s in the directory, and the -hl option (human-readable sizes, long listing) will show us the size of each file:

ls -hl

Let’s try to classify some of them and see what we get:

file build_instructions.odt
file build_instructions.pdf
file COBOL_Report_Apr60.djvu

All three file formats are correctly identified. Where possible, file gives us a little more information. The PDF file is reported to be in version 1.5 format.

Even if we rename the ODT file to have an extension with the arbitrary value of XYZ, the file is still correctly identified, both in the file browser and on the command line using file.

In the Files file browser, the file is given the correct icon. On the command line, file ignores the extension and looks inside the file to determine its type:

file build_instructions.xyz
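If you would like to repeat this check on your own files without renaming anything by hand, the following sketch copies a file under a misleading extension and compares what file reports for each. The function name type_survives_rename and the temporary file name renamed.xyz are hypothetical; the only file option used is -b (brief), which leaves the file name out of the report.

```shell
# Hypothetical helper: does `file` report the same type after the
# extension is changed? Returns 0 if the two reports match.
type_survives_rename() {
    src=$1
    tmpdir=$(mktemp -d) || return 2
    cp "$src" "$tmpdir/renamed.xyz"
    # -b omits the file name, so the two reports are directly comparable.
    before=$(file -b "$src")
    after=$(file -b "$tmpdir/renamed.xyz")
    rm -rf "$tmpdir"
    printf 'original: %s\nrenamed:  %s\n' "$before" "$after"
    [ "$before" = "$after" ]
}
```

Calling type_survives_rename build_instructions.odt prints both reports and succeeds when they are identical, which is what you would expect given that file inspects the content and not the name.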

Using file on media, such as image and music files, usually yields information about their format, encoding, resolution, and so on:

file screenshot.png
file screenshot.jpg
file Pachelbel_Canon_In_D.mp3

Interestingly, even with plain text files, file does not judge a file by its extension. For example, if you have a file with the “.c” extension that contains ordinary plain text but no source code, file does not mistake it for a genuine C source code file:

file function+headers.h
file makefile
file hello.c

file correctly identifies the header file (“.h”) as part of a collection of C source code files, and knows that the makefile is a script.

Using file with binary files

Binary files are more of a “black box” than other file types. Image files can be viewed, sound files can be played, and document files can be opened by the appropriate software. Binaries, however, are more of a challenge.

For example, the files “hello” and “wd” are binary executables. These are programs. The file called “wd.o” is an object file. When source code is compiled by a compiler, one or more object files are created. These contain the machine code that the computer will eventually execute when the finished program runs, as well as information for the linker. The linker checks each object file for function calls to libraries. It links them to all the libraries used by the program. The result of this process is an executable file.

The file “watch.exe” is a binary executable that has been cross-compiled to run on Windows:

file wd
file wd.o
file hello
file watch.exe

Taking the last one first, file tells us that “watch.exe” is a PE32+ executable, a console program, for the x86 family of processors on Microsoft Windows. PE stands for Portable Executable format, which has 32-bit and 64-bit versions. PE32 is the 32-bit version and PE32+ is the 64-bit version.

The other three files are all identified as Executable and Linkable Format (ELF) files. This is a standard for executable files and shared object files, such as libraries. We will examine the ELF header format shortly.

What might catch your attention is that the two executables (“wd” and “hello”) are identified as LSB shared objects, and the object file “wd.o” is identified as an LSB relocatable. In the output of file, LSB means least significant byte first, that is, little-endian byte order. The word executable is conspicuous by its absence.

Object files are relocatable, which means that the code they contain can be loaded into memory at any location. The executables are listed as shared objects because they were created by the linker from the object files in such a way that they inherit this ability.

This allows Address Space Layout Randomization (ASLR) to load the executables into memory at addresses of its choosing. Standard executables have a load address encoded in their headers, which dictates where they are loaded into memory.

ASLR is a security technique. Loading executables into memory at predictable addresses makes them vulnerable to attack, because their entry points and the locations of their functions will always be known to attackers. Position-independent executables (PIE) loaded at a random address overcome this susceptibility.
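Whether the kernel actually randomizes load addresses is a system-wide setting. On Linux it is exposed through /proc/sys/kernel/randomize_va_space, where 0 means off, 1 means conservative randomization and 2 means full randomization. The sketch below reads that value; the function name aslr_status and the labels it prints are hypothetical choices made for this example.

```shell
# Hypothetical helper: report the kernel's ASLR setting in words.
aslr_status() {
    case "$(cat /proc/sys/kernel/randomize_va_space 2>/dev/null)" in
        0) echo "disabled" ;;
        1) echo "partial" ;;   # conservative randomization
        2) echo "full" ;;      # also randomizes the heap
        *) echo "unknown" ;;   # file missing or unexpected value
    esac
}
```

Running aslr_status on a typical desktop distribution will usually report full, but the value can be changed by an administrator, so treat the result as a description of your own machine only.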

If we compile our program with the gcc compiler and provide the -no-pie option, we will generate a conventional executable.

The -o option (output file) allows us to give a name to our executable:

gcc -o hello -no-pie hello.c

We will use file on the new executable and see what has changed:

file hello

The size of the executable is the same as before (17 KB):

ls -hl hello

The binary is now identified as a standard executable. We are doing this for demonstration purposes only. If you compile applications this way, you will lose all the benefits of ASLR.

Why is an executable so big?

Our example hello program is 17 KB, so it could hardly be called large, but then everything is relative. The source code is 120 bytes:

cat hello.c

What inflates the binary if it only prints a string in the terminal window? We know there is an ELF header, but that’s only 64 bytes for a 64-bit binary. Clearly, it must be something else:

ls -hl hello

Scanning the binary with the strings command is a simple first step toward discovering what it contains. We will pipe the output into less:

strings hello | less

There are many strings inside the binary, in addition to “Hello, Geek world!” from our source code. Most of them are labels for the regions within the binary, and the names and linking information of the shared objects. These include the libraries, and the functions within those libraries, on which the binary depends.
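To see roughly how such a scan works, here is a simplified stand-in for strings built from tr and grep. It is a sketch, not a reimplementation: the name printable_runs is hypothetical, and the real strings command has many more options. The default minimum length of four characters matches the default used by strings.

```shell
# Hypothetical helper: approximate `strings` with standard tools.
# $1 = path, $2 = minimum run length (defaults to 4).
printable_runs() {
    # Replace every byte that is not printable ASCII (or a tab) with a
    # newline, then keep only the lines that are long enough.
    LC_ALL=C tr -c '[:print:]\t' '\n' < "$1" |
        LC_ALL=C grep -E ".{${2:-4},}"
}
```

Running printable_runs hello | less should show much the same list as strings hello | less, although the two may differ in edge cases such as how whitespace is treated.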

The ldd command shows us the shared object dependencies of a binary:

ldd hello

There are three entries in the output, and two of them include a directory path (the first does not have one):

linux-vdso.so: The virtual dynamic shared object (VDSO) is a kernel mechanism that allows a set of kernel-space routines to be accessed by a user-space binary. This avoids the overhead of a context switch from user mode to kernel mode. VDSO shared objects adhere to the Executable and Linkable Format (ELF), which allows them to be dynamically linked to the binary at runtime. The VDSO is dynamically allocated and takes advantage of ASLR. The VDSO capability is provided by the standard GNU C Library if the kernel supports the ASLR scheme.
libc.so.6: The GNU C Library shared object.
/lib64/ld-linux-x86-64.so.2: This is the dynamic linker that the binary wants to use. The dynamic linker interrogates the binary to find out what dependencies it has. It loads those shared objects into memory. It prepares the binary to run and to be able to find and access the dependencies in memory. Then it launches the program.

ELF header

We can examine and decode the ELF header using the readelf utility and the -h (file header) option:

readelf -h hello

The header is interpreted for us.

The first byte of all ELF binaries is set to the hexadecimal value 0x7F. The next three bytes are set to 0x45, 0x4C and 0x46. The first byte is a flag that identifies the file as an ELF binary. To make this unmistakable, the next three bytes spell out “ELF” in ASCII. The remaining fields are:

Class: Indicates whether the binary is a 32-bit or 64-bit executable (1 = 32, 2 = 64).
Data: Indicates the endianness in use. Endian encoding defines how multibyte numbers are stored. In big-endian encoding, a number is stored with its most significant byte first. In little-endian encoding, the number is stored with its least significant byte first.
Version: The ELF version (currently 1).
OS/ABI: Represents the type of application binary interface in use. This defines the interface between two binary modules, such as a program and a shared library.
ABI version: The version of the ABI.
Type: The ELF file type. The common values are ET_REL for a relocatable resource (such as an object file), ET_EXEC for an executable compiled with the -no-pie flag, and ET_DYN for an ASLR-aware executable.
Machine: The instruction set architecture. This indicates the target platform for which the binary was created.
Version: Always set to 1, for this version of ELF.
Entry point address: The memory address within the binary at which execution begins.

The other entries are sizes and numbers of regions and sections in the binary so that their locations can be calculated.
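The first few of these fields sit at fixed offsets, so they can be decoded by hand. The sketch below reads the magic bytes, the class byte (offset 4), the data byte (offset 5) and the two-byte type field (offset 16) using od. The helper names byte_at and elf_summary are hypothetical, and the output wording is this example's own, not that of readelf.

```shell
# Hypothetical helper: unsigned decimal value of the byte at offset $2.
byte_at() {
    od -An -tu1 -j "$2" -N 1 "$1" 2>/dev/null | tr -d ' \n'
}

# Hypothetical helper: summarize class, byte order and type of an ELF file.
elf_summary() {
    f=$1
    # Bytes 0-3 must be 0x7F followed by 'E', 'L', 'F'.
    magic=$(od -An -tx1 -N 4 "$f" | tr -d ' \n')
    [ "$magic" = "7f454c46" ] || { echo "not ELF"; return 1; }

    case "$(byte_at "$f" 4)" in     # class: 1 = 32-bit, 2 = 64-bit
        1) class="32-bit" ;; 2) class="64-bit" ;; *) class="unknown-class" ;;
    esac
    case "$(byte_at "$f" 5)" in     # data: 1 = little-endian, 2 = big-endian
        1) data="little-endian" ;; 2) data="big-endian" ;; *) data="unknown-order" ;;
    esac

    # The type field is two bytes at offset 16, in the file's own byte order.
    b0=$(byte_at "$f" 16); b1=$(byte_at "$f" 17)
    if [ "$data" = "big-endian" ]; then
        t=$(( ${b0:-0} * 256 + ${b1:-0} ))
    else
        t=$(( ${b1:-0} * 256 + ${b0:-0} ))
    fi
    case "$t" in
        1) etype="ET_REL" ;; 2) etype="ET_EXEC" ;; 3) etype="ET_DYN" ;; *) etype="other" ;;
    esac
    echo "ELF $class $data $etype"
}
```

For a 64-bit position-independent executable on an x86-64 machine, elf_summary hello would be expected to report a 64-bit, little-endian ET_DYN file, while a build made with -no-pie would be expected to show ET_EXEC.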

A quick look at the first eight bytes of the binary with hexdump will display the signature byte and the string “ELF” in the first four bytes of the file. The -C (canonical) option gives us the ASCII representation of the bytes next to their hexadecimal values, and the -n (number) option allows us to specify how many bytes we want to see:

hexdump -C -n 8 hello

objdump and the granular view

If you want to see the fine detail, you can use the objdump command with the -d (disassemble) option:

objdump -d hello | less

This disassembles the executable machine code and displays it as hexadecimal bytes alongside the assembly language equivalent. The address of the first byte on each line is displayed on the far left.

This is only useful if you can read assembly language or if you’re curious about what’s going on behind the curtain. There is a lot of output, so we piped it into less.

Compilation and linking

There are many ways to compile a binary. For example, the developer chooses whether to include debugging information. The way the binary is linked also plays a role in its content and size. If the binary references shared objects as external dependencies, it will be smaller than one in which the dependencies are statically linked.

Most developers are already familiar with the commands we have covered here. For others, however, they offer easy ways to search and see what’s inside the binary black box.
