How to Compare Binary Files on Linux
How can you check whether two Linux binaries are the same? With executable files, differences can indicate unwanted or malicious behavior. Here’s the easiest way to check whether they’re different.
Comparing Binaries
Linux is rich in possibilities for comparing and analyzing text files. The diff command compares two files for you and highlights the differences. It can even include some lines on either side of the changes to provide context around the changed lines. And the colordiff command adds color to make visual analysis of the differences even easier.
Developers and authors use diff to highlight the differences between versions of program source code files or draft texts. It’s quick and easy, and you don’t need any technical knowledge to spot the differences between text strings.
In the world of binaries, things aren’t that simple. Binary files do not consist of plain text. They consist of many bytes that contain numeric values. If the file is an archive such as a compressed TAR file or a ZIP file, these values represent the compressed files stored in the archive, along with the tables needed to decompress and extract them.
If the binary file is an executable file, the numeric values of the file’s bytes are interpreted as such things as machine code instructions for the CPU, metadata, labels, or encoded data. Modifications to a binary or a library file are likely to result in different behavior when the binary is run or used by another application.
It’s easy to spoof a file’s creation or modification date and time. This means that there can be two versions of a file with the same name, file size – if the changes replace the existing content byte for byte – and date stamp. And yet one of the files could have been altered.
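To see how little a matching size and time stamp prove, here is a minimal sketch. The file names original.bin and altered.bin are made up for the illustration, and touch -r is used to copy the modification time from one file to the other.

```shell
# Work in a throwaway directory so nothing real is touched.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Two hypothetical files of identical length that differ in one byte.
printf 'binary-data-AAAA' > "$tmp/original.bin"
printf 'binary-data-AAAB' > "$tmp/altered.bin"

# Copy the modification time of the original onto the altered file.
touch -r "$tmp/original.bin" "$tmp/altered.bin"

# Size and time stamp now match in the listing...
ls -l "$tmp"

# ...but a byte-for-byte comparison still fails. cmp -s is silent and
# reports the result through its exit status: 0 same, 1 different.
if cmp -s "$tmp/original.bin" "$tmp/altered.bin"; then
    result="identical"
else
    result="different"
fi
echo "Contents are $result"
```

The listing alone cannot tell the two files apart, which is why the rest of this article relies on the file contents instead.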
Secure Hash Algorithms
A secure hash algorithm is a mathematical algorithm. It scans all the bytes in a file and applies a mathematical transformation to them to generate a fixed-size hash value. SHA-256, the algorithm used below, produces a 256-bit value that is displayed as 64 hexadecimal characters. The same file will always produce the same hash. Even a one-byte difference results in a radically different hash.
Often the hash of a file is shown on its download page. You should generate a hash for the file once you’ve downloaded it. If it differs from the hash shown on the web page, you know the file is compromised. Either a tampered file was substituted for the original, to trick people into downloading it, or the file was corrupted in transit.
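Checking a download against a published hash can be scripted with the -c option of sha256sum. The following is only a sketch: download.bin is a hypothetical file name, and the "published" hash is computed locally as a stand-in for the value you would normally copy from the download page.

```shell
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Stand-in for a downloaded file (hypothetical name).
printf 'pretend this is a downloaded program\n' > "$tmp/download.bin"

# Stand-in for the hash published on the download page. In real use you
# would paste the published value here instead of computing it.
published=$(sha256sum "$tmp/download.bin" | cut -d ' ' -f 1)

# sha256sum -c reads lines of the form "<hash>  <file name>" and checks them.
printf '%s  %s\n' "$published" "$tmp/download.bin" > "$tmp/download.bin.sha256"
if sha256sum -c "$tmp/download.bin.sha256" >/dev/null 2>&1; then
    first_check="match"
else
    first_check="mismatch"
fi
echo "Untouched file: $first_check"

# Simulate corruption or tampering by appending a single byte.
printf 'X' >> "$tmp/download.bin"
if sha256sum -c "$tmp/download.bin.sha256" >/dev/null 2>&1; then
    second_check="match"
else
    second_check="mismatch"
fi
echo "Modified file: $second_check"
```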
On our test computer we have two copies of the same file, a shared library. The files have been renamed so they can be in the same directory. In theory, these files should be the same. After all, they should be the same version of the shared library.
ls -l *.so
The files have the same size, the same date stamps, and the same time stamps. To the casual observer, they look the same. Let’s use the sha256sum command to generate a hash for each file.
sha256sum binary_file1.so
sha256sum binary_file2.so
The hashes are completely different, which clearly indicates that there are differences between the two files. If the website shows the hash of the original file, you can discard the mismatched file.
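If you would rather have a script make the comparison than compare two long strings by eye, something along these lines could work. It is a sketch using small stand-in files created on the spot, which borrow the names of the files in this article but are not the real libraries.

```shell
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Stand-ins for the two library copies: same length, one byte changed.
printf 'shared library contents 0' > "$tmp/binary_file1.so"
printf 'shared library contents 1' > "$tmp/binary_file2.so"

# sha256sum prints "<hash>  <file name>"; keep only the hash.
hash1=$(sha256sum "$tmp/binary_file1.so" | cut -d ' ' -f 1)
hash2=$(sha256sum "$tmp/binary_file2.so" | cut -d ' ' -f 1)

echo "file1: $hash1"
echo "file2: $hash2"

if [ "$hash1" = "$hash2" ]; then
    verdict="same"
else
    verdict="different"
fi
echo "The files are $verdict"
```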
Finding the Differences
If you want to see the changes, there are ways to do that too. You don’t need to be able to decompile the file or understand assembly or machine code just to see the changes. Understanding what these changes mean and what their purpose is would, of course, require deeper technical knowledge. But just knowing how extensive the changes are can give a clue as to what happened to the file.
If we use diff on the two binaries, we get a response that’s a bit disappointing.
diff binary_file1.so binary_file2.so
We already knew the files were different. Let’s try cmp.
cmp binary_file1.so binary_file2.so
That tells us a little bit more. The first byte that differs between the two files is byte number 13451. That is, counting from the beginning of the binary, byte 13451 is different in the two binaries. So 13451 is the offset of the first difference from the beginning of the file.
It just so happens that there are bytes throughout the file that contain the hexadecimal value 0x0A (10 in decimal). This is the value that Linux uses as the newline character in text files. The cmp command found 131 bytes with this value between the start of the binary and the position of the first difference, so it thinks the difference is on line 132. That line number doesn’t mean anything in this context.
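The way that line number comes about can be reproduced by hand. The sketch below uses two tiny stand-in files (one.bin and two.bin are invented names) and counts the newline bytes that come before the first difference.

```shell
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Two small stand-in files. Each contains two newline bytes (0x0A)
# before the first byte that differs.
printf 'ab\ncd\nefX' > "$tmp/one.bin"
printf 'ab\ncd\nefY' > "$tmp/two.bin"

# What cmp itself reports, for comparison.
cmp "$tmp/one.bin" "$tmp/two.bin"

# cmp -l lists every differing byte; the first field of the first line
# is the 1-based byte number of the first difference.
first=$(cmp -l "$tmp/one.bin" "$tmp/two.bin" | head -n 1 | awk '{print $1}')

# Count the newline bytes that come before that position.
newlines=$(head -c "$((first - 1))" "$tmp/one.bin" | tr -cd '\n' | wc -c)

# The "line" cmp reports is simply that count plus one.
line=$((newlines + 1))
echo "first differing byte: $first, reported line: $line"
```

In a real binary those newline bytes are just values that happen to equal 10, which is why the line number carries no meaning there.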
If we add the -l (verbose) option, we get more useful information.
cmp -l binary_file1.so binary_file2.so
All deviating bytes are listed. The byte number or offset, the value from the first file, and the value from the second file are displayed with one byte per line of output.
The byte values are displayed in octal format instead of the hexadecimal format usually used with binary files. Still, we’ve learned something else. All of the changed bytes are in one continuous sequence: their offsets increase by one for each byte.
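If you prefer hexadecimal, the octal values can be converted with printf. This is a sketch on two small stand-in files with invented names. It relies on printf treating a number with a leading 0 as octal.

```shell
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Stand-in files that differ in their last two bytes.
printf 'ABCDEF' > "$tmp/one.bin"
printf 'ABCDxy' > "$tmp/two.bin"

# cmp -l prints: <byte number> <octal value in file 1> <octal value in file 2>.
# Prefixing the octal fields with 0 makes printf read them as octal, and
# %02x prints them back in hexadecimal. The byte number is 1-based, so
# subtracting 1 gives the zero-based offset that hex dump tools show.
converted=$(cmp -l "$tmp/one.bin" "$tmp/two.bin" |
    while read -r byte old new; do
        printf '0x%08x  %02x -> %02x\n' "$((byte - 1))" "0$old" "0$new"
    done)
echo "$converted"
```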
The hexdump tool outputs a binary file in the terminal window. If we use the -C (canonical) option, each line of output lists the offset, the values of the 16 bytes at that offset, and, where one exists, the ASCII representation of each byte value.
hexdump -C binary_file1.so
We can use the output of hexdump as input to diff, to let diff work as if it were reading two text files.
diff <(hexdump binary_file1.so) <(hexdump binary_file2.so)
diff finds the lines that differ and shows the hexadecimal byte values from the first file above the values from the second file. The offset of the first line is 0x3480, or 13440 in decimal. Earlier, cmp informed us that the first change occurred at byte 13451, which is 0x348B. That corresponds to what we see here.
The output of diff is in two-byte blocks. The first block holds bytes 0 and 1 from the offset of 0x3480, and the second block holds bytes 2 and 3 from the offset. Block 6 contains bytes 0xA and 0xB, or 10 and 11 in decimal notation. These are bytes 13450 and 13451. And we can see that these are the first bytes that differ. The first five pairs of bytes are the same in both files.
However, because the offsets in the hexdump output that diff is reading count from zero, the byte that cmp calls 13451 is byte 13450 to diff. And to make things even more confusing, the byte order is reversed within each two-byte block of the output. The bytes are actually listed in this order: 1 and 0, 3 and 2, 5 and 4, 7 and 6, and so on.
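The offset arithmetic is easy to get wrong, so it may help to let the shell do it. This short sketch uses only the numbers quoted above. The variable names are arbitrary.

```shell
# Offset of the line that diff reported, converted from hex to decimal.
line_offset=$(printf '%d' 0x3480)

# cmp numbers bytes from 1 while hexdump offsets start at 0, so subtract 1
# before converting cmp's byte number to a hexdump-style offset.
cmp_byte=13451
zero_based=$((cmp_byte - 1))
hex_offset=$(printf '0x%x' "$zero_based")

# Position of that byte within its 16-byte hexdump line.
within_line=$((zero_based - line_offset))

echo "line starts at $line_offset"
echo "changed byte is at offset $zero_based ($hex_offset)"
echo "that is position $within_line within the line"
```

Position 10 is the first byte of the sixth two-byte block, which matches where the first difference appears in the diff output.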
The command is also computationally intensive, running two hexdumps and a diff all at once, especially when the files being compared are large.
But if hexdump -C can send an ASCII version of the binary to the terminal window, why don’t we redirect the output to text files and then compare those two text files with diff?
hexdump -C binary_file1.so > binary1.txt
hexdump -C binary_file2.so > binary2.txt
diff binary1.txt binary2.txt
The difference between the two files is shown in two short excerpts, each with an ASCII representation next to it. There will be a pair of excerpts for each difference between the files. In this example there is only one difference.
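Those three steps can be wrapped in a small shell function that also cleans up the intermediate text files. This is a sketch, not a standard tool: bindiff is a made-up name, and the fallback to od for systems without hexdump is an assumption of this example.

```shell
# bindiff is a hypothetical helper name, not a standard command.
bindiff() {
    dump_dir=$(mktemp -d) || return 2
    if command -v hexdump >/dev/null 2>&1; then
        hexdump -C "$1" > "$dump_dir/a.txt"
        hexdump -C "$2" > "$dump_dir/b.txt"
    else
        # Fallback (an assumption for systems without hexdump): od prints
        # hex offsets and one-byte hex values, but no ASCII column.
        od -A x -t x1 -v "$1" > "$dump_dir/a.txt"
        od -A x -t x1 -v "$2" > "$dump_dir/b.txt"
    fi
    diff "$dump_dir/a.txt" "$dump_dir/b.txt"
    status=$?
    rm -rf "$dump_dir"
    return "$status"
}

# Demonstration on two small stand-in files.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
printf 'The quick brown fox jumps over the lazy dog' > "$tmp/one.bin"
printf 'The quick brown fox jumps over the lazy cat' > "$tmp/two.bin"

output=$(bindiff "$tmp/one.bin" "$tmp/two.bin")
diff_status=$?
echo "$output"
echo "diff exit status: $diff_status"
```

The function passes on the exit status of diff, so 0 means the dumps were identical and 1 means they differed.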
That’s all very nice, but wouldn’t it be great if there was something that would do all of that for you?
VBinDiff
The VBinDiff program can be installed from the usual repositories for all major distributions. Use this command to install it on Ubuntu:
sudo apt install vbindiff
On Fedora you need to type:
sudo dnf install vbindiff
Manjaro users must use pacman.
sudo pacman -Sy vbindiff
To use the program, pass the names of the two binaries on the command line.
vbindiff binary_file1.so binary_file2.so
The terminal-based application opens showing both files in a scrolling view.
You can use the mouse scroll wheel or the Up Arrow, Down Arrow, Home, End, Page Up, and Page Down keys to move through the files. Both files scroll.
Press the “Enter” key to jump to the first difference. The difference is highlighted in both files.
If there were more differences, pressing “Enter” would show the next difference. Pressing “q” or “Esc” will end the program.
What’s the Difference?
If you work on a computer owned by someone else and you are not allowed to install packages, you can use cmp, diff, and hexdump. If you need to capture the output for further processing, these are also the tools you should use.
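As one example of that kind of further processing, the output of cmp -l can be boiled down to a one-line summary of how extensive the changes are. This is a sketch on small stand-in files with invented names. The wording of the summary is arbitrary.

```shell
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Stand-in files: bytes 5 to 7 (counting from 1) differ.
printf 'AAAAxyzAAA' > "$tmp/one.bin"
printf 'AAAA123AAA' > "$tmp/two.bin"

if cmp -s "$tmp/one.bin" "$tmp/two.bin"; then
    summary="identical"
else
    # Count the differing bytes and note the first and last byte numbers.
    # If last - first + 1 equals the count, the changes form one
    # continuous run.
    summary=$(cmp -l "$tmp/one.bin" "$tmp/two.bin" | awk '
        NR == 1 { first = $1 }
                { last = $1 }
        END {
            run = (last - first + 1 == NR) ? "contiguous" : "scattered"
            printf "%d bytes differ, from byte %d to byte %d (%s)", NR, first, last, run
        }')
fi
echo "$summary"
```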
But if you are allowed to install packages, VBinDiff makes your workflow easier and faster. And actually, using VBinDiff with a single binary is an easy and convenient way to search through binaries, which is a nice bonus.