What is Compression?
Compression is the process of reducing the size of a file by encoding its data so that the file can be stored or transmitted more efficiently. Compression can be applied to ordinary data files, but also to a special kind of file: the binary. A binary file can take the form of an executable, a dynamic-link library (DLL) or any other kind of binary file. Either way, the result is a reduction in the number of bits and bytes, and therefore a smaller file size. The size of the data in compressed form relative to its original size is known as the compression ratio. Ratios vary widely depending on the algorithm used and on the nature of the file being compressed.
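As a sketch of how a compression ratio is measured, the snippet below (Python, using the standard-library zlib module) compares a repetitive buffer with random bytes; the buffer contents are arbitrary examples chosen for illustration.

```python
import os
import zlib

def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(data)) / len(data)

# Highly repetitive data compresses to a tiny fraction of its size...
repetitive = b"ABAB" * 1000
# ...while random data barely shrinks at all.
random_bytes = os.urandom(4096)

print(compression_ratio(repetitive))
print(compression_ratio(random_bytes))
```

Running this shows two very different ratios for the same algorithm, which is exactly why the nature of the file matters as much as the compressor.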
Running out of disk space is still a frequent problem, even though modern PCs tend to be equipped with relatively large hard drives. A similar problem arises when sending or receiving files over the internet: a large file can take a long time to transfer, and on a slow connection it can take extremely long. What can be done to remedy this? The answer is to compress the files so they take up less room and less sending time.
How to use compression?
One way is to use programs that are specifically created to compress and decompress files. Once compressed, most files cannot be used until they are decompressed again, so this kind of compression is well suited to archiving or emailing. A well-known example of a compression technology is ZIP, a common standard for compressing data files. For binaries this approach does not work, because a compressed executable would lose the ability to start: it needs to be self-contained (see below for how this is solved in binaries). Compression is also used in many cases without the user realizing it. A modem uses a form of compression when it sends and receives data; a graphic in JPEG format is another example.
How does compression work?
A file containing text often holds repetitive single words, word combinations and phrases that use up far more storage space than they need to. The same applies to binary files with repetitive bits and bytes, and to media such as images whose data occupies much more space than necessary. The document or file can be compressed to remove this inefficiency.
How to achieve compression?
Compression is done by compression algorithms (formulae) that rearrange and reorganize data so that it can be stored more economically. By encoding information, data can be stored using fewer bits. This is done by a compression/decompression program that alters the structure of the data temporarily. Compression reduces the amount of data by representing the information in different, more efficient ways. Methods may include simply removing spaces, using a character plus a count to represent a string of repeated characters, or substituting larger bit sequences with smaller ones. Certain compression algorithms go as far as deleting information outright to achieve a smaller file size. Depending on the algorithm used, files can be reduced considerably relative to their original size.
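One of the simplest of these methods, run-length encoding (representing a string of repeated characters by the character plus a count), can be sketched as follows; the implementation is purely illustrative, not any particular tool's algorithm.

```python
def rle_encode(text: str) -> list[tuple[str, int]]:
    """Collapse each run of identical characters into a (char, count) pair."""
    runs: list[tuple[str, int]] = []
    for ch in text:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs

def rle_decode(runs: list[tuple[str, int]]) -> str:
    """Expand the (char, count) pairs back into the original text."""
    return "".join(ch * count for ch, count in runs)

print(rle_encode("AAAABBBCC"))  # [('A', 4), ('B', 3), ('C', 2)]
```

Decoding reverses the substitution exactly, so this is a (very small) example of lossless compression: it only pays off when the data actually contains long runs.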
Are there different systems?
If the inverse of compression, decompression, produces an exact replica of the original data, then the compression is lossless. The other kind, lossy compression, usually applied to image data, cannot reproduce an exact replica of the original image, but achieves a higher compression ratio. Lossy compression thus allows only an approximation of the original to be regenerated.
What is lossy compression?
Lossy compression reduces files by eliminating bits of data that, ideally, are not needed. MP3 is such a system: it relies on the way the brain interprets audio and uses various tricks to produce something that sounds almost the same but is actually missing as much as 90% of the data. Another lossy system is JPEG, which is designed to provide high compression for images. For instance, in a picture containing a landscape with a blue sky, all the slightly different shades of green and blue are eliminated. The essential nature of the data is not lost because the basic colours are still present. Large portions of the picture will be equally coloured, perhaps even whole lines or surfaces, but the image will still look the same to the human eye.
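The shade-merging idea can be sketched with a simple quantization step (a stand-in for what JPEG actually does, which is far more sophisticated): nearby pixel values are snapped to a common value, making the data more repetitive and thus more compressible, but the exact original shades are gone for good.

```python
def quantize(pixels: list[int], step: int = 32) -> list[int]:
    """Snap each 0-255 pixel value to the nearest multiple of `step`.
    Slightly different shades collapse to the same value -- lossy,
    because the original values cannot be recovered afterwards."""
    return [min(255, round(p / step) * step) for p in pixels]

# Three slightly different shades become one repeated value.
print(quantize([100, 101, 103]))  # [96, 96, 96]
```

After quantization a run-length or pattern-based compressor has much more repetition to exploit, which is where the extra compression comes from.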
What is lossless compression?
Lossless compression is a type of compression that reduces file sizes without any loss of information: the original file can be recreated exactly when decompressed. To achieve this, algorithms create reference points for patterns, store them in a table and send the table along with the now smaller encoded file. On decompression, the file is regenerated by substituting the reference points with the original information.
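A quick round trip through Python's standard zlib module (a DEFLATE-based lossless compressor, the same family of algorithm used by ZIP) demonstrates the byte-for-byte guarantee; the sample text is arbitrary.

```python
import zlib

original = b"lossless compression recreates the original exactly " * 40
packed = zlib.compress(original)
restored = zlib.decompress(packed)

assert restored == original          # exact replica, bit for bit
print(len(original), "->", len(packed))
```

The repetitive input shrinks dramatically, yet decompression returns exactly what went in, which is the defining property of lossless systems.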
When to use lossless compression?
Lossless compression is ideal for documents containing text and numerical data, where loss of information cannot be tolerated. ZIP compression, for instance, is a lossless compression that detects patterns and replaces them with a single character (plus an indicator). This relies on the fact that most files contain large amounts of whitespace or repetitive data. Notice, for example, that in the text you are reading right now, the word compression appears again and again, each occurrence taking 11 bytes of storage (one for each letter). A compression system notices this and, after the first occurrence, rather than store the actual word, stores a one-byte indicator marking it as a repeat word plus a byte identifying which word it is. As a result, each occurrence of compression now needs 2 bytes instead of 11, a saving of 9 bytes and over 80% of the space for that word. Repeating that process for the 256 most common words can make quite a difference to the size of a file. When decompressing the file, the decompression program finds these codes for repeated words and restores the full words in their place, thus restoring the document to its original size and content.
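The word-substitution scheme described above can be sketched like this; it is a toy model — the marker byte, the word table and the token format are illustrative choices, not the actual ZIP format, which uses the DEFLATE algorithm.

```python
MARKER = "\x01"  # illustrative one-byte "repeat word" indicator

def encode(text: str, common: list[str]) -> list[str]:
    """Replace every word found in `common` with a 2-byte token:
    the marker plus one byte indexing into the word table."""
    return [
        MARKER + chr(common.index(w)) if w in common else w
        for w in text.split()
    ]

def decode(tokens: list[str], common: list[str]) -> str:
    """Look repeated words back up in the shared table."""
    return " ".join(
        common[ord(t[1])] if t.startswith(MARKER) else t
        for t in tokens
    )

common = ["compression"]
tokens = encode("compression saves space so compression pays off", common)
print([len(t) for t in tokens])  # each 11-byte "compression" now costs 2
```

Decoding with the same table restores the sentence exactly, mirroring how the decompressor restores full words from their codes.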
What are the results?
The success of data compression depends largely on the data itself, because some data types are inherently more compressible than others. Generally, some elements within the data are more common than others, and most compression algorithms exploit this property, known as redundancy. The greater the redundancy within the data, the more successful the compression will be. In this regard, digital video has high redundancy, which makes it very suitable for compression.
A device (software or hardware) that compresses data is often known as an encoder or coder, whereas a device that decompresses data is known as a decoder. A device that acts as both a coder and a decoder is known as a codec. A great number of compression techniques have been developed, and some lossless techniques can be applied to any type of data. In recent years, the development of lossy techniques, specifically for image data, has contributed a great deal to the realisation of digital video applications. So much for compression in general, but what about compression of binaries?
As mentioned before, a compressed executable (or DLL) must be self-contained. Hence, it must be a self-extracting archive in which the compressed data is packaged together with the decompression code into an executable file. This way, no separate program is needed to execute a compressed executable file. The decompression code added to the compressed data is often called the decompression stub. Running a compressed executable essentially means that the decompression stub unpacks the original executable code before passing control to the recomposed original binary. The effect is the same as if the original executable had been run. To the casual user, compressed and uncompressed executables are indistinguishable.
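The stub idea can be mimicked in a few lines of Python — a toy script packer, not a real PE/ELF packer; the stub text and the base64 payload encoding are illustrative choices:

```python
import base64
import zlib

def pack(source: str) -> str:
    """Bundle compressed source code with a tiny decompression stub.
    Running the result behaves exactly like running the original source:
    the stub decompresses the payload, then hands control to it."""
    payload = base64.b64encode(zlib.compress(source.encode())).decode()
    return (
        "import base64, zlib\n"
        f"_payload = '{payload}'\n"
        "exec(zlib.decompress(base64.b64decode(_payload)))\n"
    )

packed = pack("greeting = 'hello from the unpacked code'")
scope: dict = {}
exec(packed, scope)        # the stub runs first, then the original code
print(scope["greeting"])
```

The packed text is self-contained: nothing outside it is needed to run it, which is the defining property of a self-extracting executable.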
What is packing?
The act of compressing an executable or DLL file is often referred to as packing, so a program that compresses executables is typically called a packer. Most packed executables decompress directly in memory and need no external file-system space to start. However, some decompression stubs are known to write the uncompressed executable to the file system in order to start it.
Why use packers?
Software distributors use executable compression for a variety of reasons, primarily to reduce the storage requirements of software. Executable compressors are specifically designed to compress executable code, which is why they often achieve better compression ratios than standard data compression programs. Compression allows distributors to stay within the constraints of their chosen distribution media (CD, DVD, ...), or to reduce the time and bandwidth customers need to download software distributed via the internet. There is another reason as well: executable compression is frequently used to deter reverse engineering or to obfuscate the contents of the executable through proprietary compression methods and/or added encryption. Malware is often compressed to hide its presence from antivirus scanners. Executable compression can prevent direct disassembly, mask string literals and modify signatures. However, it does not eliminate the possibility of reverse engineering; it can only slow the process down. In general, compression alone is insufficient to prevent cracking; protectors are far more reliable for that purpose.
Is the compressed executable slower?
Compressed software requires less storage space in the file system and thus takes less time to map its data from the file system into memory. On the other hand, it requires some time to decompress the data before execution begins. However, the speed of storage media has not kept up with average processor speeds, so storage is very often the bottleneck, and a compressed executable will therefore load faster on most common systems. This is somewhat theoretical, though: on modern desktop computers the difference is rarely noticeable unless the executable is unusually big, so loading speed is not a primary argument for or against compressing an executable. What compression does offer is the ability to store more software in the same amount of space, without the hassle of manually unpacking an archived file every time the user wants to run the software.
And for 64 bit (x64) systems?
Data compression itself works in exactly the same way on 32-bit and 64-bit systems, and compressing 32-bit and 64-bit executables yields comparable ratios. In fact, everything said here in general also applies specifically to 64-bit software. Although the original executables tend to be slightly smaller on 32-bit systems, 64-bit software often compresses at a better ratio because it contains more repeated patterns (only the same number of distinct bits and bytes exist on both). This makes it even more advisable to compress 64-bit software, compared to 32-bit software, to save space and time.