Last night, I had an interesting thought about file compression. I have no idea if this idea is new or not. First, let me take you through the train of thought that led me to it. I began by thinking about embedding an image within some text. One of the simplest methods for this would be something like Base64 encoding. Unfortunately, upon seeing some Base64-encoded text, many people would recognize it and could easily decode it to get the original image back. Perhaps, then, one could encode the image using more natural-looking text: maybe group the characters into two sets (one representing a 1 and the other a 0) and then generate a string of words where every nth character encodes a bit of the image. It seems unlikely that one could generate text in this manner that would look like real natural text. I then began to wonder if one could take an existing text document and create an algorithm that could extract an image from the text. For example, say one selected a section of the Bible and grouped its characters into two sets (maybe consonants are 1s and everything else is a 0). Could one then pick a number n such that every nth character of that section encoded a bit of the desired image? Probably not for a fixed section of the Bible, but the Bible is a long book. Maybe one could search the entire Bible for a section where this could be done. If this were possible, one could encode the image as a starting point in the Bible (and maybe a length) and an algorithm to extract it. Wait a second! That sounds like compression!
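To make that extraction rule concrete, here is a small C sketch under the assumptions just described: consonants count as 1s, everything else counts as 0s, and every nth character is sampled. The function names and the choice of n = 3 are purely my own illustration.

    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    /* 1 if c is a consonant letter, 0 for everything else (vowels, digits,
       punctuation, whitespace). */
    static int bit_from_char(char c)
    {
        c = (char)tolower((unsigned char)c);
        return (c >= 'a' && c <= 'z') && strchr("aeiou", c) == NULL;
    }

    /* Sample every nth character of text and pack the resulting bits,
       MSB-first, into out. Returns the number of whole bytes produced. */
    static size_t extract_bits(const char *text, size_t n,
                               unsigned char *out, size_t max_bytes)
    {
        size_t len = strlen(text), produced = 0;
        unsigned char current = 0;
        int bits = 0;

        for (size_t i = n - 1; i < len && produced < max_bytes; i += n) {
            current = (unsigned char)((current << 1) | bit_from_char(text[i]));
            if (++bits == 8) {
                out[produced++] = current;
                current = 0;
                bits = 0;
            }
        }
        return produced;
    }

    int main(void)
    {
        const char *text = "In the beginning God created the heaven and the earth.";
        unsigned char hidden[8];
        size_t got = extract_bits(text, 3, hidden, sizeof hidden);

        for (size_t i = 0; i < got; i++)
            printf("%02x ", hidden[i]);   /* the "hidden" bytes, as hex */
        printf("\n");
        return 0;
    }

Run against any text, this will happily produce some bytes; the whole question is whether those bytes can ever be made to match bytes you actually care about.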
Let’s expand the thought from images and text to all data. Could we pick some long file (let’s call it the master file) in place of the Bible and search it in a similar manner for any file (or subsections of a file)? The basic algorithm would look something like:
-Take a file or subsection of a file.
-Pick an algorithm (like checking every nth bit).
-Search through the master file using the algorithm to see if the file or subsection can be found.
-If it is found, the file or subsection can be converted into a location in the master file and an algorithm.
-If it is not found, pick a different algorithm or a smaller subsection and try again.
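A rough brute-force version of that search loop, in C, might look like the sketch below. It treats the master file as an in-memory byte buffer and tries only one family of extraction algorithms (take every nth bit starting at some offset); the function names and limits are my own, and a practical search would have to be far cleverer than this.

    #include <stddef.h>
    #include <stdio.h>

    /* Read bit i (0-indexed, MSB-first within each byte) from a buffer. */
    int get_bit(const unsigned char *buf, size_t i)
    {
        return (buf[i / 8] >> (7 - (i % 8))) & 1;
    }

    /* Brute-force search: can the target (target_bits long) be produced by
       taking every `step`th bit of the master buffer, starting at some offset?
       On success, stores the offset and step and returns 1; otherwise 0. */
    int find_in_master(const unsigned char *master, size_t master_bits,
                       const unsigned char *target, size_t target_bits,
                       size_t max_step, size_t *offset_out, size_t *step_out)
    {
        if (target_bits == 0)
            return 0;

        for (size_t step = 1; step <= max_step; step++) {
            size_t span = (target_bits - 1) * step + 1;   /* bits needed per attempt */
            if (span > master_bits)
                break;
            for (size_t start = 0; start <= master_bits - span; start++) {
                size_t i;
                for (i = 0; i < target_bits; i++)
                    if (get_bit(master, start + i * step) != get_bit(target, i))
                        break;
                if (i == target_bits) {
                    *offset_out = start;
                    *step_out = step;
                    return 1;
                }
            }
        }
        return 0;   /* not found: try another algorithm or a smaller subsection */
    }

    int main(void)
    {
        const unsigned char master[] = "In the beginning God created the heaven and the earth.";
        const unsigned char target[] = "heaven";
        size_t offset, step;

        if (find_in_master(master, 8 * (sizeof master - 1),
                           target, 8 * (sizeof target - 1),
                           16, &offset, &step))
            printf("found at bit offset %zu with step %zu\n", offset, step);
        else
            printf("not found\n");
        return 0;
    }

On success, (offset, step, length) is all the information needed to regenerate the data, which is exactly the compressed form described above.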
Unfortunately, depending on the sizes involved, such an algorithm could be pretty slow. I began thinking about what kind of file could be used as the master file. I first considered files that are large (1 GB+), publicly available, and likely to remain publicly available for a very long time. For a truly large master file, one could possibly use the Bitcoin blockchain. Then I thought about just generating the master file. I figured I could use a random number generator to do it. But wait. If I’m going to use a random number generator to create the master file, why create the file at all? I could just generate it (or parts of it) at runtime as needed. It turns out that a master file isn’t necessary. One can simply use a bit generator function like the C library function rand() with a set seed. An extra step then needs to be added to the algorithm to select a bit generator function and possibly vary it if the file or subsection isn’t found in the generated stream of bits.
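As a minimal sketch of what generating the master file at runtime could look like, here is a C fragment built on rand() with a fixed seed. rand() is a weak generator and its output varies between C libraries, so a real attempt would pin down a specific, portable generator; the function name here is my own.

    #include <stdio.h>
    #include <stdlib.h>

    /* Generate the first `count` bytes of the "virtual master file" defined by
       a given seed. Nothing is stored on disk; the stream is recreated on
       demand, which is what makes an explicit master file unnecessary. */
    void generate_master_bytes(unsigned int seed, unsigned char *out, size_t count)
    {
        srand(seed);
        for (size_t i = 0; i < count; i++)
            out[i] = (unsigned char)(rand() & 0xFF);   /* low 8 bits of each call */
    }

    int main(void)
    {
        unsigned char chunk[16];

        /* The same seed always yields the same stream (on the same C library),
           so (seed, offset, length) is enough to regenerate any piece of it. */
        generate_master_bytes(12345u, chunk, sizeof chunk);

        for (size_t i = 0; i < sizeof chunk; i++)
            printf("%02x ", chunk[i]);
        printf("\n");
        return 0;
    }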
If the file or subsection can be found in the generated bits, I think the compression would be very good; however, the likelihood of finding the file or subsection in the generated bits is probably pretty low. Let’s do some simple estimations. First, let’s consider what a compressed file or subsection might look like:
-Bit generator ID: 1 byte
-Bit generator parameter: 4 bytes (this would encode where in the generated stream to begin)
-Algorithm ID: 1 byte
-Algorithm parameter: 4 bytes
-Length (in bytes) of data to be generated by decompression: 8 bytes
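Written out as a C struct, that record might look like the following. The field names are my own, and because struct padding is compiler-dependent, a real format would serialize the 18 bytes explicitly rather than dumping the struct to disk.

    #include <stdint.h>

    /* One compressed record: 1 + 4 + 1 + 4 + 8 = 18 bytes of payload. */
    struct compressed_record {
        uint8_t  generator_id;      /* which bit generator function to use        */
        uint32_t generator_param;   /* e.g. seed / where in the stream to begin   */
        uint8_t  algorithm_id;      /* which extraction algorithm (every nth bit) */
        uint32_t algorithm_param;   /* e.g. the value of n                        */
        uint64_t original_length;   /* bytes to produce when decompressing        */
    };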
That totals 18 bytes. This means that, AT MOST, this algorithm could compress only 2^(18*8) possible files or subsections! That’s not very many. Nearly half of the compressed data is the length of the original data, which is very likely to be pretty small, so most of those bytes will always be 0, reducing our realistic possibilities even more. Since the 8-byte length field is determined by the input, only the other 10 bytes are free to vary, giving 2^(10*8) possible compressed representations for inputs of a given length, while there are 2^(8L) possible inputs of length L bytes. The probability of compressing a file or subsection of length L bytes would therefore be (2^(10*8))/(2^(8L)). This can be simplified to 2^(8(10-L)). Notice that for values of L < 10, we get a probability greater than 1. It doesn’t make sense to try to compress files smaller than what we could possibly compress them to anyway, so we can just ignore them. Let’s look at a few different lengths to get an idea of the probabilities:
Length (bytes) | Probability |
22 | 1.26 x 10^-29 |
100 | 1.81 x 10^-217 |
1000 | 6.96 x 10^-2385 |
5000 | 7.63 x 10^-12018 |
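Those values follow directly from the formula above: the base-10 exponent of the probability is 8(10 - L) x log10(2). Here is a quick C check for anyone who wants to reproduce the table (purely my own verification code):

    #include <math.h>
    #include <stdio.h>

    /* Compile with -lm. Prints each probability 2^(8*(10 - L)) as m x 10^e. */
    int main(void)
    {
        const int lengths[] = { 22, 100, 1000, 5000 };

        for (size_t i = 0; i < sizeof lengths / sizeof lengths[0]; i++) {
            double log10p   = 8.0 * (10.0 - lengths[i]) * log10(2.0);
            double exponent = floor(log10p);                  /* power of ten   */
            double mantissa = pow(10.0, log10p - exponent);   /* leading digits */
            printf("L = %4d: probability ~ %.2f x 10^%.0f\n",
                   lengths[i], mantissa, exponent);
        }
        return 0;
    }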
Even at 22 bytes (50% compression if only 1 byte is used for length), the probability of compression is extremely low and it gets much lower from there. It is also worth noting that the bit generator ID and algorithm ID are not likely to use all 8 bits, further reducing our probabilities. After crunching the numbers, it looks like this sort of compression would not be viable, but exploring the idea was still interesting.