
You laugh now, but when we develop a trivial method to calculate pi and other irrational constants to quadrillions of digits, this will be wonderful.


By the pigeonhole principle, no matter how fast you can calculate pi, you cannot actually use this to compress data. The index of the relevant sequence is on average >= the size of the data to be stored.


I was thinking so hard trying to give myself a clear explanation of why this type of compression was impossible! Thank you.

Here is the pigeonhole principle on wikipedia:

http://en.wikipedia.org/wiki/Pigeonhole_principle


The pigeonhole principle says that you cannot use pi to compress all data. You could certainly use this scheme to compress some data. For example, if you were storing a file containing the contents of pi (from the beginning), it would compress very well with this scheme.

This is the same tradeoff that all compression schemes make. gzip can compress some files, but others will grow. The hope is that the domain of files you actually want to apply it to will shrink instead of grow.

That said, given that the digits of pi are more or less uniformly distributed, there is no particular reason to believe that common bit patterns will be likely to appear sooner in pi than uncommon bit patterns. So it is unlikely that it will compress very many of the files you'd have sitting around on your hard drive.

But the pigeonhole principle alone does not prove this.
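The counting behind that caveat fits in a few lines. A minimal sketch (n = 8 is my choice, just to keep the numbers readable):

```python
# 2**n distinct n-bit files, but only 2**0 + ... + 2**(n-1) = 2**n - 1
# bit strings shorter than n bits: one pigeonhole short, always.
n = 8
files = 2 ** n
shorter = sum(2 ** k for k in range(n))
print(files, shorter)   # 256 255
```

The gap of exactly one holds for every n, which is why "compress everything" fails while "compress some things" remains possible.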


I know :D I tried it for the past 6 months and then gave up on it 2 months back, for the same reason. However, that search gave rise to a better idea for compressing files. It's not yet usable, but it doesn't have these issues. If you are interested you can read about it at https://www.academia.edu/7620004/Advanced_Compression_Techni... It is based on representing data as a position in a computable number series, not as the values themselves.


I don't get how this applies.

If you have a 100-gigabyte file you want to "compress" with Pi, all you have to do is find the beginning of that exact sequence in Pi and write down its location and the size (100 gigabytes). As long as the binary representation of the location within Pi takes less than 100 gigabytes, the file is now "compressed". Why wouldn't this work?


> Why wouldn't this work?

It would work, exactly like you say. However, you seem to intuitively be vastly underestimating how far into pi you have to go to find a particular bit pattern.

To sharpen your intuition of this, I recommend this website: http://pi.nersc.gov/

I tried searching pi for the string "zzzz" (this is 20 bits of information according to their somewhat weird encoding scheme). The decimal position in pi for this 20-bit string was 3725869808, which takes 32 bits to represent.

The same will be true in most cases -- the offset into pi will take more space to represent than the data itself! However, in some cases data absolutely can be compressed with this pi scheme. Just not often enough to be actually useful.

However, the fact that the data will usually be larger is not what the pigeonhole principle is about.

All the pigeonhole principle says is: it's not possible for every 100 GB file to have an offset in pi that takes less than 100 GB to represent. The pigeonhole principle still allows some files to be compressible with pi, just not all.

To prove this is simple: take every possible 100GB file (there are 2^100G of them). Now let's suppose that for every single one you can actually find a location in pi whose binary representation takes less than 100GB to represent. If you can do this, then it means that at least 2 of the input files mapped to the same location in pi (because there were 2^100G distinct input files but fewer than 2^100G pi offsets that we are allowed to use). Therefore, once "compressed", you can't tell the difference between the two input files that both mapped to the same location in pi!
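That collision is easy to exhibit at toy scale. In this sketch, `fake_compress` is a made-up stand-in for "offset into pi" — any map from 16 four-bit files into 8 three-bit outputs behaves the same way:

```python
from itertools import product

def fake_compress(bits):     # hypothetical stand-in for "offset into pi"
    return bits[:3]          # 16 inputs -> only 8 possible outputs

seen, collision = {}, None
for tup in product("01", repeat=4):
    f = "".join(tup)
    c = fake_compress(f)
    if c in seen:
        collision = (seen[c], f)   # two files, one "compressed" form
        break
    seen[c] = f
print(collision)   # ('0000', '0001')
```

Once two inputs share a compressed form, decompression cannot recover both — which is the whole proof.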


Aw shoot, where my comment above says "2100G" that was supposed to be "2^100G". HN ate my double-asterisk.


Let's say you want to store a 100kB file. There are 2^(8 x 102400) possible different 100kB files. The lowest amount of data you need to be able to distinguish the one you want from all the possible candidates is 100kB. If you could compress every file with the same algorithm, it would mean you'd necessarily be able to decompress the exact same compressed file into multiple different files.

Compression can only work because, for every byte you shave off a file you care about, you concede adding a byte to the representation of some file you don't care about.

> write down its location and the size

The location of any arbitrary sequence in pi is on average a much larger number than the sequence itself.


I don't get this at all. You're talking about multiple files, which has nothing at all to do with my proposal. And I have no idea what your second paragraph means at all.

The idea that a single reference point in a set is larger than the entire set itself is also ridiculous. If I have a 100-gigabyte file, and I want to point to the position in the middle, that is position 53687091200. That is an 11 byte string pointing to a single point in a 107 billion byte long file.

You just do the same thing with Pi, except in this case Pi is the 107-billion-byte file, and position 53687091200 points to the file's beginning point. The only thing you store is 11 bytes + the size of the file. (Of course you wouldn't store the location in ASCII, as binary would be a much smaller representation, but for these purposes I'm just using ASCII.) Obviously the size of Pi is infinite and the position would be much larger than 50 billion in, but the idea is the same.
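The midpoint arithmetic above is easy to sanity-check (assuming 100 GB here means 100 GiB, as the 53687091200 figure implies):

```python
size = 100 * 2 ** 30    # 100 GiB in bytes
mid = size // 2
print(size)             # 107374182400
print(mid)              # 53687091200
print(len(str(mid)))    # 11 ASCII digits name the midpoint
```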

Please tell me in simpler terms why this idea would not work?


> and the size of the position would be much larger than 50-billion in, but the idea is the same.

The crucial part is that the size of the position will be larger than the file you are seeking. This is unintuitive, but necessarily true for most possible files because of the pigeonhole principle. If you don't believe me, just try it out: pick random 4-digit sequences and find their indexes in pi. The mean of their indexes will be more than 4 digits long. The same is true of sequences of any length.
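That experiment is easy to run without the lookup site. The sketch below generates pi's decimals via Machin's formula in pure Python (`arctan_inv` and `pi_decimals` are my own helpers, not anything from the thread), scaled down to 3-digit patterns in the first 10,000 digits so it finishes quickly:

```python
def arctan_inv(x, one):
    # arctan(1/x) scaled by `one`, via the Taylor series in integer arithmetic
    total = term = one // x
    x2 = x * x
    n, sign = 1, -1
    while term:
        term //= x2
        n += 2
        total += sign * (term // n)
        sign = -sign
    return total

def pi_decimals(n):
    # Machin's formula: pi = 16*arctan(1/5) - 4*arctan(1/239)
    one = 10 ** (n + 10)                       # 10 guard digits
    pi = 16 * arctan_inv(5, one) - 4 * arctan_inv(239, one)
    return str(pi)[1:n + 1]                    # decimals only, "3" dropped

digits = pi_decimals(10000)

# First-occurrence index of every 3-digit pattern in the first 10k digits
indexes = []
for p in range(1000):
    i = digits.find("%03d" % p)
    if i >= 0:                # a few patterns may not appear this early
        indexes.append(i)

mean = sum(indexes) / len(indexes)
print(round(mean))            # on the order of 1000: the index takes about
                              # as many digits to write as the pattern itself
```

The mean first-occurrence index lands around 10^3, i.e. the pointer costs roughly as many digits as the pattern it points to — before you even account for storing the length.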

> I don't get this at all. You're talking about multiple files, which has nothing at all to do with my proposal.

No, I'm talking about multiple possible files, of which you must be able to compress and decompress any single one for your proposal to work in general.

A compression algorithm that shrinks every input is mathematically impossible. If you have an algorithm f() which can take any input, then in order for it to compress some input of length x into a form shorter than x, it must also expand some other input of length x into a form longer than x.

Or, to put in another way, the sum of len(f(y)) for all possible y where len(y) = x is always greater than or equal to sum of len(y).
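You can watch that inequality with a real compressor. This toy check runs stdlib zlib over every possible 2-byte input; zlib's container overhead makes the expansion dramatic here, but the counting argument guarantees expansion somewhere for any lossless scheme:

```python
import zlib

# Every possible 2-byte file, pushed through a real compressor
grew = sum(1 for i in range(65536)
           if len(zlib.compress(i.to_bytes(2, "big"))) > 2)
print(grew)   # 65536: every single 2-byte input came out larger
```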


Let's say you wanted to encode a single decimal digit using this PI method. So, to start, you list out PI until you get all the possible decimal digits:

    03 14159 26535 897
    01 23456 78901 234
There, that's all of them. I cheated a bit and put a zero in the front, as the first zero in the actual expansion doesn't appear until the 32nd decimal digit. So now, we can represent any decimal digit by indexing into PI!

    0 - 0
    1 - 2
    2 - 7
    3 - 1
    4 - 3
    5 - 5
    6 - 8
    7 - 14
    8 - 12
    9 - 6
So, what did we accomplish? Well, zero stays the same (and so does 5, coincidentally). We successfully renamed 1-4, 6, and 9 to other single digits. And now 7 and 8 take two decimal digits each.

Now, extrapolate this out to every possible 100GB file, and the binary representation of PI.
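The extrapolation is easier to see with the toy table in code. This sketch just rebuilds the mapping above and tallies the cost in digits:

```python
digits = "031415926535897"   # pi's digits with the leading 0, as above

# first index in `digits` of each decimal digit
table = {d: digits.index(d) for d in "0123456789"}

# cost, in decimal digits, of each index vs. the 1 digit it encodes
avg = sum(len(str(i)) for i in table.values()) / 10
print(table["7"], table["8"])   # 14 12 -- the two-digit indexes
print(avg)                      # 1.2: longer than the data on average
```

Even in this best case — a 15-digit "pi" that happens to contain every symbol — the encoding averages 1.2 digits per digit of input.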


For most candidates, the binary representation of the location will be 100GB or more.


I bet I could take a few moments and come up with a proof that such a method would require P=NP.


I can't find an authoritative source on this, but the algorithm used by SuperPi (https://en.wikipedia.org/wiki/Gauss%E2%80%93Legendre_algorit...) seems, at first glance, polynomial in the number of digits you want to calculate.


Even if that's the case, wouldn't the number of digits of pi that must be computed to find an occurrence of a random bit string of length n be non-polynomial in n?
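Right — heuristically, the expected first-occurrence index of a random n-bit pattern in a random bit stream is on the order of 2^n, so the search is exponential in n no matter how cheap each digit is. A rough Monte Carlo sketch (random bits standing in for pi's bits, which are conjectured to behave similarly; the seed and sizes are arbitrary choices):

```python
import random

random.seed(0)                 # deterministic sketch
n, trials, total = 8, 200, 0
for _ in range(trials):
    pattern = "".join(random.choice("01") for _ in range(n))
    stream = "".join(random.choice("01") for _ in range(8000))
    idx = stream.find(pattern)
    total += idx if idx >= 0 else len(stream)
mean = total / trials
print(round(mean))             # near 2**8 = 256; doubling n roughly squares it
```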



