The data compression world is all abuzz about Marcus Hutter’s recently announced 50,000 euro prize for record-breaking data compressors. Marcus, of the Swiss Dalle Molle Institute for Artificial Intelligence, apparently in cahoots with Florida compression maven Matt Mahoney, is offering cash prizes for what amounts to the most impressive ability to compress 100 MBytes of Wikipedia data. (Note that nobody is going to win exactly 50,000 euros – the prize amount is prorated based on how well you beat the current record.)
This prize differs considerably from my Million Digit Challenge, which is really nothing more than an attempt to silence people foolishly claiming to be able to compress random data. Marcus is instead looking for the most effective way to reproduce the Wiki data, and he’s putting up real money as an incentive. The benchmark that contestants need to beat is that set by Matt Mahoney’s paq8f, the current record holder at 18.3 MB. (Alexander Ratushnyak’s submission of a paq variant looks to clock in at a tidy 17.6 MB, and should soon be confirmed as the new standard.)
So why is an AI guy inserting himself into the world of compression? Well, Marcus realizes that good data compression is all about modeling the data. The better you understand the data stream, the better you can predict the incoming tokens in that stream. Claude Shannon empirically found that humans could model English text with an entropy of 0.6 to 1.3 bits per character, which at best should mean that 100 MB of Wikipedia data could be reduced to 7.5 MB, with an upper bound of perhaps 16.25 MB. The theory is that reaching that 7.5 MB range is going to take such a good understanding of the data stream that it will amount to a demonstration of Artificial Intelligence.
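Spelling out the arithmetic behind those bounds (my own back-of-the-envelope, assuming one byte per character in the source text):

\[ \frac{10^8~\text{chars} \times H~\text{bits/char}}{8~\text{bits/byte}} = 12.5\,H~\text{MB} \]

so \(H = 0.6\) gives 7.5 MB and \(H = 1.3\) gives 16.25 MB. For comparison, the 18.3 MB record works out to about 1.46 bits per character.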
At least, that’s my take on it. When the prize was first announced on comp.compression, that line of thought immediately stirred up quite a bit of controversy, with the Hutter position represented in a post by Mahoney:
There are many arguments that compression implies AI (but not vice versa).
1. Hutter proved that the optimal behavior of a rational agent in a computable environment (to maximize an accumulated reward signal from the environment) is to assume the environment is controlled by the shortest program consistent with all observations so far.
2. A natural language model (the ability to compute the probability p(s) of any text string s in human dialog) enables one to pass the Turing test for AI.
3. Humans exceed machines in the ability to predict successive characters or words in running text, as measured by entropy bounds (estimated from the Shannon game, or compressed size, respectively). Text prediction requires vast, real-world knowledge that machines lack.
These arguments are described in more detail here.
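To make the prediction-compression link concrete, here is a minimal sketch of my own (not anything taken from paq8): under an ideal arithmetic coder, a symbol the model assigns probability p costs about -log2(p) bits, so a model that predicts well compresses well. The toy model below is just an adaptive order-0 byte counter; real contenders blend hundreds of specialized context models.

[c]
#include <stdio.h>
#include <string.h>
#include <math.h>

/* Ideal code length of a string under an adaptive order-0 model.
 * Each symbol costs -log2(p) bits, where p comes from running
 * counts; better predictions mean fewer bits, which is the whole
 * link between modeling and compression. */
double ideal_bits(const char *s)
{
    double counts[256];
    double total = 256.0, bits = 0.0;
    for (int i = 0; i < 256; i++)
        counts[i] = 1.0;                   /* Laplace smoothing */
    for (const unsigned char *p = (const unsigned char *)s; *p; p++) {
        bits += -log2(counts[*p] / total); /* cost of this symbol */
        counts[*p] += 1.0;                 /* adapt the model */
        total += 1.0;
    }
    return bits;
}

int main(void)
{
    const char *text = "roses are red roses are red roses are red";
    printf("%zu chars coded in %.1f bits (%.2f bits/char)\n",
           strlen(text), ideal_bits(text),
           ideal_bits(text) / strlen(text));
    return 0;
}
[/c]

An order-0 model only learns character frequencies, but even that is enough to push the cost well below 8 bits per character on skewed text; the higher-order context models are what make paq-class compressors strong.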
So part of this is predicated on the assertion that passing the Turing Test is equivalent to AI, and that opens up a whole new can of worms. (Try explaining why this is true to an undergraduate class on Intro to Cognitive Science if you want to get a handle on the passion it can raise.)
Personally, I think leadership in the Hutter Prize will be gained using many of the same techniques that have been used to develop winning chess programs. The best compression programs (such as paq8) are big bags of heuristics, hacks, and code tuned for specific models. If it turns out that intelligence is coded that way, I won’t be surprised, but those hoping for an elegant model will probably be disappointed.
To keep up with ongoing developments in the Hutter Prize chase, use Google Groups to subscribe to comp.compression. Awards will be given every time someone leapfrogs the current winner, so expect interest in the prize to stay high for the foreseeable future.
8 users commented in "The Hutter Prize"
It is true today that data compression programs are not very smart, but I think that will have to change to make progress in the Hutter competition. English has a lot of redundancy that is difficult to exploit. For example, it is hard for a program to know that “roses are red” is more likely than “roses are green” or even “roses red are”. A compressor that can learn syntax and semantics from text and apply that knowledge to predict future text will do better than one that can’t.
Also, I believe Shannon estimated the entropy of written English to be about 0.6 to 1.3 bits per character. I got those numbers by eyeballing a graph in Shannon’s original paper, “Prediction and Entropy of Printed English”, Bell Sys. Tech. J. 30, pp. 50-64, 1951. The paper is not online as far as I know. I don’t know where Wikipedia got the numbers 1.1 to 1.6 bpc. Cover and King (1978) used a text prediction gambling game to estimate the upper bound at 1.3 bpc.
Thanks for the correction on the Shannon estimation, Matt. Lacking an online copy of the paper, I checked my copy of Text Compression by Bell, Cleary, and Witten, and they cite the same upper and lower bounds as you, 0.6 to 1.3.
Which is good, because it means that there is still a lot of room left for Hutter prizewinners. If the 0.6 lower bound were to hold, we could see the 100 MB corpus reduced to 7.5 MB, less than half the size of the best efforts to date. And I am a believer: getting to that point is going to require something that starts to look like intelligence. (And as Turing proposes, if it looks like intelligence…)
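Matt’s “roses are red” example can be made concrete with a toy model. The sketch below is my own illustration (nothing like the real paq machinery): it trains word-bigram counts on a few sentences, then scores candidate phrases by their ideal code length in bits. Even this tiny model prefers “roses are red” to “roses red are”; knowing that roses are unlikely to be green takes vastly more world knowledge.

[c]
#include <stdio.h>
#include <string.h>
#include <math.h>

#define MAXW 64

/* Tiny word-bigram model: count adjacent word pairs in a training
 * corpus, then score candidate phrases by summed -log2 of the
 * smoothed bigram probabilities. Fewer bits = more predictable. */

static char vocab[MAXW][16];
static int  nvocab;
static int  bigram[MAXW][MAXW];
static int  unigram[MAXW];

static int word_id(const char *w)
{
    for (int i = 0; i < nvocab; i++)
        if (strcmp(vocab[i], w) == 0) return i;
    strcpy(vocab[nvocab], w);
    return nvocab++;
}

static void train(char *text)
{
    int prev = -1;
    for (char *w = strtok(text, " "); w; w = strtok(NULL, " ")) {
        int id = word_id(w);
        if (prev >= 0) { bigram[prev][id]++; unigram[prev]++; }
        prev = id;
    }
}

static double score(char *text)
{
    double bits = 0.0;
    int prev = -1;
    for (char *w = strtok(text, " "); w; w = strtok(NULL, " ")) {
        int id = word_id(w);
        if (prev >= 0) {
            double p = (bigram[prev][id] + 1.0) /
                       (unigram[prev] + nvocab); /* add-one smoothing */
            bits += -log2(p);
        }
        prev = id;
    }
    return bits;
}

int main(void)
{
    char corpus[] = "roses are red violets are blue grass is green";
    char good[]   = "roses are red";
    char bad[]    = "roses red are";
    train(corpus);
    printf("'roses are red' : %.2f bits\n", score(good));
    printf("'roses red are' : %.2f bits\n", score(bad));
    return 0;
}
[/c]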
Alexander Ratushnyak’s latest (Aug 29) entry, paq8hp3, if confirmed, compresses enwik8 to 1.394 bits per character. This represents nearly a 5% improvement since the contest began, whereas historically the improvement has been approximately 3% per year.
When Mr. Ratushnyak achieved 4% improvement with paq8hp2, some speculated that all “low hanging fruit” was “gone, gone, gone”.
For Marcus Hutter’s sake I hope there is a reprieve for at least a few months so some other donors can feel comfortable placing their money “at risk” in this purse.
However, there is still the potential that someone will soon field a serious language model and push compression down to the top end of Shannon’s estimate of human prediction ability, 1.3 bits per character.
Looks like the existence of the prize, and perhaps the prize money, are having the desired effect of accelerating the rate of improvement.
Looks like somebody at Brown scanned the BSTJ article and put it on-line as a PDF: http://www.cs.brown.edu/courses/cs195-5/extras/shannon-1951.pdf
The quality is, erm, shall we say, a good test of human ability to reconstruct information from context.
err, i never know whether to bless them or curse them. it bothers me that academic papers are locked up and inaccessible to the general public behind pay-to-read journal sites like ieee xplore, especially when you consider that we, the taxpayers, pay for 90% of the stuff that is published there.
on the other hand, it is copyrighted material, and under current law the authors do have the right to keep it private.
meanwhile i suppose i ought to snag a copy…
Translate it into Chinese and you can probably get beyond the Shannon limit. Perhaps Google will be able to do that.
1 Chinese character = 2 bytes.
1 Latin word ≈ 7 bytes ≈ 1 Chinese word = 2 characters = 4 bytes.
Run whatever conventional compression routine over that, and there you have it.
Of course, this compression may lose information, or may gain more, depending on how good your translator is. And it’s not 100% reversible.
Because you are using a sample from Wikipedia, not random data, there is already some internal information in it (or, so to speak, quantum entanglement). Good luck to y’all!
@jiahb:
I don’t know how much this has been studied, but I have a strong feeling that using an order-0 or order-1 model to compress equivalent English and Chinese texts will yield very similar results. I’d be surprised if there is much to gain by what amounts to tokenizing English text.
- Mark
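That intuition is easy to put to a rough test. The sketch below (my addition, not something from the contest) computes the order-0 entropy of a file in bits per byte; run it over equivalent English and Chinese texts (say, a UTF-16 encoding of the Chinese version) to see how much, if anything, the denser writing system buys once a byte-level model takes over.

[c]
#include <stdio.h>
#include <math.h>

/* Order-0 entropy of a byte stream, in bits per byte. */
double entropy_bits_per_byte(FILE *f)
{
    long counts[256] = {0};
    long total = 0;
    int c;
    while ((c = fgetc(f)) != EOF) { counts[c]++; total++; }
    double h = 0.0;
    for (int i = 0; i < 256; i++) {
        if (counts[i] == 0) continue;
        double p = (double)counts[i] / total;
        h -= p * log2(p);
    }
    return h;
}

int main(int argc, char **argv)
{
    /* File names are whatever pair of texts you want to compare. */
    for (int i = 1; i < argc; i++) {
        FILE *f = fopen(argv[i], "rb");
        if (!f) { perror(argv[i]); continue; }
        printf("%s: %.3f bits/byte\n", argv[i], entropy_bits_per_byte(f));
        fclose(f);
    }
    return 0;
}
[/c]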
Leave A Reply
You can insert source code in your comment without fear of too much mangling by tagging it to use the iG:Syntax Hiliter plugin. A typical usage of this plugin will look like this:

[c]
#include <stdio.h>

int main()
{
    printf( "Hello, world!\n" );
    return 0;
}
[/c]

Note that tags are enclosed in square brackets, not angle brackets. Tags currently supported by this plugin are: as (ActionScript), asp, c, cpp, csharp, css, delphi, html, java, js, mysql, perl, python, ruby, smarty, sql, vb, vbnet, xml, code (Generic). If you post your comment and you aren't happy with the way it looks, I will do everything I can to edit it to your satisfaction.