Friday, June 26, 2009

Pareto, Zipf, Heap: The 80-20 Rule, Language, and Diminishing Returns

Please consider this a sort of layman’s disambiguation page.

The classic “80-20 rule” refers to a Pareto distribution.  The thin distribution of the 20% is the subject of Chris Anderson’s “The Long Tail”.  Originally, the Pareto distribution referred to the fact that 20% of the people control 80% of the wealth, but it has turned up in many other contexts.

A Zipf’s Law is about rankings and frequency.  The second item’s frequency will be half of the first; the third’s frequency will be half the second, and so on.  The “half” may be some other factor, but it remains constant in the distribution; the rank is inversely proportional to the frequency. The item in Zipf’s Law is a word and its frequency is its appearance in a corpus of English text.  However, Zipf’s Law holds for texts generated from a fixed alphabet by picking letters randomly with a uniform distribution.

Heaps’ Law is about diminishing returns.  It has an exact formula, but generally it says that the more you look into a text, the fewer new discoveries of words you’ll find.  So, as you read through the text it takes longer and longer to find new words in the text.  Heaps’ Law applied to the general case where the “words” are just classifiers of some collection of things.  So, it could be applied to the nationality of a collection of people; you’d have to gather more and more people from a random sampling to get a representative from all countries.

The implications of these laws in various contexts are the subject of much interesting study and postulation.