[Academic] Zipf, Power-laws, and Pareto

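Zipf, power laws, and Pareto are terms that come up often in papers. All three are used to describe the same phenomenon: "large events are rare, but small ones quite common". For example, large earthquakes are rare, while small earthquakes are very common. There are few large cities but many small towns. Some words are rarely seen, while others (such as 'and' and 'the') appear all the time.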

Reference: http://ginger.hpl.hp.com/shl/papers/ranking/ranking.html

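Zipf's law usually describes how the count or frequency of an event (its 'size') relates to its rank r. George Kingsley Zipf, a linguist, wanted to know how frequently a given English word occurs, for example how often the 3rd, 8th, or 100th most common word appears. Zipf's law states that the frequency y of an event is inversely proportional to a power of its rank r: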

y ~ r^-b, with b close to unity.
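To make the idea of rank concrete, here is a minimal sketch of how a rank-frequency table can be built from a list of words. The function name `rank_frequency` and the toy sentence are made up for illustration; a real test of Zipf's law needs a far larger corpus.

```python
from collections import Counter

def rank_frequency(words):
    """Return (rank, word, count) tuples, most frequent word first."""
    counts = Counter(words)
    # most_common() orders words by descending count; ranks start at 1
    return [(rank, word, count)
            for rank, (word, count) in enumerate(counts.most_common(), start=1)]

# Toy corpus (hypothetical, much too small to show Zipf's law reliably)
text = "the cat and the dog and the bird saw the cat"
table = rank_frequency(text.split())
for rank, word, count in table:
    print(rank, word, count)
```

Plotting count against rank for a large corpus is what produces the y ~ r^-b relationship described above.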

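Pareto, on the other hand, was interested in the distribution of income. Instead of asking what the r-th person's income is, Pareto asked how many people have an income greater than x. Pareto's law is stated in terms of the cumulative distribution function (CDF), i.e. the number of events larger than x is an inverse power of x: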

P[X > x] ~ x^-k.


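A power law distribution is not concerned with how many people have an income greater than x; instead, it is concerned with how many people have an income of exactly x. It is simply the probability distribution function (PDF) associated with the CDF given by Pareto's Law. This means that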

P[X = x] ~ x^-(k+1) = x^-a.

That is, the exponent of the power law distribution is a = 1+k (where k is the Pareto distribution shape parameter).
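The step from k to k+1 comes from differentiating the CDF. The sketch below checks this numerically. It assumes a Pareto distribution with minimum value 1, so that P[X > x] = x^-k holds exactly for x >= 1; the value k = 1.5 and the helper names are arbitrary choices for illustration.

```python
def ccdf(x, k):
    """P[X > x] for a Pareto distribution with minimum value 1 (assumed)."""
    return x ** -k

def pdf(x, k):
    """Density from differentiating 1 - ccdf: k * x^-(k+1)."""
    return k * x ** -(k + 1)

def numeric_density(x, k, h=1e-6):
    """Estimate the density as minus the slope of the CCDF (central difference)."""
    return -(ccdf(x + h, k) - ccdf(x - h, k)) / (2 * h)

k = 1.5  # illustrative shape parameter, not taken from any data set
for x in (2.0, 10.0, 50.0):
    print(x, numeric_density(x, k), pdf(x, k))
```

The two printed columns should agree closely, which is the statement that the PDF falls off as x^-(k+1).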

In recent years, many researchers have found that the Internet exhibits many phenomena that follow power-law distributions: the number of visits to a site, the number of pages within a site, and the number of links to a page, to name a few.

Figure 1a shows the distribution of AOL users' visits to different web sites on a December day in 1997. The plot shows that only a few sites had more than 2,000 visitors, while most sites had very few (70,000 sites had only a single visitor). The distribution is so extreme that if the full range were shown on the axes, the curve would be a perfect L shape. Figure 1b below shows the same plot on a log-log scale, where the same distribution shows itself to be linear. This is the characteristic signature of a power-law.

Fig. 1a Linear scale plot of the distribution of users among web sites

Fig. 1b Log-log scale plot of the distribution of users among web sites

Let y = number of sites that were visited by x users.

In a power-law we have y = C x^-a, which means that log(y) = log(C) - a log(x).

So a power-law with exponent a is seen as a straight line with slope -a on a log-log plot.
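The straight-line relationship suggests a simple way to read off the exponent: fit a line to (log x, log y) and take minus the slope. The following is a minimal sketch using ordinary least squares on synthetic, noise-free data. The function name `fit_power_law` and the exponent 2 are assumptions for illustration, and on real, noisy counts a fit like this can be biased, so it should be read as a demonstration of the algebra rather than a recommended estimator.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares line through (log x, log y); returns (a, C) for y = C * x^-a."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    slope = (sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
             / sum((u - mean_x) ** 2 for u in lx))
    intercept = mean_y - slope * mean_x
    # slope = -a and intercept = log(C), from log(y) = log(C) - a log(x)
    return -slope, math.exp(intercept)

# Synthetic data from y = 70000 * x^-2 (the exponent is made up for this example)
xs = [1, 2, 5, 10, 100, 1000]
ys = [70000 * x ** -2.0 for x in xs]
a, C = fit_power_law(xs, ys)
print(round(a, 6), round(C, 3))
```

Because the data here lie exactly on a power law, the fit recovers the exponent and the constant that were used to generate them.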

In summary, Zipf, power laws, and Pareto all describe the same phenomenon, just from different points of view.
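One way to see why the Zipf and Pareto views coincide: saying that the r-th largest income is x is the same as saying that r people have an income of at least x. The sketch below checks this on a made-up list of incomes with no ties. It counts values that are at least x rather than strictly greater than x, so that rank 1 corresponds to a count of 1; the function names are hypothetical.

```python
def rank_size(sizes):
    """Zipf view: list of (rank, size), largest size first."""
    ordered = sorted(sizes, reverse=True)
    return list(enumerate(ordered, start=1))

def count_at_least(sizes, x):
    """Pareto view: how many observations are at least x."""
    return sum(1 for s in sizes if s >= x)

# Made-up incomes with no ties, purely for illustration
incomes = [120, 15, 48, 900, 33, 7, 260]
for rank, size in rank_size(incomes):
    # The r-th largest value is matched or exceeded by exactly r observations
    print(rank, size, count_at_least(incomes, size))
```

In other words, a rank-size plot is the cumulative count plot with its axes swapped.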

