Talk:Zipf's law

This is an old revision of this page, as edited by The Anome at 09:56, 18 April 2002. It may differ significantly from the current revision.

Is it true that the word "the" does indeed occur about twice as often as the next most common English word? The rest of the article seems to allow for some proportionality constants. AxelBoldt

It's not true, so I replaced it with a statement about Shakespeare's plays. AxelBoldt


The main article claimed that the frequency distribution of

  • words generated by random typing

follows Zipf's law. I doubt that very much. For one thing, if you type randomly, all words of length one will be equally likely, all words of length 2 will be equally likely, and so on. Or am I missing something? Maybe we should perform a little Perl experiment. AxelBoldt

I tried this Python code, and plotted the results in a log-log plot -- the early ranks are a bit stepped, but the overall pattern fits Zipf's law rather well. The Anome

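import random

N = 10000  # characters typed per pass
M = 100    # number of passes

# Skewed alphabet: the space is the most likely character (5 in 15),
# and 'a' and 'c' are more likely than the other letters.
ALPHABET = 'aaabccdefg     '

words = {}  # word -> number of occurrences
for j in range(M):
    # Type N random characters, then split the result into words at the spaces
    text = ''.join(random.choice(ALPHABET) for i in range(N))
    for word in text.split():
        words[word] = words.get(word, 0) + 1
    print('did string pass', j)

# Counts in descending order, so vals[rank-1] is the count at that rank
vals = sorted(words.values(), reverse=True)

# Let's just have the first few ranks
useranks = min(1000, len(vals))
with open('zipf_ranks.txt', 'w') as f:
    for rank in range(1, useranks + 1):
        f.write("%d %d\n" % (rank, vals[rank - 1]))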
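The two observations above (equal-length words are equally likely, yet the log-log plot still looks Zipf-like) can be reconciled with a short calculation. This is only a sketch under a simplifying assumption that differs from the skewed alphabet in the code: each of k letters and the space is typed with the same probability 1/(k+1). The symbols k, n, r and the constant C are introduced here for illustration.

```latex
% A particular word w of length n is n specific letters followed by a space,
% so its probability falls off geometrically with its length:
\[
  P(w) = C\,(k+1)^{-n}.
\]
% There are k^n words of length n, all equally likely, so they occupy the ranks
\[
  \frac{k^{n}-k}{k-1} < r \le \frac{k^{n+1}-k}{k-1},
  \qquad\text{i.e. } r \text{ is of order } k^{n}
  \text{ and } n \approx \log_k r .
\]
% Substituting n into P(w) gives an approximate power law in the rank:
\[
  P(r) \approx C\,(k+1)^{-\log_k r}
       = C\, r^{-\log(k+1)/\log k}.
\]
```

For k = 26 the exponent is log 27 / log 26, roughly 1.01, which is close to the exponent 1 of Zipf's law. The equal probabilities within each word length would show up as flat steps in the plot, most visibly at the early ranks.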

Could you repeat the experiment with all letters and the space getting the same probability? That's at least what I thought of when I heard "random typing". AxelBoldt


I can't right now, but I'll give you the reason for the skewed probabilities -- the space is by far the most common character in English, and other chars have different probabilities -- I wanted to model that. The Anome
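For anyone who wants to try the equal-probability variant requested above, here is a minimal sketch written for Python 3. It is not The Anome's code: the function names `random_typing_counts` and `rank_frequencies`, the `seed` parameter, and the choice of 26 lowercase letters plus one space are all assumptions made for illustration.

```python
import random
import string
from collections import Counter

# Assumption: "random typing" means the 26 lowercase letters plus the space,
# each chosen with equal probability.
UNIFORM_ALPHABET = string.ascii_lowercase + ' '


def random_typing_counts(alphabet, n_chars, seed=None):
    """Type n_chars characters uniformly at random and count the resulting words."""
    rng = random.Random(seed)  # a fixed seed makes a run repeatable
    text = ''.join(rng.choice(alphabet) for _ in range(n_chars))
    return Counter(text.split())  # split() drops the empty runs between spaces


def rank_frequencies(counts, max_ranks=1000):
    """Return (rank, count) pairs, most frequent word first, rank starting at 1."""
    ordered = sorted(counts.values(), reverse=True)
    return list(enumerate(ordered[:max_ranks], start=1))


if __name__ == '__main__':
    counts = random_typing_counts(UNIFORM_ALPHABET, 200000, seed=0)
    for rank, count in rank_frequencies(counts, max_ranks=10):
        print(rank, count)
```

Passing `'aaabccdefg     '` as the alphabet instead should reproduce the skewed setup from the earlier experiment, since `choice` picks each position of the string with equal probability, so the two cases can be plotted side by side.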