Is it true that the word "the" does indeed occur about twice as often as the next most common English word? The rest of the article seems to allow for some proportionality constants. AxelBoldt
It's not true, so I replaced it with a statement about Shakespeare's plays. AxelBoldt
The main article claimed that the frequency distribution of words generated by random typing follows Zipf's law. I doubt that very much. For one thing, if you type randomly, all words of length one will be equally likely, all words of length 2 will be equally likely and so on. Or am I missing something? Maybe we should perform a little Perl experiment. AxelBoldt
I tried this Python code, and plotted the results in a log-log plot -- the early ranks are a bit stepped, but the overall pattern fits Zipf's law rather well. The Anome
    import random

    N = 10000  # characters typed per pass
    M = 100    # number of passes

    words = {}  # word -> number of occurrences over all passes

    for j in range(M):
        # Type N random characters; letters repeated in the string
        # below are proportionally more likely to be chosen.
        chars = [random.choice('aaabccdefg ') for i in range(N)]
        # Split the typed text on whitespace and count each word.
        for word in ''.join(chars).split():
            words[word] = words.get(word, 0) + 1
        print('did string pass', j)

    # Counts in descending order: index i holds the count for rank i+1.
    vals = sorted(words.values(), reverse=True)

    # Let's just have the first few ranks
    useranks = min(1000, len(vals))
    with open('zipf_ranks.txt', 'w') as file:
        for i in range(useranks):
            rank = i + 1
            file.write("%d %d\n" % (rank, vals[rank - 1]))
Could you repeat the experiment with all letters and the space getting the same probability? That's at least what I thought of when I heard "random typing". AxelBoldt
I can't right now, but I'll give you the reason for the skewed probabilities -- the space is by far the most common character in English, and other chars have different probabilities -- I wanted to model that. The Anome
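For anyone who wants to try the uniform variant asked about above, here is a minimal sketch. It is not the code used for the earlier plot: the function name random_typing_counts, the 26-letter lowercase alphabet, the sample size and the fixed seed are all assumptions made for illustration.

```python
import random
import string
from collections import Counter


def random_typing_counts(n_chars, alphabet=string.ascii_lowercase, seed=0):
    """Count the words produced by typing n_chars symbols at random.

    Every letter in `alphabet` and the space are equally likely
    (hypothetical helper, written only for this illustration).
    """
    rng = random.Random(seed)      # fixed seed so a run can be repeated
    symbols = alphabet + ' '       # the space gets the same weight as a letter
    text = ''.join(rng.choice(symbols) for _ in range(n_chars))
    return Counter(text.split())   # split() discards runs of spaces


if __name__ == '__main__':
    counts = random_typing_counts(1_000_000)
    ranked = counts.most_common()  # (word, count) pairs, most frequent first
    # Show a few ranks on either side of the word-length boundaries.
    for rank in (1, 10, 26, 27, 100, 702, 703, 1000):
        if rank <= len(ranked):
            word, count = ranked[rank - 1]
            print(rank, count, word)
```

With equal probabilities, all words of the same length have the same expected count, so the ranked curve should fall in steps (ranks 1 to 26 for one-letter words, 27 to 702 for two-letter words, and so on). With k letters there are k^n words of length n, each with probability proportional to (k+1)^(-n), so on a log-log plot the steps should follow a line of slope about -log(k+1)/log(k), which is roughly -1.01 for k = 26.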