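As the title says, I want to check whether the Zipf distribution of the words in my corpus behaves as expected. I have already looked at similar questions on Stack Overflow and other sites. My favorite answer so far is the one by Anil_M in "Zipf Distribution: How do I measure Zipf Distribution", which uses this code: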
import re
from operator import itemgetter
import matplotlib.pyplot as plt
from scipy import special
import numpy as np
#Get our corpus of medical words
frequency = {}
open_file = open('d2016.bin', 'r')
file_to_string = open_file.read()
words = re.findall(r'(\b[A-Za-z][a-z]{2,9}\b)', file_to_string)
#build dict of words based on frequency
for word in words:
    count = frequency.get(word,0)
    frequency[word] = count + 1
#limit words to 1000
n = 1000
frequency = {key:value for key,value in frequency.items()[0:n]}
#convert value of frequency to numpy array
s = frequency.values()
s = np.array(s)
#Calculate zipf and plot the data
a = 2. # distribution parameter
count, bins, ignored = plt.hist(s[s<50], 50, normed=True)
x = np.arange(1., 50.)
y = x**(-a) / special.zetac(a)
plt.plot(x, y/max(y), linewidth=2, color='r')
plt.show()
For me, this code does not work. When I try to run it, I get the following error:
frequency = {key:value for key,value in frequency.items()[0:n]}
TypeError: 'dict_items' object is not subscriptable
If I drop the [0:n] part (I don't need to limit the number of words), I get a different error:
count, bins, ignored = plt.hist(s[s<50], 50, normed=True)
TypeError: '<' not supported between instances of 'dict_values' and 'int'
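Both tracebacks look consistent with running Python 2 code under Python 3, where `dict.items()` and `dict.values()` return view objects rather than lists. Below is a minimal sketch of that behavior and one possible workaround. The toy dictionary and the names `limited` and `small` are made up for illustration and stand in for the real corpus counts.

```python
import numpy as np

# Toy frequency table (made-up values, standing in for the real corpus counts)
frequency = {"the": 120, "cell": 45, "protein": 30, "gene": 12, "assay": 3}
n = 3

# Python 3: dict.items() is a view and cannot be sliced,
# so materialise it as a list first
limited = dict(list(frequency.items())[:n])

# Likewise, convert the values view to a list before building the array;
# np.array(frequency.values()) would wrap the view object itself
s = np.array(list(frequency.values()))

# The boolean mask now works because s is a numeric array
small = s[s < 50]
```

Separately, newer Matplotlib releases removed the `normed` argument of `plt.hist` in favor of `density`, so that call would probably need `density=True` as well.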
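The libraries seem to have changed since the code was written two years ago. Is there any way to make this code work again? I really like how clearly it shows the word distribution alongside the exact Zipf distribution. Another, simpler approach is: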
from nltk import word_tokenize
import nltk
from nltk.book import *
import matplotlib
import numpy
f = open('test.txt', encoding='utf8')
raw = f.read()
tokens = word_tokenize(raw)
tokensNLTK = Text(tokens)
fdist1 = FreqDist(tokensNLTK)
print(fdist1.most_common(50))
fdist1.plot()
But with this approach you can't really assess how well the distribution fits.
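One rough way to assess the fit is to plot frequency against rank on log-log axes, where a Zipf-like corpus should fall close to a straight line, and to estimate the slope. The sketch below is not the method from the linked answer. It assumes a simple least-squares fit in log space, which is only a quick check and not a rigorous estimator. The helper names `rank_frequency`, `fit_zipf_exponent`, and `plot_rank_frequency` are hypothetical.

```python
from collections import Counter

import numpy as np


def rank_frequency(tokens):
    """Return (ranks, freqs), with freqs sorted from most to least common."""
    counts = Counter(tokens)
    freqs = np.array(sorted(counts.values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1, dtype=float)
    return ranks, freqs


def fit_zipf_exponent(ranks, freqs):
    """Fit log(freq) = log(c) - s * log(rank) by least squares; return (s, c)."""
    slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return -slope, np.exp(intercept)


def plot_rank_frequency(ranks, freqs, s, c):
    """Plot observed frequencies and the fitted power law on log-log axes."""
    # Imported here so the two helpers above only need NumPy
    import matplotlib.pyplot as plt

    plt.loglog(ranks, freqs, marker=".", linestyle="none", label="corpus")
    plt.loglog(ranks, c * ranks ** (-s), color="r", label=f"fit, s = {s:.2f}")
    plt.xlabel("rank")
    plt.ylabel("frequency")
    plt.legend()
    plt.show()
```

With the `tokens` list from the snippet above, this could be used as `ranks, freqs = rank_frequency(tokens)`, then `s, c = fit_zipf_exponent(ranks, freqs)`, then `plot_rank_frequency(ranks, freqs, s, c)`. An exponent `s` near 1 would be in line with the classic form of Zipf's law.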