I'm working with some texts using Python and NLTK, and I'd like to compare the frequency distribution of parts of speech across different texts.
I can do this for a single text:
from nltk import *
X_tagged = pos_tag(word_tokenize(open('/Users/X.txt').read()))
X_fd = FreqDist([tag for word, tag in X_tagged])
X_fd.plot(cumulative=True, title='Part of Speech Distribution in Corpus X')
I've tried to add another text but haven't had much luck. I have the conditional frequency distribution example that compares the counts of three words across several texts, but I want the lines to represent four different texts, the y-axis to represent counts, and the x-axis to represent the different parts of speech. How can I compare texts Y and Z on the same plot?
Answer 0 (score: 3)
The FreqDist.plot() method is just a convenience. To get multiple frequency distributions into one figure, you will need to write the plotting logic yourself (using matplotlib). The source code of FreqDist's plot method may be a good starting point, and matplotlib also has a good tutorial and beginner's guide.
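For example, a minimal sketch of that idea, assuming each text has already been read, tokenised and POS tagged as in the question (the names and file paths below are placeholders):

import matplotlib.pyplot as plt
from nltk import FreqDist, pos_tag, word_tokenize

# Hypothetical file paths; one FreqDist of POS tags per text, as in the question
fds = {name: FreqDist(tag for word, tag in pos_tag(word_tokenize(open(path).read())))
       for name, path in [('X', '/Users/X.txt'), ('Y', '/Users/Y.txt'), ('Z', '/Users/Z.txt')]}

tags = sorted(set().union(*fds.values()))  # shared x axis: every tag seen in any text
for name, fd in fds.items():
    plt.plot(range(len(tags)), [fd[t] for t in tags], label=name)  # one line per text

plt.xticks(range(len(tags)), tags, rotation=45)
plt.xlabel('Part of speech')
plt.ylabel('Count')
plt.legend()
plt.show()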
Answer 1 (score: 3)
In case anyone is interested, I figured it out. You need to build the individual frequency distributions and feed them into a dictionary whose keys are the keys common to all the FreqDists and whose values are tuples holding each FreqDist's count for that key; then you plot each FreqDist's values and set the keys as the x values, pulling them out in the same order.
import numpy as np
import matplotlib.pyplot as plt
from nltk import FreqDist

win = FreqDist([tag for word, tag in win])  # 'win', 'draw', 'lose' and 'mixed' are already POS tagged (lists of tuples ('the', 'DT'))
draw = FreqDist([tag for word, tag in draw])
lose = FreqDist([tag for word, tag in lose])
mixed = FreqDist([tag for word, tag in mixed])

POS = [item for item in win]  # list of common keys (POS tags)
results = {}
for key in POS:
    results[key] = tuple([win[key], draw[key], lose[key], mixed[key]])  # one key, tuple of values for each FreqDist (in order)

win_counts = [results[item][0] for item in results]
draw_counts = [results[item][1] for item in results]
lose_counts = [results[item][2] for item in results]
mixed_counts = [results[item][3] for item in results]
display = [item for item in results]  # over-cautious, same as POS above

plt.plot(win_counts, color='green', label="win")
plt.plot(draw_counts, color='blue', label="draw")
plt.plot(lose_counts, color='red', label="lose")
plt.plot(mixed_counts, color='turquoise', label="mixed")
plt.gca().grid(True)
plt.xticks(np.arange(0, len(display), 1), display, rotation=45)  # put the tag keys on the x axis
plt.xlabel("Parts of Speech")
plt.ylabel("Counts per 10,000 tweets")
plt.suptitle("Part of Speech Distribution across Pre-Win, Pre-Loss and Pre-Draw Corpora")
plt.legend(loc="upper right")
plt.show()
Answer 2 (score: 0)
Here is an example using matplotlib:
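A minimal sketch of one way to do this with matplotlib, assuming '/Users/Y.txt' and '/Users/Z.txt' are the two plain-text files to compare (the paths are placeholders):

import numpy as np
import matplotlib.pyplot as plt
from nltk import FreqDist, pos_tag, word_tokenize

# Placeholder paths for the two texts being compared
y_fd = FreqDist(tag for word, tag in pos_tag(word_tokenize(open('/Users/Y.txt').read())))
z_fd = FreqDist(tag for word, tag in pos_tag(word_tokenize(open('/Users/Z.txt').read())))

tags = sorted(set(y_fd) | set(z_fd))  # every tag seen in either text, as the shared x axis
x = np.arange(len(tags))

plt.bar(x - 0.2, [y_fd[t] for t in tags], width=0.4, label='Y')  # side-by-side bars per tag
plt.bar(x + 0.2, [z_fd[t] for t in tags], width=0.4, label='Z')
plt.xticks(x, tags, rotation=45)
plt.xlabel('Part of speech')
plt.ylabel('Count')
plt.legend()
plt.show()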