NLTK FreqDist, plot normalized counts?

Asked: 2016-07-27 15:21:06

Tags: python nlp nltk normalization probability

In NLTK you can easily compute the counts of the words in a text, for example, by doing

from nltk.probability import FreqDist
fd = FreqDist([word for word in text.split()])

where text is a string. Now you can plot the distribution with

fd.plot()

This gives you a nice line plot with the count of each word. In the docs there is no mention of a way to plot the actual frequencies instead, which you can see with fd.freq(x).

Is there any straightforward way to plot the normalized counts, without converting the data into another data structure, normalizing, and plotting separately?
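(To make concrete what fd.freq(x) returns, here is a minimal sketch with a made-up toy sentence; freq is simply the count divided by fd.N():)

from nltk.probability import FreqDist

fd = FreqDist("the cat sat on the mat".split())
print(fd['the'])       # raw count: 2
print(fd.freq('the'))  # normalized count: 2 / 6 = 0.333...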

2 answers:

Answer 0 (score: 2)

You can update fd[word] with fd[word] / total:

from nltk.probability import FreqDist

text = "This is an example . This is test . example is for freq dist ."
fd = FreqDist([word for word in text.split()])

total = fd.N()                 # total number of sample outcomes in fd
for word in fd:
    fd[word] /= float(total)   # overwrite each raw count with its relative frequency

fd.plot()

Note: you will lose the original FreqDist values, since the counts are overwritten in place.
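If you want to keep the raw counts around as well, one variation (my own sketch, not part of the original answer) is to normalize a copy of the distribution instead of mutating it in place:

from nltk.probability import FreqDist

text = "This is an example . This is test . example is for freq dist ."
fd = FreqDist(text.split())

fd_norm = FreqDist(fd)      # copy the counts into a new FreqDist
total = float(fd.N())
for word in fd_norm:
    fd_norm[word] /= total  # normalize the copy only

fd_norm.plot()              # relative frequencies
fd.plot()                   # original raw counts are still intact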

Answer 1 (score: 0)

Forgive the lack of documentation. In nltk, a FreqDist gives you the raw counts (i.e. frequencies of words) in the text, while a ProbDist gives you the probabilities of the words given the text.

For more details, you will have to do some code reading: https://github.com/nltk/nltk/blob/develop/nltk/probability.py

The specific lines that do the normalization come from https://github.com/nltk/nltk/blob/develop/nltk/probability.py#L598

So to get a normalized ProbDist, you can do the following:

>>> from nltk.corpus import brown
>>> from nltk.probability import FreqDist
>>> from nltk.probability import DictionaryProbDist
>>> brown_freqdist = FreqDist(brown.words())
# Cast the frequency distribution into probabilities
>>> brown_probdist = DictionaryProbDist(brown_freqdist)
# Something strange in NLTK to note though
# When asking for probabilities in a ProbDist without
# normalization, it looks it returns the count instead...
>>> brown_freqdist['said']
1943
>>> brown_probdist.prob('said')
1943
>>> brown_probdist.logprob('said')
10.924070185585345
>>> brown_probdist = DictionaryProbDist(brown_freqdist, normalize=True)
>>> brown_probdist.logprob('said')
-9.223104921442907
>>> brown_probdist.prob('said')
0.0016732805599763002
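As a side note (my addition, assuming nltk.probability.MLEProbDist, whose prob() is documented to return freqdist.freq(sample)), you can get the same normalized probabilities without passing normalize=True:

>>> from nltk.probability import MLEProbDist
>>> brown_mle = MLEProbDist(brown_freqdist)
>>> brown_mle.prob('said')  # equals brown_freqdist.freq('said'), i.e. the normalized value above
0.0016732805599763002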