I'd like to know the best way to count words in a document. If I have my own corpus set up as "corp.txt", and I want to know how frequently "students, trust, ayre" occur in the "corp.txt" file, what can I use?
Would it be one of the following:
....
>>> full = nltk.Text(mycorpus.words('FullReport.txt'))
>>> fdist = FreqDist(full)
>>> fdist
<FreqDist with 34133 outcomes>
# HOW WOULD I CALCULATE HOW FREQUENTLY THE WORDS
# "students, trust, ayre" OCCUR IN full?
Thanks, Ray
Answer 0 (score: 4)
I would suggest looking into collections.Counter. Especially for large amounts of text, it gets the job done and is limited only by the available memory. On a machine with 12 GB of RAM it counted 3 billion tokens in a day and a half. Pseudocode (the variable words is really a reference to a file or something similar):
from collections import Counter

my_counter = Counter()
for line in words:                   # `words` is e.g. an open file handle
    my_counter.update(line.split())  # update() expects an iterable of tokens
Once it is done, the words end up in the dictionary my_counter, which can then be written to disk or stored elsewhere (sqlite, for example).
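A minimal runnable sketch of the Counter approach; the token list here is invented for illustration, standing in for tokens read from a file:

```python
from collections import Counter

# Toy token stream (assumption); in practice these would come from a file.
words = "students trust ayre students trust students".split()

my_counter = Counter(words)  # Counter can also consume a whole iterable at once
print(my_counter["students"])     # 3
print(my_counter.most_common(2))  # [('students', 3), ('trust', 2)]
```

Passing the whole iterable to the Counter constructor (or to update()) is both shorter and faster than incrementing one token at a time.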
Answer 1 (score: 3)
Most people would just use a defaultdict (with a default value of 0). Every time you see a word, just increment its count by one:
from collections import defaultdict

total = 0
count = defaultdict(lambda: 0)
for word in words:
    total += 1
    count[word] += 1
# Now you can just determine the frequency by dividing each count by total
for word, ct in count.items():
    print('Frequency of %s: %f%%' % (word, 100.0 * float(ct) / float(total)))
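The same loop run end to end on a small invented token list, to show the counts and percentages it produces:

```python
from collections import defaultdict

words = ["students", "trust", "students", "ayre"]  # toy token list (assumption)

total = 0
count = defaultdict(lambda: 0)
for word in words:
    total += 1
    count[word] += 1

# sorted() just makes the output order deterministic
for word, ct in sorted(count.items()):
    print('Frequency of %s: %f%%' % (word, 100.0 * ct / total))
```

With this input, "students" accounts for 2 of the 4 tokens, i.e. 50%.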
Answer 2 (score: 2)
You're almost there! You can index the FreqDist with the word you are interested in. Try the following:
print(fdist['students'])
print(fdist['ayre'])
print(fdist['full'])
This will give you the count, i.e. the number of occurrences, of each word. You said "frequency": frequency is not the same as the number of occurrences, and it can be obtained like this:
print(fdist.freq('students'))
print(fdist.freq('ayre'))
print(fdist.freq('full'))
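Since NLTK's FreqDist behaves like a Counter, the difference between a count and a relative frequency can be sketched without the NLTK dependency; FreqDist.freq() is the count divided by the total number of samples. The token list here is invented:

```python
from collections import Counter

tokens = "trust the students trust ayre".split()  # invented tokens
fdist = Counter(tokens)

count = fdist["trust"]              # number of occurrences: 2
freq = count / sum(fdist.values())  # relative frequency, as FreqDist.freq() computes it
print(count, freq)
```

Here "trust" occurs 2 times out of 5 tokens, so its frequency is 0.4.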
Answer 3 (score: 0)
You can read the file, tokenize it, and put the individual tokens into an NLTK FreqDist object; see http://nltk.googlecode.com/svn/trunk/doc/api/nltk.probability.FreqDist-class.html
from nltk.probability import FreqDist
from nltk import word_tokenize

# Creates a test file for reading.
doc = "this is a blah blah foo bar black sheep sentence. Blah blah!"
with open('test.txt', 'w') as fout:
    fout.write(doc)

# Reads the file into a FreqDist object.
fdist = FreqDist()
with open('test.txt', 'r') as fin:
    for word in word_tokenize(fin.read()):
        fdist[word] += 1  # fdist.inc(word) in NLTK 2; inc() was removed in NLTK 3

print("'blah' occurred", fdist['blah'], "times")
[OUT]:
'blah' occurred 3 times
Alternatively, you can get the same counts with the native Counter object from collections; see https://docs.python.org/2/library/collections.html. Note that the keys in a FreqDist or Counter object are case-sensitive, so you may also want to lowercase your tokens:
from collections import Counter
from nltk import word_tokenize

# Creates a test file for reading.
doc = "this is a blah blah foo bar black sheep sentence. Blah blah!"
with open('test.txt', 'w') as fout:
    fout.write(doc)

# Reads the file into a Counter object, lowercasing the tokens.
fdist = Counter()
with open('test.txt', 'r') as fin:
    fdist.update(word_tokenize(fin.read().lower()))

print("'blah' occurred", fdist['blah'], "times")
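To tie this back to the original question, here is a sketch that counts "students, trust, ayre" in a file. The file name and its contents are invented, and a simple split-and-strip tokenizer stands in for nltk.word_tokenize so the sketch has no NLTK dependency:

```python
from collections import Counter

# Stand-in for the asker's FullReport.txt (contents invented).
doc = "Students trust Ayre. The students trust the report."
with open('FullReport_sample.txt', 'w') as fout:
    fout.write(doc)

with open('FullReport_sample.txt') as fin:
    # Simple tokenizer standing in for nltk.word_tokenize: split on
    # whitespace, strip trailing punctuation, lowercase for case-folding.
    tokens = [t.strip('.,!?').lower() for t in fin.read().split()]

fdist = Counter(tokens)
for w in ('students', 'trust', 'ayre'):
    print(w, fdist[w])
```

Lowercasing folds "Students" and "students" together, which is usually what you want when asking how often a word occurs.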