I'd like to know the best way to count words in a document. If I have my own corpus set up as "corp.txt", and I want to know how frequently "students, trust, ayre" occur in the "corp.txt" file, what can I use?
Would it be one of the following:
....
>>> full = nltk.Text(mycorpus.words('FullReport.txt'))
>>> fdist = FreqDist(full)
>>> fdist
<FreqDist with 34133 outcomes>
# HOW WOULD I CALCULATE HOW FREQUENTLY THE WORDS
# "students, trust, ayre" OCCUR IN full?
Thanks, Ray
Answer 0 (score: 4)
I would suggest looking into collections.Counter. Especially for large amounts of text, it gets the job done and is limited only by the available memory. On a machine with 12 GB of RAM it counted 3 billion tokens in a day and a half. Pseudocode (the variable words is really a reference to a file or something similar):
from collections import Counter

my_counter = Counter()
for line in words:                   # `words` is e.g. an open file handle
    my_counter.update(line.split())  # update() expects an iterable of tokens
Once it is done, the words end up in the dictionary my_counter, which can then be written to disk or stored elsewhere (sqlite, for example).
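A minimal runnable sketch of the Counter approach; the token list here is invented for illustration, standing in for tokens read from a file:

```python
from collections import Counter

# Toy token stream (assumption); in practice these would come from a file.
words = "students trust ayre students trust students".split()

my_counter = Counter(words)  # Counter can also consume a whole iterable at once
print(my_counter["students"])     # 3
print(my_counter.most_common(2))  # [('students', 3), ('trust', 2)]
```

Passing the whole iterable to the Counter constructor (or to update()) is both shorter and faster than incrementing one token at a time.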
Answer 1 (score: 3)
Most people would just use a defaultdict (with a default value of 0). Every time you see a word, just increment its count by one:
from collections import defaultdict

total = 0
count = defaultdict(lambda: 0)
for word in words:
    total += 1
    count[word] += 1
# Now you can just determine the frequency by dividing each count by total
for word, ct in count.items():
    print('Frequency of %s: %f%%' % (word, 100.0 * float(ct) / float(total)))
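The same loop run end to end on a small invented token list, to show the counts and percentages it produces:

```python
from collections import defaultdict

words = ["students", "trust", "students", "ayre"]  # toy token list (assumption)

total = 0
count = defaultdict(lambda: 0)
for word in words:
    total += 1
    count[word] += 1

# sorted() just makes the output order deterministic
for word, ct in sorted(count.items()):
    print('Frequency of %s: %f%%' % (word, 100.0 * ct / total))
```

With this input, "students" accounts for 2 of the 4 tokens, i.e. 50%.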
Answer 2 (score: 2)
You're almost there! You can index the FreqDist with the word you are interested in. Try the following:
print(fdist['students'])
print(fdist['ayre'])
print(fdist['full'])
This will give you the count, i.e. the number of occurrences, of each word. You said "frequency": frequency is not the same as the number of occurrences, and it can be obtained like this:
print(fdist.freq('students'))
print(fdist.freq('ayre'))
print(fdist.freq('full'))
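Since NLTK's FreqDist behaves like a Counter, the difference between a count and a relative frequency can be sketched without the NLTK dependency; FreqDist.freq() is the count divided by the total number of samples. The token list here is invented:

```python
from collections import Counter

tokens = "trust the students trust ayre".split()  # invented tokens
fdist = Counter(tokens)

count = fdist["trust"]              # number of occurrences: 2
freq = count / sum(fdist.values())  # relative frequency, as FreqDist.freq() computes it
print(count, freq)
```

Here "trust" occurs 2 times out of 5 tokens, so its frequency is 0.4.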
Answer 3 (score: 0)
You can read the file, tokenize it, and put the individual tokens into an NLTK FreqDist object; see http://nltk.googlecode.com/svn/trunk/doc/api/nltk.probability.FreqDist-class.html
from nltk.probability import FreqDist
from nltk import word_tokenize

# Creates a test file for reading.
doc = "this is a blah blah foo bar black sheep sentence. Blah blah!"
with open('test.txt', 'w') as fout:
    fout.write(doc)

# Reads the file into a FreqDist object.
fdist = FreqDist()
with open('test.txt', 'r') as fin:
    for word in word_tokenize(fin.read()):
        fdist[word] += 1  # fdist.inc(word) in NLTK 2; inc() was removed in NLTK 3

print("'blah' occurred", fdist['blah'], "times")
[OUT]:
'blah' occurred 3 times
Alternatively, you can get the same counts with the native Counter object from collections; see https://docs.python.org/2/library/collections.html. Note that the keys in a FreqDist or Counter object are case-sensitive, so you may also want to lowercase your tokens:
from collections import Counter
from nltk import word_tokenize

# Creates a test file for reading.
doc = "this is a blah blah foo bar black sheep sentence. Blah blah!"
with open('test.txt', 'w') as fout:
    fout.write(doc)

# Reads the file into a Counter object, lowercasing the tokens.
fdist = Counter()
with open('test.txt', 'r') as fin:
    fdist.update(word_tokenize(fin.read().lower()))

print("'blah' occurred", fdist['blah'], "times")
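To tie this back to the original question, here is a sketch that counts "students, trust, ayre" in a file. The file name and its contents are invented, and a simple split-and-strip tokenizer stands in for nltk.word_tokenize so the sketch has no NLTK dependency:

```python
from collections import Counter

# Stand-in for the asker's FullReport.txt (contents invented).
doc = "Students trust Ayre. The students trust the report."
with open('FullReport_sample.txt', 'w') as fout:
    fout.write(doc)

with open('FullReport_sample.txt') as fin:
    # Simple tokenizer standing in for nltk.word_tokenize: split on
    # whitespace, strip trailing punctuation, lowercase for case-folding.
    tokens = [t.strip('.,!?').lower() for t in fin.read().split()]

fdist = Counter(tokens)
for w in ('students', 'trust', 'ayre'):
    print(w, fdist[w])
```

Lowercasing folds "Students" and "students" together, which is usually what you want when asking how often a word occurs.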