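I want to read a file and find the most frequently used words. My code is below. I assume I have made some mistake in reading the file. Any suggestions would be greatly appreciated.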
txt_file = open('result.txt', 'r')
for line in txt_file:
    for word in line.strip().split():
        word = word.strip(punctuation).lower()
        all_words = nltk.FreqDist(word for word in word.words())
        top_words = set(all_words.keys()[:300])
        print top_words
Input file result.txt:
Musik to shiyuki miyama opa samba japan obi Musik Musik Musik
Antiques antique 1900 s sewing pattern pictorial review size Musik 36 bust 1910 s ladies waist bust
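Answer 0 (score: 1)
I'm not sure what your error is, and I don't know how to use NLTK, but your approach of looping over the lines and then over the words can be adapted to use a plain Python dictionary to keep track of the counts: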
from string import punctuation

wordFreq = {}
with open("filename", "r") as txt_file:
    for line in txt_file:
        for word in line.strip().split():
            # Normalize: drop surrounding punctuation and lowercase
            word = word.strip(punctuation).lower()
            # If word is already in dict, increase count
            if word in wordFreq:
                wordFreq[word] += 1
            else:  # Otherwise, add word to dict and initialize count to 1
                wordFreq[word] = 1
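To look up a result, use the word you are interested in as the key. Because the words are lowercased before counting, that is wordFreq['musik'] rather than wordFreq['Musik'].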
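The dictionary only stores the counts, so to get the most frequent words out of it you still have to sort by count. Here is a minimal sketch of that step, assuming the two sample lines from the question stand in for the file. The names `count_words`, `top_n` and `sample_lines` are made up for illustration and are not part of the original answer:

```python
from string import punctuation


def count_words(lines):
    """Count normalized words with a plain dict."""
    word_freq = {}
    for line in lines:
        for word in line.split():
            word = word.strip(punctuation).lower()
            if not word:
                continue  # the token was pure punctuation
            word_freq[word] = word_freq.get(word, 0) + 1
    return word_freq


def top_n(word_freq, n):
    """Return the n (word, count) pairs with the highest counts."""
    # Sort by count, descending; break ties alphabetically so the
    # result does not depend on dict ordering.
    ranked = sorted(word_freq.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


# Stand-in for the lines of result.txt (copied from the question)
sample_lines = [
    "Musik to shiyuki miyama opa samba japan obi Musik Musik Musik",
    "Antiques antique 1900 s sewing pattern pictorial review size Musik 36 "
    "bust 1910 s ladies waist bust",
]

freq = count_words(sample_lines)
print(top_n(freq, 3))  # [('musik', 5), ('bust', 2), ('s', 2)]
```

To run it on a real file, you could pass the open file object to `count_words` instead of the list, since it only needs something that yields lines.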
Answer 1 (score: 1)
from collections import Counter

# Collect every whitespace-separated token in the file
with open('result.txt', 'r') as txt_file:
    words = [word for line in txt_file for word in line.strip().split()]
print(Counter(words).most_common(1))
Instead of the 1 passed to most_common, you can supply any number, and that many of the most common words will be shown. For example,
print(Counter(words).most_common(1))
results in
[('Musik', 5)]
whereas
print(Counter(words).most_common(5))
gives
[('Musik', 5), ('bust', 2), ('s', 2), ('antique', 1), ('ladies', 1)]
(the order among words with equal counts can differ between Python versions).
The number is actually an optional argument. If it is omitted, most_common returns the frequencies of all words in descending order.
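Note that the Counter version above counts the raw tokens, so 'Musik' and 'musik' would be treated as different words. Here is a rough sketch that combines the lowercasing and punctuation stripping from the question with Counter, and calls most_common() without an argument. It assumes the sample lines from the question in place of the file, and `sample_lines`, `counts` and `ranked` are just illustrative names:

```python
from collections import Counter
from string import punctuation

# Stand-in for the lines of result.txt (copied from the question)
sample_lines = [
    "Musik to shiyuki miyama opa samba japan obi Musik Musik Musik",
    "Antiques antique 1900 s sewing pattern pictorial review size Musik 36 "
    "bust 1910 s ladies waist bust",
]

# Lowercase each token and strip surrounding punctuation before counting
words = [
    word.strip(punctuation).lower()
    for line in sample_lines
    for word in line.split()
]
# Skip tokens that became empty after stripping
counts = Counter(word for word in words if word)

# No argument: every word with its count, highest count first
ranked = counts.most_common()
print(ranked[0])  # ('musik', 5)
```

This only normalizes case and punctuation. 'antiques' and 'antique' are still counted separately; merging those would need stemming or lemmatization, which is not shown here.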