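For our assignment, my professor wants us to read a text file line by line, then word by word, and build a dictionary that counts how often each word occurs. This is what I have so far: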
wordcount = {}
with open('/Users/user/Desktop/Text.txt', 'r', encoding='utf-8') as f:
    for line in f:
        for word in line.split():
            line = line.lower()
            word = word.strip(string.punctuation + string.digits)
            if word:
                wordcount[word] = line.count(word)
return wordcount
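My dictionary ends up telling me how many times each word occurs in one particular line, so most words are left at 1 even though some of them appear many times across the whole text. How can I make the dictionary count words over the entire text instead of a single line?
Answer 0 (score: 3)
The problem is that you overwrite the count every time instead of adding to it. The fix is simple: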
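wordcount = {}
with open('/Users/user/Desktop/Text.txt', 'r', encoding='utf-8') as f:
    for line in f:
        # Lowercase before splitting so the words and the line use the same case
        line = line.lower()
        for word in line.split():
            word = word.strip(string.punctuation + string.digits)
            if word:
                if word in wordcount:
                    # Seen before: add this line's count to the running total
                    wordcount[word] += line.count(word)
                else:
                    # First time: start the total for this word
                    wordcount[word] = line.count(word)
return wordcount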
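One caveat with counting through line.count(word): str.count matches substrings, so a short word such as "a" is also counted inside "cat", and a word that appears twice on a line gets its per-line count added once for each occurrence. A minimal sketch that adds 1 per token instead is shown below. The function name count_words and the sample lines are made up for illustration, and the function takes any iterable of lines (an open file works) rather than the path from the question.

```python
import string


def count_words(lines):
    """Count whole-word occurrences across all lines (hypothetical helper)."""
    wordcount = {}
    strip_chars = string.punctuation + string.digits
    for line in lines:
        # Lowercase before splitting so "The" and "the" share one key.
        for word in line.lower().split():
            word = word.strip(strip_chars)
            if word:
                # Add 1 per token instead of line.count(word).
                wordcount[word] = wordcount.get(word, 0) + 1
    return wordcount


# Made-up sample input standing in for the lines of a file.
sample = ["The cat saw a cat.", "A dog saw the cat!"]
print(count_words(sample))
```

Because an open file is itself an iterable of lines, count_words(f) inside the with block should work the same way.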
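Answer 1 (score: 1)
The problem is in this line:
wordcount[word] = line.count(word)
Every time this line runs, whatever value wordcount[word] already held is simply replaced by line.count(word), when what you want is to add to it. Try changing it to the following (.get makes a word's first occurrence start from 0 instead of raising KeyError):
wordcount[word] = wordcount.get(word, 0) + line.count(word)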
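Another way to handle a word's first appearance, where a plain wordcount[word] lookup raises KeyError, is collections.defaultdict, which creates missing keys with a default value. This is only a sketch and assumes defaultdict is acceptable for the assignment. The helper name accumulate_counts and the sample lines are hypothetical, and it adds 1 per word rather than using line.count.

```python
from collections import defaultdict


def accumulate_counts(lines):
    """Sum per-word counts over every line (hypothetical helper)."""
    wordcount = defaultdict(int)  # missing keys start at 0
    for line in lines:
        for word in line.lower().split():
            # No membership check needed: the first += creates the key.
            wordcount[word] += 1
    return dict(wordcount)  # hand back a plain dict


# Made-up sample input standing in for the lines of a file.
print(accumulate_counts(["to be or not", "to be"]))
```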
Answer 2 (score: 1)
This is how I would do it:
import string
wordcount = {}
# Translation table that deletes every punctuation character
table = str.maketrans("", "", string.punctuation)
with open('test.txt', 'r') as f:
    for line in f:
        line = line.lower()  # I suppose you want boy and Boy to be the same word
        for word in line.split():
            # what if your word has funky punctuation characters next to it?
            word = word.translate(table)
            if not word:
                continue  # the token was nothing but punctuation
            try:
                # if it's already in the dict, increase the number
                wordcount[word] += 1
            except KeyError:
                # if it's not, this is the first time we are adding it
                wordcount[word] = 1
print(wordcount)
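Good luck!
Answer 3 (score: 0)
Here is another way to do it, in case you want to see one. It is not strictly the line-by-line, word-by-word loop your assignment asks for, but you should know about the collections module, which can be very useful at times.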
from collections import Counter
# Create an empty Counter
c = Counter()
with open('myfile.txt', 'r') as f:
    for line in f:
        # Do all the cleaning you need here
        c.update(line.lower().split())
# Get all the statistics you want, for example the 10 most common words:
print(c.most_common(10))
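The "cleaning" step mentioned in the comment could, for instance, strip punctuation and digits the same way the question does. A rough sketch follows. clean_words is a hypothetical helper, and the in-memory lines stand in for the file.

```python
import string
from collections import Counter

STRIP_CHARS = string.punctuation + string.digits


def clean_words(line):
    """Yield lowercased words with surrounding punctuation and digits removed."""
    for raw in line.lower().split():
        word = raw.strip(STRIP_CHARS)
        if word:  # skip tokens that were only punctuation or digits
            yield word


counts = Counter()
# Made-up sample input; with a real file, iterate over the file object instead.
for line in ["Hello, world!", "hello again; world 42"]:
    counts.update(clean_words(line))

print(counts.most_common(2))
```

Counter.update accepts any iterable of hashable items, so passing the generator directly avoids building an intermediate list for each line.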