Reading words from a file into a dictionary

Date: 2015-06-21 23:44:50

Tags: python dictionary

So for our assignment, my professor wants us to read a text file line by line, then word by word, and build a dictionary that counts how often each word appears. This is what I have so far:

wordcount = {}
with open('/Users/user/Desktop/Text.txt', 'r', encoding='utf-8') as f:
    for line in f:
        for word in line.split():
            line = line.lower()
            word = word.strip(string.punctuation + string.digits)
            if word:
                wordcount[word] = line.count(word)
    return wordcount

My dictionary ends up telling me how many times each word appears on a particular line, so even when a word occurs many times across the whole text, most entries are left at 1. How can I get my dictionary to count words across the entire text, not just one line?

4 Answers:

Answer 0 (score: 3):

The problem is that you are overwriting the count every time instead of accumulating it. The fix is simple:

import string

wordcount = {}
with open('/Users/user/Desktop/Text.txt', 'r', encoding='utf-8') as f:
    for line in f:
        line = line.lower()
        for word in line.split():
            word = word.strip(string.punctuation + string.digits)
            if word:
                if word in wordcount:
                    wordcount[word] += line.count(word)
                else:
                    wordcount[word] = line.count(word)

Answer 1 (score: 1):

The problem is in this line:

wordcount[word] = line.count(word)

Every time that line executes, whatever value wordcount[word] already holds is simply replaced by line.count(word), when what you want is to add to it. Try changing it to:

wordcount[word] = wordcount[word] + line.count(word)
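
Note that, as written, wordcount[word] raises a KeyError the first time a word is seen. A minimal self-contained sketch of a complete version uses dict.get to supply the missing 0; it runs on an in-memory sample string (a stand-in for the file) and increments once per token rather than re-counting the line, since adding line.count(word) once per occurrence overcounts words that appear more than once on a line:

```python
import string

# Sample text standing in for the file contents (hypothetical input).
text = "The cat saw the dog. The dog ran."

wordcount = {}
for line in text.splitlines():
    for word in line.lower().split():
        # Strip surrounding punctuation/digits, as in the original code.
        word = word.strip(string.punctuation + string.digits)
        if word:
            # dict.get returns 0 the first time a word is seen,
            # avoiding the KeyError a bare wordcount[word] would raise.
            wordcount[word] = wordcount.get(word, 0) + 1

print(wordcount)  # {'the': 3, 'cat': 1, 'saw': 1, 'dog': 2, 'ran': 1}
```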

Answer 2 (score: 1):

Here's how I would do it:

import string

wordcount = {}
with open('test.txt', 'r') as f:
    for line in f:
        line = line.lower()  # I suppose you want "boy" and "Boy" to be the same word
        for word in line.split():
            # what if your word has funky punctuation chars next to it?
            word = word.translate(str.maketrans('', '', string.punctuation))
            # if it's already in the dict, increase the count
            try:
                wordcount[word] += 1
            # if it's not, this is the first time we are adding it
            except KeyError:
                wordcount[word] = 1

print(wordcount)
Good luck!
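
The try/except pattern above can also be expressed with collections.defaultdict, which supplies the missing 0 automatically. A sketch on an in-memory sample string rather than a file:

```python
import string
from collections import defaultdict

# Sample text standing in for the file contents (hypothetical input).
text = "to be or not to be"

# defaultdict(int) returns 0 for missing keys, so += 1 just works
# without any try/except or membership test.
wordcount = defaultdict(int)
for line in text.splitlines():
    for word in line.lower().split():
        word = word.strip(string.punctuation)
        if word:
            wordcount[word] += 1

print(dict(wordcount))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```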

Answer 3 (score: 0):

In case you want to see another way to do it: it doesn't go line by line, word by word as your assignment requires, but you should know about the collections module, which can be extremely useful at times.

from collections import Counter

# instantiate a Counter
c = Counter()
with open('myfile.txt', 'r') as f:
    for line in f:
        # do all the cleaning you need here
        c.update(line.lower().split())

# get whatever statistics you want, for example:
c.most_common(10)
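
For a small file, the whole count can even be built in one shot by feeding all the tokens to Counter at once. A sketch on an in-memory sample string (a stand-in for reading the file), with the same punctuation stripping as the other answers:

```python
import string
from collections import Counter

# Sample text standing in for the file contents (hypothetical input).
text = "The dog saw the cat. The cat ran."

# Lowercase, split on whitespace, strip surrounding punctuation,
# then hand every cleaned token to Counter in a single call.
words = (w.strip(string.punctuation) for w in text.lower().split())
counts = Counter(w for w in words if w)

print(counts.most_common(2))  # [('the', 3), ('cat', 2)]
```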