Reading words from a file into a dictionary

Date: 2015-06-21 23:44:50

Tags: python dictionary

So for our assignment, my professor wants us to read a text file line by line, then word by word, and build a dictionary that counts how often each word appears. This is what I have so far:

wordcount = {}
with open('/Users/user/Desktop/Text.txt', 'r', encoding='utf-8') as f:
    for line in f:
        for word in line.split():
            line = line.lower()
            word = word.strip(string.punctuation + string.digits)
            if word:
                wordcount[word] = line.count(word)
    return wordcount

My dictionary ends up telling me how many times each word appears on a particular line, so even when a word occurs many times across the whole text, most entries are left at 1. How can I get my dictionary to count words across the entire text, not just one line?

4 Answers:

Answer 0 (score: 3):

The problem is that you are overwriting the count every time instead of accumulating it. The fix is simple:

import string

wordcount = {}
with open('/Users/user/Desktop/Text.txt', 'r', encoding='utf-8') as f:
    for line in f:
        line = line.lower()
        for word in line.split():
            word = word.strip(string.punctuation + string.digits)
            if word:
                if word in wordcount:
                    wordcount[word] += line.count(word)
                else:
                    wordcount[word] = line.count(word)

Answer 1 (score: 1):

The problem is in this line:

wordcount[word] = line.count(word)

Every time that line executes, whatever value wordcount[word] already holds is simply replaced by line.count(word), when what you want is to add to it. Try changing it to:

wordcount[word] = wordcount[word] + line.count(word)
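
Note that, as written, wordcount[word] raises a KeyError the first time a word is seen. A minimal self-contained sketch of a complete version uses dict.get to supply the missing 0; it runs on an in-memory sample string (a stand-in for the file) and increments once per token rather than re-counting the line, since adding line.count(word) once per occurrence overcounts words that appear more than once on a line:

```python
import string

# Sample text standing in for the file contents (hypothetical input).
text = "The cat saw the dog. The dog ran."

wordcount = {}
for line in text.splitlines():
    for word in line.lower().split():
        # Strip surrounding punctuation/digits, as in the original code.
        word = word.strip(string.punctuation + string.digits)
        if word:
            # dict.get returns 0 the first time a word is seen,
            # avoiding the KeyError a bare wordcount[word] would raise.
            wordcount[word] = wordcount.get(word, 0) + 1

print(wordcount)  # {'the': 3, 'cat': 1, 'saw': 1, 'dog': 2, 'ran': 1}
```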

Answer 2 (score: 1):

Here's how I would do it:

import string

wordcount = {}
with open('test.txt', 'r') as f:
    for line in f:
        line = line.lower()  # I suppose you want "boy" and "Boy" to be the same word
        for word in line.split():
            # what if your word has funky punctuation chars next to it?
            word = word.translate(str.maketrans('', '', string.punctuation))
            # if it's already in the dict, increase the count
            try:
                wordcount[word] += 1
            # if it's not, this is the first time we are adding it
            except KeyError:
                wordcount[word] = 1

print(wordcount)
Good luck!
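
The try/except pattern above can also be expressed with collections.defaultdict, which supplies the missing 0 automatically. A sketch on an in-memory sample string rather than a file:

```python
import string
from collections import defaultdict

# Sample text standing in for the file contents (hypothetical input).
text = "to be or not to be"

# defaultdict(int) returns 0 for missing keys, so += 1 just works
# without any try/except or membership test.
wordcount = defaultdict(int)
for line in text.splitlines():
    for word in line.lower().split():
        word = word.strip(string.punctuation)
        if word:
            wordcount[word] += 1

print(dict(wordcount))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```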

Answer 3 (score: 0):

In case you want to see another way to do it: it doesn't go line by line, word by word as your assignment requires, but you should know about the collections module, which can be extremely useful at times.

from collections import Counter

# instantiate a Counter
c = Counter()
with open('myfile.txt', 'r') as f:
    for line in f:
        # do all the cleaning you need here
        c.update(line.lower().split())

# get whatever statistics you want, for example:
c.most_common(10)
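
For a small file, the whole count can even be built in one shot by feeding all the tokens to Counter at once. A sketch on an in-memory sample string (a stand-in for reading the file), with the same punctuation stripping as the other answers:

```python
import string
from collections import Counter

# Sample text standing in for the file contents (hypothetical input).
text = "The dog saw the cat. The cat ran."

# Lowercase, split on whitespace, strip surrounding punctuation,
# then hand every cleaned token to Counter in a single call.
words = (w.strip(string.punctuation) for w in text.lower().split())
counts = Counter(w for w in words if w)

print(counts.most_common(2))  # [('the', 3), ('cat', 2)]
```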