Question

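I am trying to write Python code that counts the frequency of each word in a text file. The code should print one line per unique word, but the code I wrote prints repeated words more than once.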

import string

text = open('mary.txt','r')
textr = text.read()

for punc in string.punctuation:
    textr = textr.replace(punc, "")

wordlist = textr.split()

for word in wordlist:
    count = wordlist.count(word)
    print(word, ':', count)

My current output is:

are : 1
around : 1
as : 1
at : 2
at : 2
away : 1
back : 1
be : 2
be : 2
because : 1
below : 1
between : 1
both : 1
but : 1
by : 2
by : 2

The output should show at : 2, be : 2 and by : 2 only once each. What should I change in my code?

Answer 1

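The problem with your code is that you build a list of all the words and then loop over that list, so a word that occurs twice is visited, and printed, twice. What you want is a data structure that stores each unique word only once. A dict is a good way to do that, but it turns out Python has a specialized collection called Counter that is built for exactly this purpose.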

Try this (untested):

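from collections import Counter
import string

# Read the whole file; the with-block closes it afterwards
with open('mary.txt', 'r') as text:
    textr = text.read()

# Strip punctuation so that "at," and "at" count as the same word
for punc in string.punctuation:
    textr = textr.replace(punc, "")

# Counter maps each unique word to its number of occurrences
counts = Counter(textr.split())

for word, count in counts.items():
    print(word, ':', count)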
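
Since mary.txt itself is not shown, here is a small self-contained sketch of the same idea on an in-memory string. The sample text and the helper name count_words are made up for illustration, and using str.translate to strip punctuation is a swap for the replace loop above. Sorting the items is also an assumption, added to get the alphabetical order seen in the question's output.

```python
from collections import Counter
import string


def count_words(text):
    """Return a Counter mapping each unique word in text to its frequency."""
    # Hypothetical helper, not part of the original answer.
    # str.translate removes all punctuation characters in a single pass.
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    return Counter(cleaned.split())


# Made-up sample text standing in for the contents of mary.txt
sample = "be at home, or be at work; by and by."
counts = count_words(sample)

# sorted() orders the (word, count) pairs alphabetically by word
for word, count in sorted(counts.items()):
    print(word, ':', count)
```

If the most frequent words matter more than alphabetical order, counts.most_common() returns the same pairs ordered by descending count.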

Answer 2

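As another way to do this, you could keep your solution, add every entry to a set as a (word, count) tuple, and then print the set. You should probably rethink your implementation, as @smarx pointed out, but this would solve the problem using your existing code.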
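
A minimal sketch of that set-based idea, assuming Python 3. The helper name unique_word_counts and the sample text are made up for illustration; the punctuation stripping and the wordlist.count call are taken from the question's code.

```python
import string


def unique_word_counts(text):
    """Return a set of (word, count) tuples, one per unique word."""
    # Hypothetical helper; same punctuation stripping as in the question.
    for punc in string.punctuation:
        text = text.replace(punc, "")
    wordlist = text.split()

    entries = set()
    for word in wordlist:
        # Adding an identical tuple again has no effect on a set
        entries.add((word, wordlist.count(word)))
    return entries


# Made-up sample text standing in for the contents of mary.txt
entries = unique_word_counts("be at home, or be at work; by and by.")

# A set is unordered, so sort it before printing
for word, count in sorted(entries):
    print(word, ':', count)
```

This still calls wordlist.count once per word occurrence, so it rescans the whole list each time. That is fine for a small file, but it is the reason a dict or Counter is usually the better choice.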

Answer 3

You could try something like this:

import string

frequency = {}
# Read the whole file; the with-block closes it afterwards
with open('mary.txt', 'r') as text:
    textr = text.read()

for punc in string.punctuation:
    textr = textr.replace(punc, "")

wordlist = textr.split()

# Tally each word; get() returns 0 for a word not seen yet
for word in wordlist:
    count = frequency.get(word, 0)
    frequency[word] = count + 1

# Iterating over a dict yields each unique word exactly once
for word in frequency:
    print(word, ':', frequency[word])

Python - Display one line for each unique word
