Question

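I have a small Python script I am working on for a class homework assignment. The script reads a file and prints the 10 most frequent and 10 least frequent words along with their frequencies. For this assignment, a word is defined as 2 letters or more. I have the word frequencies working just fine, but the third part of the assignment is to print the total number of unique words in the document. "Unique words" means counting every word in the document only once.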

Without changing too much of my current script, how can I count every word in the document only once?

P.S. I am using Python 2.6, so please don't suggest collections.Counter.

from string import punctuation
from collections import defaultdict
import re

number = 10
words = {}
total_unique = 0
words_only = re.compile(r'^[a-z]{2,}$')
counter = defaultdict(int)


"""Define words as 2+ letters"""
def count_unique(s):
    count = 0
    if word in line:
        if len(word) >= 2:
            count += 1
    return count


"""Open text document, read it, strip it, then filter it"""
txt_file = open('charactermask.txt', 'r')

for line in txt_file:
    for word in line.strip().split():
        word = word.strip(punctuation).lower()
        if words_only.match(word):
            counter[word] += 1


# Most Frequent Words
top_words = sorted(counter.iteritems(),
                    key=lambda (word, count): (-count, word))[:number]

print "Most Frequent Words: "

for word, frequency in top_words:
    print "%s: %d" % (word, frequency)


# Least Frequent Words:
least_words = sorted(counter.iteritems(),
                    key=lambda (word, count): (count, word))[:number]

print " "
print "Least Frequent Words: "

for word, frequency in least_words:
    print "%s: %d" % (word, frequency)


# Total Unique Words:
print " "
print "Total Number of Unique Words: %s " % total_unique

Answer 1

Count the number of keys in the counter dictionary:

total_unique = len(counter.keys())

Or, more simply:

total_unique = len(counter)
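
To illustrate why this works, here is a minimal, self-contained sketch. It reuses the filtering logic from the question, but the count_words helper and the sample_lines list are hypothetical stand-ins (the real script reads charactermask.txt). It is written to avoid anything newer than Python 2.6, though it has only been reasoned through, not run on that version. Every distinct word becomes exactly one key in the dictionary, so the number of keys is the number of unique words:

```python
from collections import defaultdict
from string import punctuation
import re

# Same word definition as the question: two or more lowercase letters.
words_only = re.compile(r'^[a-z]{2,}$')


def count_words(lines):
    """Return a defaultdict mapping each accepted word to its frequency."""
    counter = defaultdict(int)
    for line in lines:
        for word in line.strip().split():
            word = word.strip(punctuation).lower()
            if words_only.match(word):
                counter[word] += 1
    return counter


# Hypothetical sample text standing in for the lines of the real file.
sample_lines = ["The cat saw the dog.", "A dog saw the cat!"]
counter = count_words(sample_lines)

# One key per distinct word, so the dictionary's length is the unique count.
total_unique = len(counter)
print("Total Number of Unique Words: %d" % total_unique)
```

One thing to watch for: reading counter[some_word] for a word that was never seen inserts it with a count of 0, which would inflate the length. Use "some_word in counter" or counter.get(some_word) for lookups before taking len(counter).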

Answer 2

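defaultdict is great, but it might be more than you need. You do need it for the part about the most frequent words. But if it weren't for that part, using a defaultdict would be overkill. In that situation, I would suggest using a set instead: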

words = set()
for line in txt_file:
    for word in line.strip().split():
        word = word.strip(punctuation).lower()
        if words_only.match(word):
            words.add(word)
num_unique_words = len(words)

Now words contains only the unique words.

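I am only posting this because you say you are new to Python, so I want to make sure you are aware of sets as well. Again, for your purposes, a defaultdict works fine and is justified.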
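
A runnable version of the set approach might look like the sketch below. The unique_words helper and the sample_lines list are hypothetical names used only for illustration; in the real script the lines would come from the opened text file. The set() constructor is used rather than a set comprehension so the code should also be acceptable to Python 2.6.

```python
from string import punctuation
import re

# Same word definition as the question: two or more lowercase letters.
words_only = re.compile(r'^[a-z]{2,}$')


def unique_words(lines):
    """Return the set of distinct words (2+ letters) found in lines."""
    words = set()
    for line in lines:
        for word in line.strip().split():
            word = word.strip(punctuation).lower()
            if words_only.match(word):
                # Adding a word that is already present leaves the set unchanged.
                words.add(word)
    return words


# Hypothetical sample text standing in for the lines of the real file.
sample_lines = ["It is what it is.", "What is it?"]
words = unique_words(sample_lines)

num_unique_words = len(words)
print("Total Number of Unique Words: %d" % num_unique_words)
```

If the frequency dictionary from the question has already been built, set(counter) gives the same collection of words without a second pass over the file.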

Count each word in a text file only once using Python
