Counting each word in a text file only once with Python

Time: 2012-09-19 23:59:37

Tags: python algorithm dictionary iteration defaultdict

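I have a small Python script that I am working on for a class assignment. The script reads a file and prints the 10 most frequent and the 10 least frequent words along with their frequencies. For this assignment, a word is defined as 2 or more letters. I have the word frequencies working just fine, but the third part of the assignment is to print the total number of unique words in the document. Unique words means counting every word in the document only once.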

Without changing my current script too much, how can I count every word in the document only once?

P.S. I am using Python 2.6, so please don't suggest collections.Counter.

from string import punctuation
from collections import defaultdict
import re

number = 10
words = {}
total_unique = 0
words_only = re.compile(r'^[a-z]{2,}$')
counter = defaultdict(int)


"""Define words as 2+ letters"""
def count_unique(s):
    count = 0
    if word in line:
        if len(word) >= 2:
            count += 1
    return count


"""Open text document, read it, strip it, then filter it"""
txt_file = open('charactermask.txt', 'r')

for line in txt_file:
    for word in line.strip().split():
        word = word.strip(punctuation).lower()
        if words_only.match(word):
            counter[word] += 1


# Most Frequent Words
top_words = sorted(counter.iteritems(),
                    key=lambda(word, count): (-count, word))[:number] 

print "Most Frequent Words: "

for word, frequency in top_words:
    print "%s: %d" % (word, frequency)


# Least Frequent Words:
least_words = sorted(counter.iteritems(),
                    key=lambda (word, count): (count, word))[:number]

print " "
print "Least Frequent Words: "

for word, frequency in least_words:
    print "%s: %d" % (word, frequency)


# Total Unique Words:
print " "
print "Total Number of Unique Words: %s " % total_unique

2 Answers:

Answer 0 (score: 2)

Count the number of keys in the counter dictionary:

total_unique = len(counter.keys())

Or, more simply:

total_unique = len(counter)
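
As a minimal sketch of why this works: the dictionary holds exactly one key per distinct word, so its length is the unique-word count. The `count_words` helper and the sample lines below are hypothetical stand-ins for the question's loop and for charactermask.txt; the code avoids Python 2-only syntax so that it should run on 2.6 as well as on newer versions.

```python
import re
from collections import defaultdict
from string import punctuation

# Hypothetical sample text standing in for the contents of charactermask.txt.
SAMPLE_LINES = [
    "The cat sat on the mat.",
    "A cat is a cat; the mat is flat!",
]

words_only = re.compile(r'^[a-z]{2,}$')  # same 2+ letter rule as the question


def count_words(lines):
    """Return a defaultdict mapping each qualifying word to its frequency."""
    counter = defaultdict(int)
    for line in lines:
        for word in line.strip().split():
            word = word.strip(punctuation).lower()
            if words_only.match(word):
                counter[word] += 1
    return counter


counter = count_words(SAMPLE_LINES)
total_unique = len(counter)  # one key per distinct word
print("Total Number of Unique Words: %d" % total_unique)
```

One caveat with a defaultdict: reading a missing key with `counter[word]` inserts it with a value of 0, which would inflate `len(counter)`. Testing membership with `word in counter` does not insert anything.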

Answer 1 (score: 2)

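defaultdict is great, but it may be more than you need here. You do need it for the part about the most frequent words. Without that requirement, though, using a defaultdict would be overkill. In that situation, I would suggest using a set instead: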

words = set()
for line in txt_file:
    for word in line.strip().split():
        word = word.strip(punctuation).lower()
        if words_only.match(word):
            words.add(word)
num_unique_words = len(words)

Now words contains only the unique words.

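I am only posting this because you say you are new to Python, so I want to make sure you are aware of sets as well. Again, for your purposes, a defaultdict works fine and is justified.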
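
One thing to watch if the set loop is added to the existing script: a file object can only be iterated once, so a second `for line in txt_file` placed after the counting loop would see no lines unless the file is reopened or rewound with `txt_file.seek(0)`. A rough sketch that sidesteps this by filling both containers in a single pass is shown below; the `scan` helper and the sample lines are assumptions made for illustration, not part of the original script.

```python
import re
from collections import defaultdict
from string import punctuation

words_only = re.compile(r'^[a-z]{2,}$')  # same 2+ letter rule as the question


def scan(lines):
    """Make one pass, filling a frequency dict and a set of distinct words."""
    counter = defaultdict(int)
    words = set()
    for line in lines:
        for word in line.strip().split():
            word = word.strip(punctuation).lower()
            if words_only.match(word):
                counter[word] += 1  # frequency, needed for top/bottom 10
                words.add(word)     # adding a duplicate has no effect
    return counter, words


# Hypothetical sample lines; an open file object could be passed instead.
sample_lines = ["To be, or not to be:", "that is the question."]
counter, words = scan(sample_lines)
num_unique_words = len(words)
print("Total Number of Unique Words: %d" % num_unique_words)
```

Both containers end up with the same set of words, so `len(words)` and `len(counter)` agree; the set is only worth keeping when the frequencies are not needed.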