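I have a small Python script that I'm working on for a class homework assignment. The script reads a file and prints the 10 most frequent and the 10 least frequent words along with their frequencies. For this assignment, a word is defined as 2 letters or more. I have the word frequencies working just fine, but the third part of the assignment is to print the total number of unique words in the document. Unique words means counting every word in the document only once.
Without changing my current script much, how can I count all the words in the document only once?
P.S. I am using Python 2.6, so please don't suggest using collections.Counter.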
from string import punctuation
from collections import defaultdict
import re
number = 10
words = {}
total_unique = 0
words_only = re.compile(r'^[a-z]{2,}$')
counter = defaultdict(int)
"""Define words as 2+ letters"""
def count_unique(s):
    count = 0
    if word in line:
        if len(word) >= 2:
            count += 1
    return count
"""Open text document, read it, strip it, then filter it"""
txt_file = open('charactermask.txt', 'r')
for line in txt_file:
    for word in line.strip().split():
        word = word.strip(punctuation).lower()
        if words_only.match(word):
            counter[word] += 1
# Most Frequent Words
top_words = sorted(counter.iteritems(),
                   key=lambda (word, count): (-count, word))[:number]
print "Most Frequent Words: "
for word, frequency in top_words:
    print "%s: %d" % (word, frequency)
# Least Frequent Words:
least_words = sorted(counter.iteritems(),
                     key=lambda (word, count): (count, word))[:number]
print " "
print "Least Frequent Words: "
for word, frequency in least_words:
    print "%s: %d" % (word, frequency)
# Total Unique Words:
print " "
print "Total Number of Unique Words: %s " % total_unique
Answer 0 (score: 2)
Count the number of keys in the counter dictionary:
total_unique = len(counter.keys())
Or, more simply:
total_unique = len(counter)
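As a quick check of this, here is a minimal sketch that runs the question's filtering over a small in-memory sample instead of charactermask.txt. The sample_lines list is made up for illustration and is not the asker's data. Each distinct word ends up as exactly one key, so the length of the dictionary is the unique-word count.

```python
from collections import defaultdict
from string import punctuation
import re

# Same definition of a word as in the question: two or more lowercase letters.
words_only = re.compile(r'^[a-z]{2,}$')

# Hypothetical sample text standing in for the lines of charactermask.txt.
sample_lines = ["The cat saw the dog.", "A dog saw the cat!"]

counter = defaultdict(int)
for line in sample_lines:
    for word in line.strip().split():
        word = word.strip(punctuation).lower()
        if words_only.match(word):
            counter[word] += 1

# One key per distinct word, so the dictionary's length is the unique count.
total_unique = len(counter)
```

One thing to keep in mind: merely reading a missing key from a defaultdict (for example counter['zebra']) inserts that key, which would inflate the length. If you need to look words up before taking the count, test membership with 'zebra' in counter instead.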
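Answer 1 (score: 2)
defaultdict is great, but it may be more than you need here. You do need it for the part about the most frequent words. But without that part of the problem, using a defaultdict is overkill. In that case, I would suggest using a set instead: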
words = set()
for line in txt_file:
    for word in line.strip().split():
        word = word.strip(punctuation).lower()
        if words_only.match(word):
            words.add(word)
num_unique_words = len(words)
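Now words contains only the unique words.
I'm only posting this because you said you are new to Python, so I wanted to make sure you know about set as well. Again, for your purposes a defaultdict works fine and is justified.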
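To illustrate that the two approaches agree, here is a small sketch that fills a defaultdict and a set in the same loop. The sample_lines list is an assumption made up for the example, not the asker's file.

```python
from collections import defaultdict
from string import punctuation
import re

# Two or more lowercase letters, as in the question.
words_only = re.compile(r'^[a-z]{2,}$')

# Hypothetical sample text used only for this comparison.
sample_lines = ["It is what it is.", "What is it?"]

counter = defaultdict(int)  # frequency table, as in the question
words = set()               # unique words only, as in this answer

for line in sample_lines:
    for word in line.strip().split():
        word = word.strip(punctuation).lower()
        if words_only.match(word):
            counter[word] += 1
            words.add(word)

num_unique_words = len(words)
# The set and the dictionary's keys hold the same distinct words.
same_count = (num_unique_words == len(counter))
```

Since the script already builds counter for the frequency lists, keeping a separate set would only duplicate what the dictionary's keys already record.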