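I have a small Python script that prints the 10 most frequent words in a text document (each word being 2 letters or more), and I need to extend the script to also print the 10 least frequent words in the document. My script mostly works, except that the 10 least frequent "words" it prints are numbers (integers and floats) when they should be words. How can I iterate over words only and exclude the numbers? Here is my full script: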
# Most Frequent Words:
from string import punctuation
from collections import defaultdict

number = 10
words = {}

with open("charactermask.txt") as txt_file:
    words = [x.strip(punctuation).lower() for x in txt_file.read().split()]

counter = defaultdict(int)

for word in words:
    if len(word) >= 2:
        counter[word] += 1

top_words = sorted(counter.iteritems(),
                   key=lambda(word, count): (-count, word))[:number]

for word, frequency in top_words:
    print "%s: %d" % (word, frequency)

# Least Frequent Words:
least_words = sorted(counter.iteritems(),
                     key=lambda (word, count): (count, word))[:number]

for word, frequency in least_words:
    print "%s: %d" % (word, frequency)
Edit: the end of the script (the part under the # Least Frequent Words comment) is the part that needs fixing.
Answer 0 (score: 1)
You'll need a function letters_only() that runs a regular expression matching [0-9] and returns False if it finds any match. Something like this:
import re

def letters_only(word):
    # True only when the word contains no digits
    return re.search(r'[0-9]', word) is None
Then, where you say for word in words, say for word in filter(letters_only, words) instead.
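For reference, here is a minimal self-contained sketch of this approach. It is written for Python 3, whereas the question's script is Python 2, and the token list is made up for illustration in place of the words read from the file. Note that in Python 3 filter returns a lazy iterator rather than a list, which is fine for a single loop like this one.

```python
import re
from collections import defaultdict


def letters_only(word):
    # True when the token contains no digit at all.
    return re.search(r'[0-9]', word) is None


# Hypothetical sample tokens standing in for the words read from the file.
words = ["the", "cat", "3.14", "the", "42", "mask", "a1b"]

counter = defaultdict(int)
for word in filter(letters_only, words):
    if len(word) >= 2:
        counter[word] += 1

print(sorted(counter.items()))
```

Any token containing a digit, including mixed ones such as a1b, is dropped before counting.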
Answer 1 (score: 1)
You'll need a filter. Change the regex to match however you want to define a "word":
import re
alphaonly = re.compile(r"^[a-z]{2,}$")
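To see what this pattern accepts, here is a small check on a few made-up tokens (Python 3 syntax; the tokens are illustrative and not taken from the original file). Because of the ^ and $ anchors and the {2,} quantifier, only tokens made entirely of two or more lowercase ASCII letters pass, so words containing an apostrophe or hyphen are rejected as well.

```python
import re

alphaonly = re.compile(r"^[a-z]{2,}$")

# Hypothetical tokens, already lowercased and stripped of outer punctuation.
samples = ["mask", "a", "42", "3.14", "abc123", "it's", "well-known"]

# Keep only the tokens the pattern treats as words.
accepted = [tok for tok in samples if alphaonly.match(tok)]
print(accepted)
```

If contractions or hyphenated words should count, the character class would need to be widened accordingly.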
Now, do you want the word frequency table to exclude numbers in the first place?
counter = defaultdict(int)
with open("charactermask.txt") as txt_file:
    for line in txt_file:
        for word in line.strip().split():
            word = word.strip(punctuation).lower()
            if alphaonly.match(word):
                counter[word] += 1
Or do you just want to skip the numbers when extracting the least frequent words from the table?
import sys

words_by_freq = sorted(counter.iteritems(),
                       key=lambda(word, count): (count, word))
i = 0
for word, frequency in words_by_freq:
    if alphaonly.match(word):
        i += 1
        sys.stdout.write("{}: {}\n".format(word, frequency))
        # Stop once the requested number of words has been printed
        if i == number: break
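As a rough Python 3 sketch of the first option (filtering while counting), the pieces above can be combined as follows. The least_frequent helper and the sample text are hypothetical, introduced only for illustration, and collections.Counter is swapped in for defaultdict(int).

```python
import re
from collections import Counter
from string import punctuation

alphaonly = re.compile(r"^[a-z]{2,}$")


def least_frequent(text, number=10):
    # Normalize each token, then count only those that look like words.
    tokens = (tok.strip(punctuation).lower() for tok in text.split())
    counter = Counter(tok for tok in tokens if alphaonly.match(tok))
    # Rarest first; ties broken alphabetically, as in the question's sort key.
    return sorted(counter.items(), key=lambda item: (item[1], item[0]))[:number]


# Hypothetical sample text in place of charactermask.txt.
sample = "The mask, the 42 masks; 3.14 is not a word. The end."
for word, frequency in least_frequent(sample, number=3):
    print("{}: {}".format(word, frequency))
```

Python 3 no longer allows tuple parameters in a lambda, so the sort key indexes into each (word, count) pair instead, and items() replaces iteritems().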