I'm trying to build a simple program that takes a text file and builds a dict() with the words as keys and the number of times each word occurs (its frequency) as the values. I've learned that collections.Counter can do this easily (among other approaches). My problem is that I want the dictionary sorted by frequency so I can print the Nth most common word. Finally, I also need a way to later associate a different kind of value with each word (a string holding the word's definition).
Basically I need something that outputs this:
Number of words: 5
[mostfrequentword: frequency, definition]
[2ndmostfrequentword: frequency, definition]
etc.
Here is what I have so far, but it only counts word frequencies; I don't know how to sort the dictionary by frequency and then print the Nth most common word:
wordlist = {}

def cleanedup(string):
    alphabet = 'abcdefghijklmnopqrstuvwxyz'
    cleantext = ''
    for character in string.lower():
        if character in alphabet:
            cleantext += character
        else:
            cleantext += ' '
    return cleantext

def text_crunch(textfile):
    for line in textfile:
        for word in cleanedup(line).split():
            if word in wordlist:
                wordlist[word] += 1
            else:
                wordlist[word] = 1

with open('DQ.txt') as doc:
    text_crunch(doc)

print(wordlist['todos'])
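Even without Counter, a plain dict like the one built above can be sorted by its values. A minimal sketch (the sample counts are illustrative):

```python
# Sort a word-count dict by frequency, highest first,
# and slice off the N most common entries.
wordcounts = {'the': 12, 'quixote': 7, 'todos': 3, 'windmill': 1}

top_n = sorted(wordcounts.items(), key=lambda kv: kv[1], reverse=True)[:3]
print(top_n)  # [('the', 12), ('quixote', 7), ('todos', 3)]
```

This is essentially what Counter.most_common(n) does for you, as the answer below shows.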
Answer 0 (score: 1)
A much simpler version of your code that does almost exactly what you want :)
import string
import collections

def cleanedup(fh):
    for line in fh:
        word = ''
        for character in line:
            if character in string.ascii_letters:
                word += character
            elif word:
                yield word
                word = ''

with open('DQ.txt') as doc:
    wordlist = collections.Counter(cleanedup(doc))
print(wordlist.most_common(5))
An alternative solution using regular expressions:
import re
import collections

def cleanedup(fh):
    for line in fh:
        for word in re.findall('[a-z]+', line.lower()):
            yield word

with open('DQ.txt') as doc:
    wordlist = collections.Counter(cleanedup(doc))
print(wordlist.most_common(5))
Or:
import re
import collections

def cleanedup(fh):
    for line in fh:
        # re.split can produce empty strings at line boundaries, so skip them
        for word in re.split('[^a-z]+', line.lower()):
            if word:
                yield word

with open('DQ.txt') as doc:
    wordlist = collections.Counter(cleanedup(doc))
print(wordlist.most_common(5))
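To get the requested "[word: frequency, definition]" output, the Counter can be paired with a separate dict of definitions. A sketch with hypothetical in-memory data standing in for the file contents and the definitions:

```python
import collections

# Placeholder text instead of reading DQ.txt.
text = "en un lugar de la mancha de cuyo nombre no quiero"
wordlist = collections.Counter(text.split())

# A second dict maps each word to its definition (placeholder strings);
# definitions can be added later without touching the counts.
definitions = {'de': 'of/from', 'en': 'in', 'un': 'a'}

print('Number of words:', len(wordlist))
for word, freq in wordlist.most_common(3):
    print('[{}: {}, {}]'.format(word, freq, definitions.get(word, 'no definition')))
```

len(wordlist) counts distinct words, and most_common(n) returns the n highest-frequency (word, count) pairs, which is exactly the "Nth most common" lookup the question asks for.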