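When I run a word frequency count over my corpus, the results look wrong: the top entries are not the words I would expect to be most frequent, and their counts are only one or two. Some results also show things like "over\xe2" and "\xad". Can anyone help?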
def toptenwords(mycorpus):
    mywords = mycorpus.words()
    nocapitals = [word.lower() for word in mywords]
    filtered = [word for word in nocapitals if word not in stoplist]
    nopunctuation = [s.translate(None, string.punctuation) for s in filtered]
    wordcounter = {}
    for word in nopunctuation:
        if word in wordcounter:
            wordcounter[word] += 1
        else:
            wordcounter[word] = 1
    frequentwords = sorted(wordcounter.iteritems(), key = itemgetter(1), reverse = True)
    top10 = frequentwords[:10]
    woord1 = frequentwords[1]
    woord2 = frequentwords[2]
    woord3 = frequentwords[3]
    woord4 = frequentwords[4]
    woord5 = frequentwords[5]
    woord6 = frequentwords[6]
    woord7 = frequentwords[7]
    woord8 = frequentwords[8]
    woord9 = frequentwords[9]
    woord10 = frequentwords[10]
    print "De 10 meest frequente woorden zijn: ", woord1, ",", woord2, ",", woord3, ",", woord4, ",", woord5, ",", woord6, ",", woord7, ",", woord8, ",", woord9, "en", woord10
The code was originally written in Dutch; this is the untranslated version:
def toptienwoorden(mycorpus):
    woorden = mycorpus.words()
    zonderhoofdletters = [word.lower() for word in woorden]
    gefiltered = [word for word in zonderhoofdletters if word not in stoplijst]
    geenleestekens = [s.translate(None, string.punctuation) for s in gefiltered]
    woordteller = {}
    for word in geenleestekens:
        if word in woordteller:
            woordteller[word] += 1
        else:
            woordteller[word] = 1
    frequentewoorden = sorted(woordteller.iteritems(), key = itemgetter(1), reverse = True)
    top10 = frequentewoorden[:10]
    woord1 = frequentewoorden[1]
    woord2 = frequentewoorden[2]
    woord3 = frequentewoorden[3]
    woord4 = frequentewoorden[4]
    woord5 = frequentewoorden[5]
    woord6 = frequentewoorden[6]
    woord7 = frequentewoorden[7]
    woord8 = frequentewoorden[8]
    woord9 = frequentewoorden[9]
    woord10 = frequentewoorden[10]
    print "De 10 meest frequente woorden zijn: ", woord1, ",", woord2, ",", woord3, ",", woord4, ",", woord5, ",", woord6, ",", woord7, ",", woord8, ",", woord9, "en", woord10
Answer 0: (score: 1)
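Use collections.Counter. It is well suited to counting the frequency of (hashable) items, and it has a most_common method that returns the N most common items, so you don't have to write that logic yourself: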
import string
import collections

def topNwords(mywords, N = 10, stoplist = set(), filtered = set()):
    # mywords = mycorpus.words()
    nocapitals = [word.lower() for word in mywords]
    filtered = [word for word in nocapitals if word not in stoplist]
    # Python 2 form of str.translate: deletes every ASCII punctuation character
    nopunctuation = [s.translate(None, string.punctuation) for s in filtered]
    woordcounter = collections.Counter(nopunctuation)
    # most_common(N) returns (word, count) pairs, most frequent first
    top_ten = [word for word, freq in woordcounter.most_common(N)]
    return top_ten

top_ten = topNwords('This is a test. It is only a test. In case of a real emergency'.split(), N = 10)
print("De 10 meest frequente woorden zijn: {w}".format(w = ', '.join(top_ten)))
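The code above uses the Python 2 form s.translate(None, string.punctuation), which raises a TypeError on Python 3. In case it is useful, here is a minimal sketch of the same Counter approach for Python 3. The names top_n_words_py3 and _STRIP_PUNCT are made up for this illustration, and unlike the code above it strips punctuation before applying the stop list and returns the counts along with the words:

```python
import collections
import string

# Hypothetical helper: a translation table that deletes ASCII punctuation.
# str.maketrans('', '', chars) is the Python 3 way to delete characters.
_STRIP_PUNCT = str.maketrans('', '', string.punctuation)

def top_n_words_py3(words, n=10, stoplist=frozenset()):
    """Return the n most common (word, count) pairs.

    Words are lowercased and stripped of ASCII punctuation first;
    empty tokens and stop words are then dropped.
    """
    cleaned = (w.lower().translate(_STRIP_PUNCT) for w in words)
    kept = [w for w in cleaned if w and w not in stoplist]
    return collections.Counter(kept).most_common(n)

sample = 'This is a test. It is only a test. In case of a real emergency'.split()
print(top_n_words_py3(sample, n=3))
```

Stripping punctuation first means a token such as "test." is compared against the stop list as "test". Note that string.punctuation only covers ASCII punctuation, so characters such as curly quotes or soft hyphens in the corpus would still need separate handling.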