Maybe it's a silly question, but I'm having trouble extracting the ten most frequent words from a corpus with Python. This is what I have so far. (By the way, I work with NLTK for reading a corpus with two subcategories, each containing 10 .txt files.)
import re
import string
from nltk.corpus import stopwords
stoplist = stopwords.words('dutch')
from collections import defaultdict
from operator import itemgetter

def toptenwords(mycorpus):
    words = mycorpus.words()
    no_capitals = set([word.lower() for word in words])
    filtered = [word for word in no_capitals if word not in stoplist]
    no_punct = [s.translate(None, string.punctuation) for s in filtered]
    wordcounter = {}
    for word in no_punct:
        if word in wordcounter:
            wordcounter[word] += 1
        else:
            wordcounter[word] = 1
    sorting = sorted(wordcounter.iteritems(), key = itemgetter, reverse = True)
    return sorting
If I print this function with my corpus, it gives me a list of all the words followed by '1'. It returns a dictionary, but all the values are 1. And I know that, for example, the word 'baby' occurs five or six times in my corpus... yet it still gives 'baby: 1'... So it doesn't work the way I want... Can somebody help me?
Answer 0 (score: 4)
If you're still using NLTK, try the FreqDist(samples) function: it first builds a frequency distribution from the given sample. Then call its most_common(n) method to get the n most common words in the sample, ordered by descending frequency. Something like:
from nltk.probability import FreqDist

fdist = FreqDist(no_punct)       # frequency distribution over the filtered corpus words
top_ten = fdist.most_common(10)  # the ten most frequent (word, count) pairs
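Tying this back to the function in the question, a minimal sketch could look like the following. It assumes NLTK 3 (where FreqDist supports most_common), drops the set() so duplicate words are kept, and uses str.strip to trim punctuation instead of the question's translate call; the name toptenwords_freqdist is only illustrative.

import string
from nltk.corpus import stopwords
from nltk.probability import FreqDist

stoplist = stopwords.words('dutch')

def toptenwords_freqdist(mycorpus):
    # lower-case every token but keep duplicates (no set()), so counts stay meaningful
    words = [w.lower() for w in mycorpus.words()]
    # trim leading/trailing punctuation and drop stopwords and empty strings
    cleaned = [w.strip(string.punctuation) for w in words]
    cleaned = [w for w in cleaned if w and w not in stoplist]
    # most_common(10) returns the ten most frequent (word, count) pairs
    return FreqDist(cleaned).most_common(10)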
Answer 1 (score: 3)
The pythonic way:
In [1]: from collections import Counter
In [2]: words = ['hello', 'hell', 'owl', 'hello', 'world', 'war', 'hello', 'war']
In [3]: counter_obj = Counter(words)
In [4]: counter_obj.most_common() #counter_obj.most_common(n=10)
Out[4]: [('hello', 3), ('war', 2), ('hell', 1), ('world', 1), ('owl', 1)]
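Applied to the corpus in the question, this could look roughly like the sketch below; it reuses the question's stoplist and mycorpus names and, again, deliberately avoids wrapping the words in a set(). The helper name toptenwords_counter is made up for illustration.

from collections import Counter

def toptenwords_counter(mycorpus):
    # lower-case all corpus tokens; duplicates must be kept for counting
    words = [w.lower() for w in mycorpus.words()]
    # count only words that are not Dutch stopwords, then take the top ten
    return Counter(w for w in words if w not in stoplist).most_common(10)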
Answer 2 (score: 2)
The problem is your use of set.

A set contains no duplicates, so when you build a set of the lower-cased words, each word is kept only once.

Let's say words is:

['banana', 'Banana', 'tomato', 'tomato', 'kiwi']

After your comprehension lowers all the cases:

['banana', 'banana', 'tomato', 'tomato', 'kiwi']

But then you do:

set(['banana', 'banana', 'tomato', 'tomato', 'kiwi'])

which returns:

['banana', 'tomato', 'kiwi']

From that point on, your counts are based on the no_capitals set, in which every word occurs exactly once. Don't create a set, and your program will probably work correctly.
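For illustration, here is a sketch of the question's function with only two changes: the set() is dropped so duplicates survive, and the sort key is corrected to itemgetter(1) (the question passes itemgetter without calling it). Everything else, including the Python 2 iteritems and translate calls, is left as in the question.

def toptenwords(mycorpus):
    words = mycorpus.words()
    no_capitals = [word.lower() for word in words]   # a list, not a set: duplicates are kept
    filtered = [word for word in no_capitals if word not in stoplist]
    no_punct = [s.translate(None, string.punctuation) for s in filtered]  # Python 2 str.translate
    wordcounter = {}
    for word in no_punct:
        if word in wordcounter:
            wordcounter[word] += 1
        else:
            wordcounter[word] = 1
    # sort by count, highest first; itemgetter(1) picks the count out of each (word, count) pair
    sorting = sorted(wordcounter.iteritems(), key=itemgetter(1), reverse=True)
    return sorting[:10]  # keep only the ten most frequent words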
Answer 3 (score: 1)
Here is a solution. It uses the sets discussed in the earlier answers.
def token_words(tokn=10, s1_orig='hello i must be going'):
    # tokn is the number of most common words.
    # s1_orig is the text blob that needs to be checked.
    # logic
    # - clean the text - remove punctuations.
    # - make everything lower case
    # - replace common machine read errors.
    # - create a dictionary with orig words and changed words.
    # - create a list of unique clean words
    # - read the "clean" text and count the number of clean words
    # - sort and print the results
    #print 'Number of tokens:', tokn
    # create a dictionary to map punctuation
    # to spaces.
    punct_dict = { ',':' ',
                   '-':' ',
                   '.':' ',
                   '\n':' ',
                   '\r':' '
                 }
    # dictionary for machine reading errors
    mach_dict = {'1':'I', '0':'O',
                 '6':'b', '8':'B' }
    # get rid of punctuations
    s1 = s1_orig
    for k,v in punct_dict.items():
        s1 = s1.replace(k,v)
    # create the original list of words.
    orig_list = set(s1.split())
    # for each word in the original list,
    # see if it has machine errors.
    # add error words to a dict.
    error_words = dict()
    for a_word in orig_list:
        a_w2 = a_word
        for k,v in mach_dict.items():
            a_w2 = a_w2.replace(k,v)
        # lower case the result.
        a_w2 = a_w2.lower()
        # add to error word dict.
        try:
            error_words[a_w2].append(a_word)
        except:
            error_words[a_w2] = [a_word]
    # get rid of machine errors in the full text.
    for k,v in mach_dict.items():
        s1 = s1.replace(k,v)
    # make everything lower case
    s1 = s1.lower()
    # split sentence into list.
    s1_list = s1.split()
    # consider only unique words
    s1_set = set(s1_list)
    # count the number of times
    # each word occurs in s1
    res_dict = dict()
    for a_word in s1_set:
        res_dict[a_word] = s1_list.count(a_word)
    # sort the result dictionary by values
    print '--------------'
    temp = 0
    for key, value in sorted(res_dict.iteritems(), reverse=True, key=lambda (k,v): (v,k)):
        if temp < tokn:
            # print results for token items
            # get all the words that made up the key
            final_key = ''
            for er in error_words[key]:
                final_key = final_key + er + '|'
            final_key = final_key[0:-1]
            print "%s@%s" % (final_key, value)
        else:
            pass
        temp = temp + 1
    # close the function and return
    return True

#-------------------------------------------------------------
# main
# read the inputs from command line
num_tokens = raw_input('Number of tokens desired: ')
raw_file = raw_input('File name: ')
# read the file
try:
    if num_tokens == '': num_tokens = 10
    n_t = int(num_tokens)
    raw_data = open(raw_file,'r').read()
    token_words(n_t, raw_data)
except:
    print 'Token or file error. Please try again.'