How to find the most common words describing each category

Time: 2019-03-05 18:54:31

Tags: python countvectorizer

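I have two columns: one holds animals, the other their descriptions. I want to find, in Python, the most common words associated with each animal. I would also like to apply a few settings: English stop words, bigrams, and trigrams. Ideally I would get the top 20 words or phrases per animal.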

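import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

dataset = pd.read_sql(q, dlconn)  # q (query) and dlconn (connection) are defined elsewhere
x = dataset['Animal']
y = dataset['Description']
# esw is the stop-word list (its definition is not shown)
count_vect = CountVectorizer(stop_words=esw, ngram_range=(1, 3))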

1 answer:

Answer 0 (score: 0)

To find the most common word for each animal, you can use collections.Counter:

from collections import Counter

import pandas as pd

df = pd.DataFrame({
    'Animal': ['dog', 'dog', 'cat', 'cat', 'cat', 'rabbit'],
    'Description': ['woof hairy', 'hairy big', 'meow', 'meow', 'meow whiskers', 'carrot']
})

most_common = {}
# Iterate over each animal together with the Series of its descriptions.
for animal, descriptions in df.groupby('Animal')['Description']:
    # Count every space-separated word across all of this animal's descriptions.
    counts = Counter(word for phrase in descriptions.tolist() for word in phrase.split(' '))
    # Keep the single word with the highest count.
    most_common[animal] = max(counts, key=counts.get)
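The loop above keeps only one word per animal, while the question asks for the top 20 words or phrases, with stop words removed and bigrams and trigrams included. One way to extend the same Counter idea is sketched below. It uses only the standard library; the helper name top_ngrams, the tiny STOP_WORDS set, and the rows list are assumptions made for illustration, not part of the original answer. Stop words are dropped before the n-grams are built, so a phrase can span a removed word.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word set; swap in a fuller list as needed.
STOP_WORDS = {"a", "an", "the", "and", "is", "of"}


def top_ngrams(descriptions, stop_words=STOP_WORDS, ngram_range=(1, 3), top_n=20):
    """Return the top_n most frequent n-grams as (ngram, count) pairs."""
    lo, hi = ngram_range
    counts = Counter()
    for text in descriptions:
        # Lowercase, split on whitespace, and drop stop words before building n-grams.
        tokens = [t for t in text.lower().split() if t not in stop_words]
        for n in range(lo, hi + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts.most_common(top_n)


# Same toy data as in the answer, written as (animal, description) pairs.
rows = [
    ("dog", "woof hairy"), ("dog", "hairy big"),
    ("cat", "meow"), ("cat", "meow"), ("cat", "meow whiskers"),
    ("rabbit", "carrot"),
]

# Collect the descriptions belonging to each animal.
by_animal = {}
for animal, description in rows:
    by_animal.setdefault(animal, []).append(description)

# Map each animal to its ranked list of (ngram, count) pairs.
top_by_animal = {animal: top_ngrams(descs) for animal, descs in by_animal.items()}
```

With a DataFrame like the one in the answer, the same helper could presumably be applied per group, for example {animal: top_ngrams(grp.tolist()) for animal, grp in df.groupby('Animal')['Description']}.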