Finding the most common words in a set of sentences in Python

Time: 2019-07-09 17:05:44

Tags: python numpy-ndarray

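I have 5 sentences in an np.array and I want to find the n most common words across them. For example, if n is 3, I want the 3 most frequently used words. Here is an example: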

0    oh i am she cool though might off her a brownie lol
1    so trash wouldnt do colors better tweet
2    love monkey brownie as much as a tweet
3    monkey get this tweet around i think
4    saw a brownie to make me some monkey

If n is 3, I would like to print the following words: brownie, monkey, tweet. Is there a simple way to do something like this?

1 answer:

Answer 0: (score: 2)

You can do this with the help of CountVectorizer, as shown below:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

A = np.array(["oh i am she cool though might off her a brownie lol", 
              "so trash wouldnt do colors better tweet", 
              "love monkey brownie as much as a tweet",
              "monkey get this tweet around i think",
              "saw a brownie to make me some monkey" ])

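n = 3
# Build the document-term matrix. By default CountVectorizer lowercases the
# text and ignores single-character tokens such as "a" and "i".
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(A)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vocabulary = vectorizer.get_feature_names_out()
# Sum each column to get the total count per word, then take the n largest
ind = np.argsort(X.toarray().sum(axis=0))[-n:]

top_n_words = [vocabulary[a] for a in ind]

# These three words are tied at 3 occurrences each, so their order may vary
print(top_n_words)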
['tweet', 'monkey', 'brownie']
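
If you would rather not depend on scikit-learn, the same counting can be sketched with collections.Counter from the standard library. This is only an illustration, not part of the approach above: the function name most_common_words and its min_length parameter are made up for this example, and it assumes that words are separated by whitespace. The sentences are in a plain list here, but iterating over an np.array of strings should work the same way. min_length=2 imitates CountVectorizer's default of ignoring single-character tokens; without it, "a" also occurs 3 times and ties with the expected words.

```python
from collections import Counter

sentences = [
    "oh i am she cool though might off her a brownie lol",
    "so trash wouldnt do colors better tweet",
    "love monkey brownie as much as a tweet",
    "monkey get this tweet around i think",
    "saw a brownie to make me some monkey",
]


def most_common_words(sentences, n, min_length=2):
    """Return the n most frequent words as (word, count) pairs.

    Hypothetical helper: splits on whitespace and skips words shorter
    than min_length characters.
    """
    counts = Counter(
        word
        for sentence in sentences
        for word in sentence.lower().split()
        if len(word) >= min_length
    )
    # Sort by descending count, then alphabetically, so ties come out
    # in a deterministic order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


print(most_common_words(sentences, 3))
# [('brownie', 3), ('monkey', 3), ('tweet', 3)]
```

Sorting on (-count, word) is a deliberate choice here: all three top words occur exactly 3 times, so breaking ties alphabetically gives the same result on every run.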

Hope this helps!