Setting sklearn's CountVectorizer vocabulary to a dictionary of phrases

Time: 2017-04-18 18:59:49

Tags: python scikit-learn nlp

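Hello, I have been using scikit-learn for text analysis, and I would like to use CountVectorizer to detect whether a document contains any of a set of keywords and phrases.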

I know that we can do this:

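import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

words = ['cat', 'dog', 'walking']
example = ['I was walking my dog and cat in the park']
vect = CountVectorizer(vocabulary=words)
dtm = vect.fit_transform(example)
# On scikit-learn 1.0 and later, use vect.get_feature_names_out() instead
>>> pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())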

...

   cat  dog  walking
    1    1        1

I am wondering whether this can be adjusted so that I can use multi-word phrases rather than only single words.

Adapting the example above, this is what I have in mind, together with the output I would like to get:

phrases = ['cat in the park', 'walking my dog']
example = ['I was walking my dog and cat in the park']
vect = CountVectorizer(vocabulary=phrases)
dtm = vect.fit_transform(example)
>>> pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names()) 
... 

       cat in the park   walking my dog
            1                   1

Right now the code using phrases only outputs:

cat in the park   walking my dog
     0                   0

Thanks in advance!

1 Answer:

Answer 0: (score: 2)

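Try this. By default CountVectorizer only extracts single words (ngram_range=(1, 1)), so set ngram_range to cover the word counts of your phrases: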

In [104]: lens = [len(x.split()) for x in phrases]

In [105]: mn, mx = min(lens), max(lens)

In [106]: vect = CountVectorizer(vocabulary=phrases, ngram_range=(mn, mx))

In [107]: dtm = vect.fit_transform(example)

In [108]: pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())
Out[108]:
   cat in the park  walking my dog
0                1               1

In [109]: print(mn, mx)
3 4
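
To make the reason for the zeros visible, here is a minimal, self-contained sketch that compares the tokens produced by the default analyzer with those produced once ngram_range spans the phrase lengths. It reuses the phrases and example from the question. The variable names (unigram_tokens, ngram_tokens, result) are only illustrative, and the exact behavior may vary slightly between scikit-learn versions.

```python
from sklearn.feature_extraction.text import CountVectorizer

phrases = ['cat in the park', 'walking my dog']
example = ['I was walking my dog and cat in the park']

# Default ngram_range=(1, 1): the analyzer emits single words only,
# so a multi-word vocabulary entry can never be matched.
unigram_tokens = CountVectorizer().build_analyzer()(example[0])

# Widen ngram_range to span the shortest and longest phrase (in words).
lens = [len(p.split()) for p in phrases]
vect = CountVectorizer(vocabulary=phrases, ngram_range=(min(lens), max(lens)))
ngram_tokens = vect.build_analyzer()(example[0])

# With a vocabulary given as a list, columns follow the list order,
# so the counts can be paired with the phrases directly.
counts = vect.fit_transform(example).toarray()[0]
result = dict(zip(phrases, counts.tolist()))

print(unigram_tokens)
print(result)
```

Two details are worth keeping in mind with this approach. The default analyzer lowercases the text, so the phrases in the vocabulary should be lowercase as well. The default token_pattern also drops one-character tokens (such as "I" above), so a phrase containing a single-letter word would not match unless token_pattern is changed.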