Question

Can CountVectorizer be used to identify whether a set of words appears in a corpus, regardless of their order?

It can handle ordered phrases: How can I use sklearn CountVectorizer with mutliple strings?

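In my case, the words in a set never happen to sit right next to each other, so tokenizing the whole phrase and then looking for it in a text document returns zero matches.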

What I am dreaming of is:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

sentences = [ "The only cool Washington is DC", 
              "A cool city in Washington is Seattle",
              "Moses Lake is the dirtiest water in Washington" ]

listOfStrings = ["Washington DC",
                 "Washington Seattle",  
                 "Washington cool"]

vectorizer = CountVectorizer(vocabulary=listOfStrings)
bagowords = np.matrix(vectorizer.fit_transform(sentences).todense())
bagowords
matrix([[1, 0, 1],
        [0, 1, 1],
        [0, 0, 0]])

The real problem has many more words in between, so removing stop words would not be a workable solution here. Any advice would be great!

Answer 1

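As discussed in the comments, since you only want to find out whether certain words are present in a document, you need to change the vocabulary (listOfStrings) slightly.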

sentences = [ "The only cool Washington is DC", 
              "A cool city in Washington is Seattle",
              "Moses Lake is the dirtiest water in Washington" ]

from sklearn.feature_extraction.text import CountVectorizer
listOfStrings = ["washington", "dc", "seattle", "cool"]
vectorizer = CountVectorizer(vocabulary=listOfStrings,
                             binary=True)   

bagowords = vectorizer.fit_transform(sentences).toarray()

vectorizer.vocabulary
['washington', 'dc', 'seattle', 'cool']

bagowords
array([[1, 1, 0, 1],
       [1, 0, 1, 1],
       [1, 0, 0, 0]])

I added binary=True to CountVectorizer because you do not want actual counts, only a check for whether each word is present.
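
To make the effect of binary=True concrete, here is a small sketch comparing it with the default counting behaviour. The two documents and the names docs, vocab, counts and flags are made up for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents; "cool" is repeated on purpose.
docs = ["cool cool Washington", "Washington DC"]
vocab = ["washington", "dc", "cool"]

# Default behaviour: each cell holds the number of occurrences.
counts = CountVectorizer(vocabulary=vocab).fit_transform(docs).toarray()

# binary=True: every non-zero count is clipped to 1 (presence only).
flags = CountVectorizer(vocabulary=vocab, binary=True).fit_transform(docs).toarray()

print(counts)  # the "cool" column of the first row is 2
print(flags)   # the same cell is 1
```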

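The columns of bagowords follow the order of the vocabulary you supply (listOfStrings). So the first column indicates whether "washington" is present in the document, the second column whether "dc" is present, and so on.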
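
If what you ultimately need is one column per word set, as in the question, the per-word columns can be combined afterwards. This is only a minimal sketch, not something CountVectorizer does for you: it assumes each set is a space-separated string whose words all survive the default tokenizer, and the names word_groups, groups, col and group_hits are made up for this illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["The only cool Washington is DC",
             "A cool city in Washington is Seattle",
             "Moses Lake is the dirtiest water in Washington"]

# Word sets from the question; order inside a set does not matter.
word_groups = ["Washington DC", "Washington Seattle", "Washington cool"]

# Split each set into lowercase words and build a single-word vocabulary.
groups = [g.lower().split() for g in word_groups]
vocab = sorted({w for g in groups for w in g})

vectorizer = CountVectorizer(vocabulary=vocab, binary=True)
presence = vectorizer.fit_transform(sentences).toarray()

# Map each word to its column, then require every word of a set to be present.
col = {w: i for i, w in enumerate(vocab)}
group_hits = np.column_stack(
    [presence[:, [col[w] for w in g]].all(axis=1) for g in groups]
).astype(int)

print(group_hits)
# [[1 0 1]
#  [0 1 1]
#  [0 0 0]]
```

Each row is a sentence and each column a word set, which matches the matrix the question asks for.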

Of course, you need to be aware of the other CountVectorizer parameters that can affect this. For example:

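lowercase is True by default, which is why I used lowercase words in listOfStrings. With lowercase=False, "DC", "Dc" and "dc" are treated as separate words.
You should also look into the effect of the token_pattern parameter, which by default keeps only alphanumeric strings of length 2 or more. So if you want to detect single-letter words such as "a" or "I", you will need to change it.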
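
Both points can be checked with a quick experiment. The sketch below uses a made-up document, and the replacement pattern r"(?u)\b\w+\b" is just one possible choice that also accepts one-character tokens, not the only way to do it.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical document containing single-letter words and an upper-case token.
docs = ["I saw a cool DC"]

# lowercase=False: vocabulary entries must match the text's case exactly.
cased = CountVectorizer(vocabulary=["DC", "dc"], binary=True, lowercase=False)
cased_hits = cased.fit_transform(docs).toarray()
print(cased_hits)    # [[1 0]] -> "DC" found, "dc" not found

# Default token_pattern: one-character tokens such as "a" and "i" are dropped.
default = CountVectorizer(vocabulary=["a", "i", "cool"], binary=True)
default_hits = default.fit_transform(docs).toarray()
print(default_hits)  # [[0 0 1]]

# A pattern that accepts tokens of one or more word characters.
single = CountVectorizer(vocabulary=["a", "i", "cool"], binary=True,
                         token_pattern=r"(?u)\b\w+\b")
single_hits = single.fit_transform(docs).toarray()
print(single_hits)   # [[1 1 1]]
```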

Hope this helps. If anything is unclear, feel free to ask.

Using Sklearn's CountVectorizer to find multiple strings not in order
