Question

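I am using CountVectorizer to apply string matching on a large text dataset. What I want is to get the words that do not match any term in the resulting matrix. For example, if the resulting terms (features) after fitting are: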

{'hello world', 'world and', 'and stackoverflow', 'hello', 'world', 'stackoverflow', 'and'}

and then I transform this text:

"oh hello world and stackoverflow this is a great morning"

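I want to get the string oh this is a great morning, since it matches nothing in the features. Is there an efficient way to do this?

I tried using the inverse_transform method to get the features and remove them from the text, but I ran into many problems and the running time was long.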

Answer 1

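Transforming a piece of text with a fitted vocabulary returns a matrix containing the counts of the known vocabulary terms.

For example, if your input document is the one from your example: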

from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(ngram_range=(1, 2))

docs = ['hello world and stackoverflow']
vec.fit(docs)

Then the fitted vocabulary looks as follows:

In [522]: print(vec.vocabulary_)
{'hello': 2, 
 'world': 5, 
 'and': 0, 
 'stackoverflow': 4, 
 'hello world': 3, 
 'world and': 6, 
 'and stackoverflow': 1}

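This represents a token-to-index mapping. Transforming some new documents then returns a matrix with the counts of all known vocabulary tokens. Words that are not in the vocabulary are ignored!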

other_docs = ['hello stackoverflow', 
              'hello and hello', 
              'oh hello world and stackoverflow this is a great morning']

X = vec.transform(other_docs)

In [523]: print(X.A)
[[0 0 1 0 1 0 0]
[1 0 2 0 0 0 0]
[1 1 1 1 1 1 1]]

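Your vocabulary consists of 7 items, so the matrix X has 7 columns. We transformed 3 documents, so it is a 3x7 matrix. Each element is the number of times a particular term occurs in a document. For example, for the second document "hello and hello", there is a count of 2 in column 2 (0-indexed) and a count of 1 in column 0, which refer to "hello" and "and" respectively.

The inverse transform maps features (i.e. column indices) back to vocabulary items: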
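To make the column layout concrete, here is a minimal pure-Python sketch that rebuilds such a row by hand. It assumes the vocabulary mapping printed above and plain whitespace tokenization; `count_row` is a hypothetical helper for illustration, not part of scikit-learn.

```python
# Vocabulary copied from the fitted vectorizer output shown above.
vocabulary = {
    'hello': 2, 'world': 5, 'and': 0, 'stackoverflow': 4,
    'hello world': 3, 'world and': 6, 'and stackoverflow': 1,
}

def count_row(text, vocabulary):
    """Count the known unigrams and bigrams of `text`; unknown terms are skipped."""
    tokens = text.split()
    bigrams = [' '.join(pair) for pair in zip(tokens, tokens[1:])]
    row = [0] * len(vocabulary)
    for term in tokens + bigrams:
        index = vocabulary.get(term)
        if index is not None:  # out-of-vocabulary terms are ignored
            row[index] += 1
    return row

print(count_row('hello and hello', vocabulary))  # [1, 0, 2, 0, 0, 0, 0]
```

The bigrams 'hello and' and 'and hello' are not in the vocabulary, so they leave no trace in the row.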

In [534]: print(vec.inverse_transform([1, 2, 3, 4, 5, 6, 7]))
[array(['and', 'and stackoverflow', 'hello', 'hello world',
   'stackoverflow', 'world', 'world and'], dtype='<U17')]

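Note: the list passed here is interpreted as a single row of counts, not as a list of the vocabulary indices printed above. Every column with a nonzero count is mapped back to its term, which is why all seven terms are returned. Newer scikit-learn versions may require a 2-D input such as [[1, 2, 3, 4, 5, 6, 7]].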
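As a rough illustration of the idea behind the inverse mapping, the sketch below inverts the vocabulary and lists the terms whose count in a row is nonzero. It assumes the vocabulary printed above; `terms_for_row` is a hypothetical helper that only mimics the behaviour and is not scikit-learn's implementation.

```python
# Vocabulary copied from the fitted vectorizer output shown above.
vocabulary = {
    'hello': 2, 'world': 5, 'and': 0, 'stackoverflow': 4,
    'hello world': 3, 'world and': 6, 'and stackoverflow': 1,
}

# Invert the token -> index mapping into index -> token.
index_to_term = {index: term for term, index in vocabulary.items()}

def terms_for_row(row):
    """Return the vocabulary terms whose count in `row` is nonzero."""
    return [index_to_term[i] for i, count in enumerate(row) if count != 0]

print(terms_for_row([1, 0, 2, 0, 0, 0, 0]))  # ['and', 'hello']
```

Because only known terms have a column, this mapping can never recover the words that were ignored during the transform, which is why it does not help much with the original question.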

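Now to your actual question, which is identifying all out-of-vocabulary (OOV) items in a given input document. If you are only interested in unigrams, this is quite simple using sets: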

tokens = 'oh hello world and stackoverflow this is a great morning'.split()
In [542]: print(set(tokens) - set(vec.vocabulary_.keys()))
{'morning', 'a', 'is', 'this', 'oh', 'great'}
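One caveat: `str.split()` is not how CountVectorizer tokenizes. With default settings it lowercases the text and uses the token pattern `(?u)\b\w\w+\b`, which keeps only tokens of two or more word characters, so a single-letter word such as 'a' never reaches the vocabulary. The sketch below approximates that default with the standard `re` module; `tokenize` is a hypothetical helper, and a vectorizer configured with a different `token_pattern`, `stop_words` and so on would behave differently.

```python
import re

# Default token pattern of CountVectorizer: two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and extract tokens roughly as the default vectorizer would."""
    return TOKEN_PATTERN.findall(text.lower())

text = 'Oh hello world and stackoverflow, this is a great morning!'
print(text.split())    # keeps case, punctuation and the single letter 'a'
print(tokenize(text))  # lowercased, punctuation stripped, 'a' dropped
```

If the input can contain punctuation or mixed case, comparing tokens produced this way against the vocabulary avoids spurious OOV hits such as 'stackoverflow,'.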

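If you are also interested in bigrams (or any other n-gram with n > 1), things are slightly more complicated, because you first need to generate all bigrams from your input document (note that there are various ways to generate all n-grams from an input document; the following is just one of them):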

bigrams = list(map(lambda x: ' '.join(x), zip(tokens, tokens[1:])))
In [546]: print(bigrams)
['oh hello', 'hello world', 'world and', 'and stackoverflow', 'stackoverflow this', 'this is', 'is a', 'a great', 'great morning']

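That line looks fancy, but all it does is zip two lists together (the second list starting at the second item), which yields tuples such as ('oh', 'hello'). The map call then joins each tuple with a single space, turning ('oh', 'hello') into 'oh hello', and the resulting map iterator is converted into a list. Now you can build the union of unigrams and bigrams:

doc_vocab = set(tokens) | set(bigrams)
In [549]: print(doc_vocab)
{'and stackoverflow', 'hello', 'a', 'morning', 'hello world', 'great morning', 'world', 'stackoverflow', 'stackoverflow this', 'is', 'world and', 'oh hello', 'oh', 'this', 'is a', 'this is', 'and', 'a great', 'great'}

Now you can do the same as with the unigrams above to retrieve all OOV items, by taking the set difference doc_vocab - set(vec.vocabulary_.keys()). The result contains all unigrams and bigrams that are not in your vectorizer's vocabulary.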
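Putting the steps together, the following sketch generalizes the approach to any n-gram range and also rebuilds the leftover string the question asks for. It is only one possible way to do it and makes several assumptions: whitespace tokenization, the vocabulary hard-coded from the output above (with a fitted vectorizer you would pass `vec.vocabulary_` instead), and the hypothetical helpers `ngrams`, `oov_terms` and `unmatched_text`, none of which exist in scikit-learn.

```python
# Vocabulary copied from the fitted vectorizer output shown above.
vocabulary = {
    'hello': 2, 'world': 5, 'and': 0, 'stackoverflow': 4,
    'hello world': 3, 'world and': 6, 'and stackoverflow': 1,
}

def ngrams(tokens, n):
    """All contiguous n-grams of `tokens`, joined by single spaces."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def oov_terms(tokens, vocabulary, ngram_range=(1, 2)):
    """Set of the document's n-grams that are not in `vocabulary`."""
    low, high = ngram_range
    doc_terms = set()
    for n in range(low, high + 1):
        doc_terms.update(ngrams(tokens, n))
    return doc_terms - set(vocabulary)

def unmatched_text(tokens, vocabulary, ngram_range=(1, 2)):
    """Join, in original order, the tokens not covered by any known n-gram."""
    low, high = ngram_range
    covered = [False] * len(tokens)
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            if ' '.join(tokens[i:i + n]) in vocabulary:
                # Mark every token that takes part in a known n-gram.
                for j in range(i, i + n):
                    covered[j] = True
    return ' '.join(t for t, c in zip(tokens, covered) if not c)

tokens = 'oh hello world and stackoverflow this is a great morning'.split()
print(sorted(oov_terms(tokens, vocabulary)))
print(unmatched_text(tokens, vocabulary))  # oh this is a great morning
```

Both helpers make a single pass over the tokens per n-gram size and rely on constant-time dictionary lookups, so they should scale to large documents without calling inverse_transform at all.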
