Question

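I have a one-dimensional array of words. For each word, I need to grab every sentence it appears in, where the sentences are defined in a separate one-dimensional array.

A simple working example using for loops: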

import numpy as np

sentences = np.array(['This is an apple tree', 'The cat is sleeping'])
words = np.array(['apple', 'dog', 'cat'])
matches = []

for word in words:
    for sentence in sentences:
        if word in sentence:
            matches.append([word, sentence])

print(matches)

How can I vectorize this operation? I tried np.where and np.select, but neither seems to let me do an in comparison:

# select example
conditions = [words in sentences]
choices = [words]
print(np.select(conditions, choices))

# where example
print(np.where(words in sentences))

Both yield:

ValueError: shape mismatch: objects cannot be broadcast to a single shape

Maybe I need to use np.all or np.any somehow?

Answer 1

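This question can be interpreted in two different ways, with slightly different solutions. Do you want to find substrings, or do you want to find matches at word boundaries?

Finding substrings

numpy.char provides some vectorized string-matching functions: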

>>> np.char.find(sentences[None,:], words[:,None])
array([[11, -1],
       [-1, -1],
       [-1,  4]])

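Like Python's own find method, this returns -1 when the substring is not found, and otherwise the index at which the substring starts. The [None,:] and [:,None] selectors simply reshape the arrays so that they broadcast against each other: each row of the result corresponds to a word and each column to a sentence.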
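To make the broadcasting step concrete, here is a small sketch that prints nothing but keeps the intermediate shapes in variables you can inspect. The names row, col, found, and mask are only illustrative and do not appear in the answer itself.

```python
import numpy as np

sentences = np.array(['This is an apple tree', 'The cat is sleeping'])
words = np.array(['apple', 'dog', 'cat'])

row = sentences[None, :]   # shape (1, 2): one row, one column per sentence
col = words[:, None]       # shape (3, 1): one row per word, one column

# Broadcasting (3, 1) against (1, 2) gives a (3, 2) grid:
# found[i, j] is the index of words[i] inside sentences[j], or -1.
found = np.char.find(row, col)

# A boolean grid is often easier to work with than raw indices.
mask = found != -1
```

The boolean mask is what np.where or nonzero would consume in a later step.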

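This dives deep into numpy esoterica, so YMMV. The documentation says of the functions in numpy.char:

All of them are based on the string methods in the Python standard library.

If that means they call Python functions internally, then they won't be very fast, but vectorization should still provide some speedup.

To fully answer your question, you can now call np.where and np.c_ on the output, like so: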

>>> r, c = np.where(np.char.find(sentences[None,:], words[:,None]) != -1)
>>> matches = np.c_[words[r], sentences[c]]
>>> matches
array([['apple', 'This is an apple tree'],
       ['cat', 'The cat is sleeping']], 
      dtype='<U21')

(Thanks to Divakar for the last suggestion.)
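For convenience, the substring approach can be wrapped in a small function and checked against the loop from the question. This is only a sketch; the function name substring_matches is made up for illustration.

```python
import numpy as np

def substring_matches(words, sentences):
    """Return an (n_matches, 2) array of [word, sentence] pairs.

    A pair is included whenever the word occurs anywhere in the
    sentence as a substring (hypothetical helper, not a numpy API).
    """
    found = np.char.find(sentences[None, :], words[:, None])
    r, c = np.where(found != -1)          # row = word index, col = sentence index
    return np.c_[words[r], sentences[c]]  # stack the two columns side by side

sentences = np.array(['This is an apple tree', 'The cat is sleeping'])
words = np.array(['apple', 'dog', 'cat'])

vectorized = substring_matches(words, sentences)

# Reference result using the nested loops from the question.
looped = [[str(w), str(s)] for w in words for s in sentences if w in s]
```

Because np.where walks the grid row by row, the pairs come out in the same order as the nested loops (words outer, sentences inner).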

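Finding exact word matches

If your goal is to match exact words rather than substrings, you are probably better off splitting the sentences into arrays of words. In natural language processing terms, this is called tokenization. The problem then is that sentences vary in length, so they don't fit neatly into a fixed-size array. Here is one way around that. First, generate a flat array of words (tokens) and a parallel array of sentence labels, where each label is the index of the sentence the token came from: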

>>> s_words = np.array([w for s in sentences for w in s.split()])
>>> s_labels = np.array([i for i, s in enumerate(sentences) for w in s.split()])

Then check for equality in a broadcasted way:

>>> r, c = (s_words[:,None] == words).nonzero()

Proceed as above, but use the sentence labels as indices into the original sentence array:

>>> #               _________< -- another layer of indirection
>>> np.c_[words[c], sentences[s_labels[r]]]
array([['apple', 'This is an apple tree'],
       ['cat', 'The cat is sleeping']], 
      dtype='<U21')
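The difference between the two interpretations shows up as soon as a search word is only part of a token. The sketch below wraps the steps above in a function (exact_word_matches is an illustrative name) and searches for 'app', which is a substring of 'apple' but not a word in either sentence.

```python
import numpy as np

def exact_word_matches(words, sentences):
    """Return [word, sentence] pairs where the word equals a whole token."""
    # Flatten all sentences into one token array plus a parallel
    # array recording which sentence each token came from.
    tokens = np.array([w for s in sentences for w in s.split()])
    labels = np.array([i for i, s in enumerate(sentences) for _ in s.split()])
    # r indexes tokens, c indexes words.
    r, c = (tokens[:, None] == words).nonzero()
    return np.c_[words[c], sentences[labels[r]]]

sentences = np.array(['This is an apple tree', 'The cat is sleeping'])
words = np.array(['app', 'cat'])

exact = exact_word_matches(words, sentences)

# For comparison: the substring test reports 'app' as present.
substr = np.char.find(sentences[None, :], words[:, None]) != -1
```

Note that if a word occurs twice in the same sentence, this sketch returns the pair twice; deduplicating is left out to keep it short.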

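For very long word lists and many sentences this will still be slow, because the broadcasted comparison builds a tokens-by-words grid, although it will be faster than the find approach above. There are tricks using searchsorted that speed up the search, but they need some extra logic to ensure that all matches are found. The answer here provides some guidance.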
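As a rough idea of what a searchsorted-based lookup might look like, here is a minimal sketch. It assumes the search words are unique, which sidesteps the extra logic mentioned above; it is not the approach from the linked answer, just one plausible variant.

```python
import numpy as np

sentences = np.array(['This is an apple tree', 'The cat is sleeping'])
words = np.array(['apple', 'dog', 'cat'])

tokens = np.array([w for s in sentences for w in s.split()])
labels = np.array([i for i, s in enumerate(sentences) for _ in s.split()])

# Assumption: words contains no duplicates.
sorted_words = np.sort(words)

# For each token, find where it would be inserted in the sorted words.
pos = np.searchsorted(sorted_words, tokens)
# Tokens that sort after every word get index len(sorted_words); clip them
# so the lookup below stays in bounds (they will simply fail the equality test).
pos = np.clip(pos, 0, len(sorted_words) - 1)

# A token is a match only if the word at its insertion point equals it.
hit = sorted_words[pos] == tokens

matches = np.c_[tokens[hit], sentences[labels[hit]]]
```

This replaces the tokens-by-words comparison grid with one binary search per token. The rows come out in token order rather than word order.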

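Finally, note that this just uses Python's split() method to "tokenize" the sentences. If you want real tokenization, you can use a tokenizer from a package such as nltk or spacy.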
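To see why split() is only a rough tokenizer, consider a sentence with punctuation. The sketch below uses a simple regular expression as a stand-in for a real tokenizer; it is an assumption made to keep the example dependency-free and is not how nltk or spacy tokenize.

```python
import re
import numpy as np

sentence = 'The cat, asleep.'
word = 'cat'

# split() only breaks on whitespace, so punctuation stays glued to words.
split_tokens = np.array(sentence.split())

# Stand-in tokenizer (assumption): keep runs of word characters only.
regex_tokens = np.array(re.findall(r"\w+", sentence))

in_split = bool((split_tokens == word).any())
in_regex = bool((regex_tokens == word).any())
```

With split(), the token is 'cat,' and the exact comparison misses it; with the regex stand-in, the token is 'cat' and the comparison succeeds.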
