I have a sorted numpy array of unique strings:
import numpy as np
vocab = np.asarray(['a', 'aaa', 'b', 'c', 'd', 'e', 'f'])
I have another, unsorted array (in practice I have millions of them):
sentence = np.asarray(['b', 'aaa', 'b', 'aaa', 'b', 'z'])
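This second array is much smaller than the first and may also contain values that are not in the original array.
What I want to do is match each value in the second array to its index in the first array, returning nan or some special value for non-matches.
e.g.: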
sentence_idx = np.asarray([2, 1, 2, 1, 2, np.nan])
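I have tried a few different iterations of a matching function built on np.in1d, but it always seems to break down on sentences that contain repeated words.
I have also tried several different list comprehensions, but they are too slow to run over my collection of millions of sentences.
So, what is the best way to achieve this in numpy? In R I would use the match function, but there seems to be no numpy equivalent.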
Answer 0 (score: 3)
You can use np.searchsorted, a nifty tool for this kind of search, like so -
# Store matching indices of 'sentence' in 'vocab' when "left-searched"
out = np.searchsorted(vocab,sentence,'left').astype(float)
# Get matching indices of 'sentence' in 'vocab' when "right-searched".
# Now, the trick is that non-matches won't have any change between left
# and right searches. So, compare these two searches and look for the
# unchanged ones, which are the invalid ones and set them as NaNs.
right_idx = np.searchsorted(vocab,sentence,'right')
out[out == right_idx] = np.nan
Sample run -
In [17]: vocab = np.asarray(['a', 'aaa', 'b', 'c', 'd', 'e', 'f'])
    ...: sentence = np.asarray(['b', 'aaa', 'b', 'aaa', 'b', 'z'])
    ...:
In [18]: out = np.searchsorted(vocab,sentence,'left').astype(float)
    ...: right_idx = np.searchsorted(vocab,sentence,'right')
    ...: out[out == right_idx] = np.nan
    ...:
In [19]: out
Out[19]: array([ 2., 1., 2., 1., 2., nan])
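Since the question asks for something like R's match, here is a minimal sketch of a reusable helper. It is a variation on the approach above, not the same code: it does a single left search and then checks whether the vocab entry found at that position actually equals the searched value. The function name match and the nomatch parameter are hypothetical, chosen only for illustration, and the sketch assumes the table is sorted, unique and non-empty. Using an integer sentinel such as -1 instead of NaN keeps the result as an integer array.

```python
import numpy as np

def match(values, table, nomatch=-1):
    """Hypothetical helper: index of each value in a sorted, unique, non-empty table."""
    values = np.asarray(values)
    table = np.asarray(table)
    # Leftmost insertion point of each value in the sorted table.
    idx = np.searchsorted(table, values, side='left')
    # Values greater than every table entry get idx == len(table);
    # clip so that indexing into the table stays in bounds.
    clipped = np.minimum(idx, len(table) - 1)
    # It is a real match only if the table entry at that position equals the value.
    found = table[clipped] == values
    return np.where(found, clipped, nomatch)

vocab = np.asarray(['a', 'aaa', 'b', 'c', 'd', 'e', 'f'])
sentence = np.asarray(['b', 'aaa', 'b', 'aaa', 'b', 'z'])
sentence_idx = match(sentence, vocab)
print(sentence_idx)  # indices 2, 1, 2, 1, 2, then -1 for the unmatched 'z'
```

If NaN is preferred for non-matches, the result can be cast with .astype(float) before the sentinel entries are overwritten with np.nan.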