Question

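I have two arrays containing strings. For each string in the first array, I want to check whether it ends with each of the strings in the second array.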

Input:

strings = ['val1', 'val2', 'val3']
ends = ['1', '2', 'al1']

Desired output:

[[ True, False,  True],
 [False,  True, False],
 [False, False, False]]

Since val1 ends with both 1 and al1, entries (0,0) and (0,2) are True.

I have the following working code:

import numpy as np

strings = ['val1', 'val2', 'val3']
ends = ['1', '2', 'al1']

def buildFunction(ending):
    return lambda x: x.endswith(ending)

funcs = list(map(buildFunction, ends))

def end_function_vector(val):
    return np.vectorize(lambda f, x: f(x))(funcs, np.repeat(val, len(funcs)))

result = np.array(list(map(end_function_vector, strings)))

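This returns the desired output.

However, for large arrays (~10⁹ output elements), the map in the last line takes quite a long time, because np.vectorize and map are little more than wrappers around a for loop. Does anyone know a faster, vectorized approach?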
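For reference, the working code above should be equivalent to one str.endswith call per (string, ending) pair. The following is only a plain-Python sketch of that baseline (the name `reference` is made up for illustration), which can be handy for checking any faster version against:

```python
import numpy as np

strings = ['val1', 'val2', 'val3']
ends = ['1', '2', 'al1']

# Row i, column j answers: does strings[i] end with ends[j]?
# This is the same nested loop that map and np.vectorize hide.
reference = np.array([[s.endswith(e) for e in ends] for s in strings])
print(reference)
```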

Answer 1

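NumPy has an operation for exactly this on character arrays: numpy.core.defchararray.endswith() (also exposed under the public alias numpy.char.endswith).

The code below should speed things up, but it uses a lot of memory, because it creates two arrays of the same size as the output array: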

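import numpy as np

A = np.array(['val1', 'val2', 'val3'])
B = np.array(['1', '2', 'al1'])

# Tile both inputs to the full (len(A), len(B)) output shape
A_matrix = np.repeat(A[:, np.newaxis], len(B), axis=1)
B_matrix = np.repeat(B[:, np.newaxis], len(A), axis=1).transpose()

# Elementwise endswith on the two tiled arrays
result = np.core.defchararray.endswith(A_matrix, B_matrix)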

Update:
As Divakar pointed out, the code above can be condensed to:

A = np.array(['val1', 'val2', 'val3'])
B = np.array(['1', '2', 'al1'])

np.core.defchararray.endswith(A[:,None], B)
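If even the output matrix is too large to hold comfortably (at ~10⁹ elements it needs roughly one byte per entry), the broadcast call can be applied to a block of rows at a time. The following is only a sketch under that assumption: the helper name `endswith_blocks` and the `chunk_rows` parameter are hypothetical, and it uses the public alias np.char.endswith.

```python
import numpy as np

def endswith_blocks(strings, ends, chunk_rows=1024):
    """Yield (start_row, block) pairs, where block[i, j] is True
    when strings[start_row + i] ends with ends[j]."""
    strings = np.asarray(strings)
    ends = np.asarray(ends)
    for start in range(0, len(strings), chunk_rows):
        rows = strings[start:start + chunk_rows]
        # A (rows, 1) block broadcast against (n_ends,) gives (rows, n_ends).
        yield start, np.char.endswith(rows[:, None], ends)

A = np.array(['val1', 'val2', 'val3'])
B = np.array(['1', '2', 'al1'])

# Example reduction: count how many strings match each ending,
# without ever holding the full matrix in memory.
counts = np.zeros(len(B), dtype=np.int64)
for start, block in endswith_blocks(A, B, chunk_rows=2):
    counts += block.sum(axis=0)
print(counts)
```

Each block can be reduced or written out before the next one is computed, so peak memory is bounded by chunk_rows × len(ends) rather than by the full output size.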

Answer 2

Here's an almost* vectorized approach using NumPy broadcasting:

# strings and ends are NumPy arrays of str here (see the sample run)
# Get lengths of strings in each array
lens_strings = np.array(list(map(len,strings)))
lens_ends = np.array(list(map(len,ends)))

# Get the right most index of match, add the ends strings.
# The matching ones would cover the entire lengths of strings.
# So, do a final comparison against those lengths.
rfind = np.core.defchararray.rfind
idx = rfind(strings[:,None], ends)
# rfind returns -1 when there is no match, so also require idx >= 0;
# otherwise an ending one character longer than the string compares equal.
out = (idx >= 0) & (idx + lens_ends == lens_strings[:,None])

Sample run:

In [224]: strings = np.array(['val1', 'val2', 'val3', 'val1y', 'val341'])
     ...: ends = np.array(['1', '2', 'al1', 'l2'])
     ...:

In [225]: out
Out[225]:
array([[ True, False,  True, False],
       [False,  True, False,  True],
       [False, False, False, False],
       [False, False, False, False],
       [ True, False, False, False]], dtype=bool)

*Almost, because of the use of map. But since we only use it to get the string lengths of the input elements, its cost should be minimal compared to the other operations needed to solve our case.
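The remaining map calls could probably be dropped by computing the lengths with np.char.str_len. The following is a sketch of that variant, not the answer's original code: it uses the public np.char aliases, and the cross-check against np.char.endswith is added here purely as a sanity test.

```python
import numpy as np

strings = np.array(['val1', 'val2', 'val3', 'val1y', 'val341'])
ends = np.array(['1', '2', 'al1', 'l2'])

# Elementwise string lengths, no Python-level map needed
lens_strings = np.char.str_len(strings)
lens_ends = np.char.str_len(ends)

# Rightmost index of each ending in each string (-1 when absent)
idx = np.char.rfind(strings[:, None], ends)

# A match is a suffix only if it is found and runs to the end of the string
out = (idx >= 0) & (idx + lens_ends == lens_strings[:, None])

# Sanity check against the direct endswith formulation
assert np.array_equal(out, np.char.endswith(strings[:, None], ends))
print(out)
```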

numpy vectorized: check if strings in an array end with strings in another array
