Question

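I have two numpy arrays. The first, Z1, is about 300,000 rows long and 3 columns wide. The second, Z2, has about 200,000 rows and 300 columns. Every row of both Z1 and Z2 carries a 10-digit identification number. Z2 contains a subset of the items in Z1, and I want to match each row in Z2 with its partner in Z1 based on that 10-digit identification number, then take columns 2 and 3 from Z1 and append them to the end of the corresponding rows of Z2.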

Neither Z1 nor Z2 is in any particular order.

The only way I have come up with to do this is to iterate over the arrays, which takes hours. Is there a better way to do this in Python?

Thanks!

Answer 1

I understand from your question that the 10-digit identifier is stored in the first column (index 0), right?

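The following is not very easy to follow, since there is a lot of indirection going on, but in the end unsorted_insert holds, for every identifier in Z2, the row number in Z1 where that identifier is found: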

sort_idx = np.argsort(Z1[:, 0])
sorted_insert = np.searchsorted(Z1[:, 0], Z2[:, 0], sorter=sort_idx)
# The following is equivalent to unsorted_insert = sort_idx[sorted_insert] but faster
unsorted_insert = np.take(sort_idx, sorted_insert)
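
To make the indirection easier to follow, here is a minimal sketch on a toy input. The identifier values and the names z1_ids, z2_ids, sorted_pos and rows_in_z1 are made up for illustration and do not come from the question; the sketch also assumes every identifier in Z2 occurs exactly once in Z1.

```python
import numpy as np

# Hypothetical toy identifiers (stand-ins for column 0 of Z1 and Z2)
z1_ids = np.array([40, 10, 30, 20])
z2_ids = np.array([30, 10, 10, 40])

# Indices that would sort z1_ids: the sorted order is 10, 20, 30, 40
sort_idx = np.argsort(z1_ids)

# Position of each Z2 identifier within the *sorted* view of z1_ids
sorted_pos = np.searchsorted(z1_ids, z2_ids, sorter=sort_idx)

# Map the sorted positions back to row numbers in the original, unsorted z1_ids
rows_in_z1 = sort_idx[sorted_pos]

# Each selected Z1 row carries the same identifier as the matching Z2 row
assert np.array_equal(z1_ids[rows_in_z1], z2_ids)
```

The sorter argument lets np.searchsorted binary-search an unsorted array through its argsort, so Z1 itself never has to be reordered.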

So now all we need to do is fetch the last two columns of those rows and stack them onto the Z2 array:

new_Z2 = np.hstack((Z2, Z1[unsorted_insert, 1:]))

A simple example that runs without problems:

import numpy as np

z1_rows, z1_cols = 300000, 3
z2_rows, z2_cols = 200000, 300

z1 = np.arange(z1_rows*z1_cols).reshape(z1_rows, z1_cols)

z2 = np.random.randint(10000, size=(z2_rows, z2_cols))
z2[:, 0] = z1[np.random.randint(z1_rows, size=(z2_rows,)), 0]

sort_idx = np.argsort(z1[:, 0])
sorted_insert = np.searchsorted(z1[:, 0], z2[:, 0], sorter=sort_idx)
# The following is equivalent to unsorted_insert = sort_idx[sorted_insert] but faster
unsorted_insert = np.take(sort_idx, sorted_insert)
new_z2 = np.hstack((z2, z1[unsorted_insert, 1:]))

I haven't timed it, but the whole thing seems to complete in a couple of seconds.
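
The approach above relies on every identifier in Z2 actually being present in Z1, which the question states. If that might not hold, np.searchsorted still returns an insertion position, and the stacked columns would silently come from the wrong row. Below is a rough sketch of a guarded version; the helper name append_matched_columns and the choice to raise a ValueError are my own assumptions, not part of the original answer, and it assumes Z1 is non-empty with unique identifiers in column 0.

```python
import numpy as np

def append_matched_columns(z1, z2):
    """Hypothetical helper: append z1[:, 1:] to the rows of z2 whose
    column-0 identifier matches. Assumes identifiers in z1 are unique."""
    sort_idx = np.argsort(z1[:, 0])
    pos = np.searchsorted(z1[:, 0], z2[:, 0], sorter=sort_idx)
    # searchsorted returns len(z1) for identifiers larger than every z1 id;
    # clip so the lookup below stays in bounds
    pos = np.clip(pos, 0, len(z1) - 1)
    rows = sort_idx[pos]
    # A true match means the identifier found in z1 equals the one searched for
    found = z1[rows, 0] == z2[:, 0]
    if not found.all():
        raise ValueError(f"{(~found).sum()} identifiers in z2 have no match in z1")
    return np.hstack((z2, z1[rows, 1:]))

# Toy data for illustration only
z1_demo = np.array([[40, 1, 2], [10, 3, 4], [30, 5, 6]])
z2_demo = np.array([[30, 7], [10, 8]])
result = append_matched_columns(z1_demo, z2_demo)
```

Instead of raising, one could use the found mask to keep only the matched rows of Z2; which behaviour is right depends on the data.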
