Join two NumPy arrays of unequal size and populate a third array based on a common column

Time: 2014-12-09 13:31:09

Tags: python arrays join numpy

I have two arrays of unequal size and dimensions:

a = [['50.561872473 25.047160868 0.0', '0']
['50.561905852 25.047537575 0.0', '1']
['50.562232967 25.048109789 0.0', '2']
['50.561940185 25.047914282 1.0', '5']]

b = [['50.561872473 25.047160868 0.0']
['50.561905852 25.047537575 0.0']
['50.561905852 25.047537575 0.0']
['50.561905852 25.047537575 0.0']
['50.562232967 25.048109789 0.0']
['50.562232967 25.048109789 0.0']
['50.561940185 25.047914282 1.0']
['50.561940185 25.047914282 1.0']
['50.561940185 25.047914282 1.0']]

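b contains the values from the first column of a, each possibly repeated several times. That column is the join key between the two arrays.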

In the desired output array, wherever the first column of a matches b, I want to add the second column of a, so that:

 c = [['50.561872473 25.047160868 0.0', '0']
 ['50.561905852 25.047537575 0.0', '1']
 ['50.561905852 25.047537575 0.0', '1']
 ['50.561905852 25.047537575 0.0', '1']
 ['50.562232967 25.048109789 0.0', '2']
 ['50.562232967 25.048109789 0.0', '2']
 ['50.561940185 25.047914282 1.0', '5']
 ['50.561940185 25.047914282 1.0', '5']
 ['50.561940185 25.047914282 1.0', '5']]

a and b each have millions of rows, and a Python for loop is too slow for this. I am hoping to do it more efficiently with NumPy methods.

3 answers:

Answer 0: (score: 1)

You can do this with pandas:

import numpy as np
import pandas as pd

a = [['50.561872473 25.047160868 0.0', '0'],
['50.561905852 25.047537575 0.0', '1'],
['50.562232967 25.048109789 0.0', '2'],
['50.561940185 25.047914282 1.0', '5']]

b = [['50.561872473 25.047160868 0.0'],
['50.561905852 25.047537575 0.0'],
['50.561905852 25.047537575 0.0'],
['50.561905852 25.047537575 0.0'],
['50.562232967 25.048109789 0.0'],
['50.562232967 25.048109789 0.0'],
['50.561940185 25.047914282 1.0'],
['50.561940185 25.047914282 1.0'],
['50.561940185 25.047914282 1.0']]

df_a = pd.DataFrame(a)
df_b = pd.DataFrame(b)

print(df_a.merge(df_b))

Output:

                               0  1
0  50.561872473 25.047160868 0.0  0
1  50.561905852 25.047537575 0.0  1
2  50.561905852 25.047537575 0.0  1
3  50.561905852 25.047537575 0.0  1
4  50.562232967 25.048109789 0.0  2
5  50.562232967 25.048109789 0.0  2
6  50.561940185 25.047914282 1.0  5
7  50.561940185 25.047914282 1.0  5
8  50.561940185 25.047914282 1.0  5
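
One detail worth checking in the merge above: a default (inner) merge called on df_a follows the key order of df_a, which happens to coincide with the order of b in this example. The following is a minimal sketch, assuming the goal is one output row per row of b in b's original order. The column names "key" and "label" and the short placeholder keys are illustrative and not part of the original data.

```python
import pandas as pd

# Lookup table: one row per unique key (column names are illustrative).
df_a = pd.DataFrame(
    [["p0", "0"], ["p1", "1"], ["p2", "2"], ["p3", "5"]],
    columns=["key", "label"],
)

# Rows to label, deliberately not in the same order as df_a.
df_b = pd.DataFrame({"key": ["p3", "p1", "p1", "p0", "p2", "p3"]})

# A left merge keeps every row of df_b, in df_b's order, and attaches the
# matching label from df_a (NaN where a key has no match).
# validate="many_to_one" raises if a key appears more than once in df_a.
df_c = df_b.merge(df_a, on="key", how="left", validate="many_to_one")
print(df_c)
```

If NumPy arrays are needed afterwards, df_c.to_numpy() converts the result back.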

Answer 1: (score: 1)

Whether this works for your specific case depends on some details, but it works for the simple example you have given.

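>>> import numpy
>>> a, b = numpy.array(a), numpy.array(b)  # must be NumPy arrays, not nested lists
>>> sorted_a = a[a.argsort(axis=0)[:,0]]
>>> insertion_points = numpy.searchsorted(sorted_a[:,0], b).ravel()
>>> sorted_a[insertion_points]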
array([['50.561872473 25.047160868 0.0', '0'],
       ['50.561905852 25.047537575 0.0', '1'],
       ['50.561905852 25.047537575 0.0', '1'],
       ['50.561905852 25.047537575 0.0', '1'],
       ['50.562232967 25.048109789 0.0', '2'],
       ['50.562232967 25.048109789 0.0', '2'],
       ['50.561940185 25.047914282 1.0', '5'],
       ['50.561940185 25.047914282 1.0', '5'],
       ['50.561940185 25.047914282 1.0', '5']], 
      dtype='<S29')

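This first sorts a. It then uses searchsorted to run a binary search over the sorted a and obtain the correct insertion index for each value in b. Assuming the values in the first column match exactly, the returned insertion indices have two nice properties. First, they point to the matching values in the sorted a. Second, they can be used as indices into the sorted a to build a new array with fancy indexing.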

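This makes it very easy to create the third array. However, it takes all of its data from a, not from b. If the values in a and b are not always equal, the solution has to be more complicated.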
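
For the case where some values in b have no counterpart in a, one possible extension is sketched below. It is not from the original answer: the helper name lookup_labels, the missing placeholder, and the short keys are assumptions made for illustration. It expects a non-empty two-column string array a and a one-dimensional array of keys (for the b in the question, b.ravel()).

```python
import numpy as np

def lookup_labels(a, b_keys, missing=""):
    """Return one label per entry of b_keys, looked up in the 2-column array a.

    Hypothetical helper: also returns a boolean mask marking real matches.
    """
    order = np.argsort(a[:, 0])      # sort a by its key column
    keys = a[order, 0]
    labels = a[order, 1]

    # Binary search, then clip so that keys larger than every key in a
    # still yield a valid index.
    idx = np.searchsorted(keys, b_keys)
    idx = np.clip(idx, 0, len(keys) - 1)

    # An insertion point is only a real match if the key found there is equal.
    found = keys[idx] == b_keys
    return np.where(found, labels[idx], missing), found

a = np.array([["k2", "2"], ["k0", "0"], ["k1", "1"]])
b_keys = np.array(["k1", "k1", "kX", "k0", "k2"])   # "kX" is not in a

labels, found = lookup_labels(a, b_keys)
print(labels)
print(found)
```

The equality check is what distinguishes a true match from a mere insertion position, so unmatched rows can be filtered or reported using the found mask.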

Answer 2: (score: 0)

import numpy as np

a = [['50.561872473 25.047160868 0.0', '0'],
     ['50.561905852 25.047537575 0.0', '1'],
     ['50.562232967 25.048109789 0.0', '2'],
     ['50.561940185 25.047914282 1.0', '5']]

b = [['50.561872473 25.047160868 0.0'],
     ['50.561905852 25.047537575 0.0'],
     ['50.561905852 25.047537575 0.0'],
     ['50.561905852 25.047537575 0.0'],
     ['50.562232967 25.048109789 0.0'],
     ['50.562232967 25.048109789 0.0'],
     ['50.561940185 25.047914282 1.0'],
     ['50.561940185 25.047914282 1.0'],
     ['50.561940185 25.047914282 1.0']]

a = np.array(a)
b = np.array(b)

Find where they match.

x = b == a[:,0]

>>> x
array([[ True, False, False, False],
       [False,  True, False, False],
       [False,  True, False, False],
       [False,  True, False, False],
       [False, False,  True, False],
       [False, False,  True, False],
       [False, False, False,  True],
       [False, False, False,  True],
       [False, False, False,  True]], dtype=bool)

Get the indices of the matches.

v = np.where(x)[1]

>>> v
array([0, 1, 1, 1, 2, 2, 3, 3, 3])

Use the indices to build the result from a.

s = a[v]

>>> s
array([['50.561872473 25.047160868 0.0', '0'],
       ['50.561905852 25.047537575 0.0', '1'],
       ['50.561905852 25.047537575 0.0', '1'],
       ['50.561905852 25.047537575 0.0', '1'],
       ['50.562232967 25.048109789 0.0', '2'],
       ['50.562232967 25.048109789 0.0', '2'],
       ['50.561940185 25.047914282 1.0', '5'],
       ['50.561940185 25.047914282 1.0', '5'],
       ['50.561940185 25.047914282 1.0', '5']], 
      dtype='|S29')

If a contains duplicates, this may not produce what you want.
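
To make the duplicate issue concrete, here is a small sketch with placeholder keys (not the original data). It also shows one possible workaround, assumed here to be acceptable: keep only the first match per row of b by using argmax on the boolean matrix. Note that the matrix has len(b) × len(a) entries, so with millions of rows in both arrays it is unlikely to fit in memory.

```python
import numpy as np

# "k1" appears twice in a (illustrative data).
a = np.array([["k0", "0"], ["k1", "1"], ["k1", "9"], ["k2", "2"]])
b = np.array([["k0"], ["k1"], ["k2"], ["k2"]])

x = b == a[:, 0]               # boolean matrix of shape (len(b), len(a))

# np.where reports every match, so the duplicated key yields an extra index
# and the result no longer has one row per row of b.
v_all = np.where(x)[1]

# argmax along each row gives the position of the first True,
# i.e. exactly one index per row of b.
v_first = x.argmax(axis=1)

# argmax returns 0 for rows with no True at all, so check for matches.
has_match = x.any(axis=1)

s = a[v_first]
print(v_all)
print(v_first)
print(s)
```

Rows of b with no match can be dropped or flagged using has_match before indexing into a.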