I have two arrays of different sizes and shapes:
a = [['50.561872473 25.047160868 0.0', '0'],
     ['50.561905852 25.047537575 0.0', '1'],
     ['50.562232967 25.048109789 0.0', '2'],
     ['50.561940185 25.047914282 1.0', '5']]
b = [['50.561872473 25.047160868 0.0'],
     ['50.561905852 25.047537575 0.0'],
     ['50.561905852 25.047537575 0.0'],
     ['50.561905852 25.047537575 0.0'],
     ['50.562232967 25.048109789 0.0'],
     ['50.562232967 25.048109789 0.0'],
     ['50.561940185 25.047914282 1.0'],
     ['50.561940185 25.047914282 1.0'],
     ['50.561940185 25.047914282 1.0']]
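b contains multiple occurrences of the values in the first column of a. That column is the link between the two arrays. In the desired output array, wherever the first column of a matches b, I want to add the second column of a, so that: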
c = [['50.561872473 25.047160868 0.0', '0'],
     ['50.561905852 25.047537575 0.0', '1'],
     ['50.561905852 25.047537575 0.0', '1'],
     ['50.561905852 25.047537575 0.0', '1'],
     ['50.562232967 25.048109789 0.0', '2'],
     ['50.562232967 25.048109789 0.0', '2'],
     ['50.561940185 25.047914282 1.0', '5'],
     ['50.561940185 25.047914282 1.0', '5'],
     ['50.561940185 25.047914282 1.0', '5']]
a and b have millions of rows, and a Python for loop is too slow for this. So I'm hoping there is a more efficient way to do it with NumPy.
Answer 0 (score: 1)
You can do this with pandas:
import numpy as np
import pandas as pd
a = [['50.561872473 25.047160868 0.0', '0'],
['50.561905852 25.047537575 0.0', '1'],
['50.562232967 25.048109789 0.0', '2'],
['50.561940185 25.047914282 1.0', '5']]
b = [['50.561872473 25.047160868 0.0'],
['50.561905852 25.047537575 0.0'],
['50.561905852 25.047537575 0.0'],
['50.561905852 25.047537575 0.0'],
['50.562232967 25.048109789 0.0'],
['50.562232967 25.048109789 0.0'],
['50.561940185 25.047914282 1.0'],
['50.561940185 25.047914282 1.0'],
['50.561940185 25.047914282 1.0']]
df_a = pd.DataFrame(a)
df_b = pd.DataFrame(b)
print(df_a.merge(df_b))
Output:
0 1
0 50.561872473 25.047160868 0.0 0
1 50.561905852 25.047537575 0.0 1
2 50.561905852 25.047537575 0.0 1
3 50.561905852 25.047537575 0.0 1
4 50.562232967 25.048109789 0.0 2
5 50.562232967 25.048109789 0.0 2
6 50.561940185 25.047914282 1.0 5
7 50.561940185 25.047914282 1.0 5
8 50.561940185 25.047914282 1.0 5
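One thing to watch: the default merge is an inner join, which as far as I know follows the key order of the left frame (df_a here) and silently drops rows of b that have no match in a. If the result should follow the row order of b and keep unmatched rows, a left merge from b's side is one option. A minimal sketch, assuming the column labels key and val (hypothetical names, not from the question):

```python
import pandas as pd

a = [['50.561872473 25.047160868 0.0', '0'],
     ['50.561905852 25.047537575 0.0', '1'],
     ['50.562232967 25.048109789 0.0', '2'],
     ['50.561940185 25.047914282 1.0', '5']]
b = [['50.561872473 25.047160868 0.0'],
     ['50.561905852 25.047537575 0.0'],
     ['50.561905852 25.047537575 0.0'],
     ['50.561905852 25.047537575 0.0'],
     ['50.562232967 25.048109789 0.0'],
     ['50.562232967 25.048109789 0.0'],
     ['50.561940185 25.047914282 1.0'],
     ['50.561940185 25.047914282 1.0'],
     ['50.561940185 25.047914282 1.0']]

# 'key' and 'val' are illustrative column labels (assumption).
df_a = pd.DataFrame(a, columns=['key', 'val'])
df_b = pd.DataFrame(b, columns=['key'])

# Left merge from b: one output row per row of b, in b's order.
# Rows of b with no match in a would get NaN in 'val'.
c = df_b.merge(df_a, on='key', how='left')
print(c)
```

This assumes the keys in a are unique; duplicate keys in a would produce extra rows in the result.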
Answer 1 (score: 1)
Whether this works for your specific case depends on some details, but it works for the simple example you've given.
>>> sorted_a = a[a.argsort(axis=0)[:,0]]
>>> insertion_points = numpy.searchsorted(sorted_a[:,0], b).ravel()
>>> sorted_a[insertion_points]
array([['50.561872473 25.047160868 0.0', '0'],
['50.561905852 25.047537575 0.0', '1'],
['50.561905852 25.047537575 0.0', '1'],
['50.561905852 25.047537575 0.0', '1'],
['50.562232967 25.048109789 0.0', '2'],
['50.562232967 25.048109789 0.0', '2'],
['50.561940185 25.047914282 1.0', '5'],
['50.561940185 25.047914282 1.0', '5'],
['50.561940185 25.047914282 1.0', '5']],
dtype='<S29')
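This first sorts a by its first column. It then uses searchsorted to do a binary search in the sorted a and get the correct insertion index for each value in b. Assuming the first-column values are exactly equal, the returned insertion indices have two nice properties. First, they point to the matching values in a. Second, they can be used as indices into a to build a new array with fancy indexing.
That makes creating the third array very easy. However, it pulls all of its data from a, not from b. If the values in a and b are not always equal, the solution will have to be more complex.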
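For the case where some values in b are missing from a, one possible extension is to check each insertion index after the fact. This is only a sketch under stated assumptions: lookup_with_mask is a hypothetical helper name, and the short keys k1, k2, ... are made-up stand-ins for the coordinate strings. Note that searchsorted can return len(a) for a value larger than every key, so the indices are clipped before use.

```python
import numpy as np

def lookup_with_mask(a, b_keys):
    """Hypothetical helper: for each key in b_keys, return a candidate row of a
    plus a boolean mask saying whether that row really matches the key."""
    order = np.argsort(a[:, 0])              # sort a by its first column
    sorted_keys = a[order, 0]
    pos = np.searchsorted(sorted_keys, b_keys)
    # An index equal to len(sorted_keys) would be out of bounds, so clip it.
    pos = np.clip(pos, 0, len(sorted_keys) - 1)
    found = sorted_keys[pos] == b_keys       # True only for exact matches
    return a[order[pos]], found

# Illustrative data (assumption): 'k9' does not occur in a.
a = np.array([['k1', '0'], ['k3', '2'], ['k2', '1']])
b = np.array([['k2'], ['k9'], ['k1'], ['k2']])

rows, found = lookup_with_mask(a, b.ravel())
c = rows[found]   # keep only the rows of b that have a match in a
print(found)
print(c)
```

Whether unmatched rows should be dropped, as here, or filled with a placeholder depends on what the output is used for.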
Answer 2 (score: 0)
import numpy as np
a = [['50.561872473 25.047160868 0.0', '0'],
['50.561905852 25.047537575 0.0', '1'],
['50.562232967 25.048109789 0.0', '2'],
['50.561940185 25.047914282 1.0', '5']]
b = [['50.561872473 25.047160868 0.0'],
['50.561905852 25.047537575 0.0'],
['50.561905852 25.047537575 0.0'],
['50.561905852 25.047537575 0.0'],
['50.562232967 25.048109789 0.0'],
['50.562232967 25.048109789 0.0'],
['50.561940185 25.047914282 1.0'],
['50.561940185 25.047914282 1.0'],
['50.561940185 25.047914282 1.0']]
a = np.array(a)
b = np.array(b)
Find where they match.
x = b == a[:,0]
>>> x
array([[ True, False, False, False],
[False, True, False, False],
[False, True, False, False],
[False, True, False, False],
[False, False, True, False],
[False, False, True, False],
[False, False, False, True],
[False, False, False, True],
[False, False, False, True]], dtype=bool)
Get the indices of the matches.
v = np.where(x)[1]
>>> v
array([0, 1, 1, 1, 2, 2, 3, 3, 3])
Use the indices to select rows from a.
s = a[v]
>>> s
array([['50.561872473 25.047160868 0.0', '0'],
['50.561905852 25.047537575 0.0', '1'],
['50.561905852 25.047537575 0.0', '1'],
['50.561905852 25.047537575 0.0', '1'],
['50.562232967 25.048109789 0.0', '2'],
['50.562232967 25.048109789 0.0', '2'],
['50.561940185 25.047914282 1.0', '5'],
['50.561940185 25.047914282 1.0', '5'],
['50.561940185 25.047914282 1.0', '5']],
dtype='|S29')
If there are duplicates in a, this may not produce what you want.
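To illustrate the duplicate problem: with a repeated key in a, the comparison matches more than once per row of b, so the index array ends up longer than b. One way around it, sketched below, is to keep only the first occurrence of each key before matching. The short keys and the choice of "first occurrence wins" are assumptions for illustration, not something the question specifies.

```python
import numpy as np

# Illustrative data (assumption): key 'k1' appears twice in a.
a = np.array([['k1', '0'], ['k2', '1'], ['k1', '7']])
b = np.array([['k1'], ['k2'], ['k1']])

# Naive approach: every duplicate in a yields an extra match.
v = np.where(b == a[:, 0])[1]
print(len(v), len(b))   # more indices than rows in b

# Keep the first occurrence of each key, preserving a's original order.
_, first = np.unique(a[:, 0], return_index=True)
a_dedup = a[np.sort(first)]

v_dedup = np.where(b == a_dedup[:, 0])[1]
s = a_dedup[v_dedup]
print(s)
```

Also keep in mind that b == a[:, 0] builds a boolean matrix with len(b) × len(a) entries, so with millions of rows on each side this approach is unlikely to fit in memory.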