Question

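I need to compare every row with every other row in a large dataframe (> 50000 rows). That is more than a billion comparisons, which is too expensive to compute on a pandas dataframe directly.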

So I load the values into lists and use a list comprehension to do the comparison:

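# This snippet runs inside a loop over dataframes, hence the `continue` below.
start = df['StartPos'].values.tolist()
end = df['EndPos'].values.tolist()
index = df.index.values.tolist()
# Keep every ordered pair (i, j), i != j, whose distance end[j] - start[i] lies in (0, 2000].
a = [(y-x, (i,j)) for i,x in enumerate(start) for j,y in enumerate(end) if (y-x) > 0 and (y-x) <= 2000 and i != j]
if len(a) == 0:
    continue
# Unzip into the distances and the two lists of row positions.
prod_sizes, rows = zip(*a)
row1,row2 = zip(*rows)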
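
The nested comprehension above can also be written with NumPy broadcasting. This is only a minimal sketch, not part of the original post: the three-row sample frame is made up for illustration, and the full difference matrix has n × n entries, so for 50000 rows it would have to be built in chunks to fit in memory.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({"StartPos": [100, 500, 900], "EndPos": [600, 1200, 5000]})
start = df["StartPos"].to_numpy()
end = df["EndPos"].to_numpy()

# diff[i, j] = end[j] - start[i], computed for all pairs at once.
diff = end[None, :] - start[:, None]

# Same condition as the comprehension: 0 < distance <= 2000 and i != j.
mask = (diff > 0) & (diff <= 2000)
np.fill_diagonal(mask, False)

# Row positions of the matching pairs, in the same (i, j) order as the nested loop.
row1, row2 = np.nonzero(mask)
prod_sizes = diff[row1, row2]
```

Here `row1` and `row2` are NumPy integer arrays rather than tuples, and can be passed to `.iloc` as they are.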

For each dataframe df, this gives me lists like the following:

>>> row1
(0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4)
>>> row2
(1, 2, 3, 4, 5, 2, 3, 5, 3, 5, 2, 5, 1, 2, 3, 5)

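Now I want to merge the original dataframe df with itself based on the values in row1 and row2. The output dataframe should look like this: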

0:   columns of row0 | columns of row1
1:   columns of row0 | columns of row2
2:   columns of row0 | columns of row3
3:   columns of row0 | columns of row4
4:   columns of row0 | columns of row5
5:   columns of row1 | columns of row2
6:   columns of row1 | columns of row3
...
15:  columns of row4 | columns of row5

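Does pandas have a way to perform a merge based on lists of row numbers, or should I simply loop, access the rows through .iloc, and append them to a new dataframe?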

Answer 1

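You can assign a helper key and merge on it to expand the full grid of row pairs (a cross join).

For example, suppose you have the dataframes below: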

df1=pd.DataFrame({'A':[1,2,3]})
df2=pd.DataFrame({'A':[1,2,3]})

We use assign to add a constant key, then merge on that key:

mergedf=df1.assign(key=1).reset_index().merge(df2.assign(key=1).reset_index(),on='key')
mergedf.loc[mergedf.index_y>mergedf.index_x] # keep only the pairs where the df2 row index is greater than the df1 row index

Out[497]: 
   index_x  A_x  key  index_y  A_y
1        0    1    1        1    2
2        0    1    1        2    3
5        1    2    1        2    3
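
If the row1 and row2 lists from the question already exist, the pairs can also be built by positional selection instead of a full cross join. The following is a minimal sketch rather than the answerer's method; the sample frame and the `_1` / `_2` suffixes are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({"StartPos": [10, 20, 30], "EndPos": [15, 25, 35]})
row1 = (0, 0, 1)
row2 = (1, 2, 2)

# .iloc treats a tuple as (row, column) indexers, so convert to lists first.
left = df.iloc[list(row1)].reset_index(drop=True).add_suffix("_1")
right = df.iloc[list(row2)].reset_index(drop=True).add_suffix("_2")

# Both halves now share a fresh 0..n-1 index, so concat aligns them row by row.
pairs = pd.concat([left, right], axis=1)
```

Each row of `pairs` holds the columns of the row1 entry followed by the columns of the row2 entry, which is the layout asked for in the question. This avoids materialising all n × n combinations, which for 50000 rows would be 2.5 billion rows before filtering.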
