Vectorized merge of two `pandas.DataFrame`s on a range of IDs

Time: 2019-03-15 00:24:20

Tags: python pandas merge range

I have two DataFrames, and I want to perform an operation that uses both of them as input.


DataFrame A: x1, y1, x2, y2 are the coordinates of a rectangle

+---+----+----------+----------+----------+----------+
|   | ID |    x1    |    y1    |    x2    |    y2    |
+---+----+----------+----------+----------+----------+
| 0 |  0 | 332833.5 | 502144.0 | 333214.5 | 502460.5 |
| 1 |  1 | 333537.5 | 502144.0 | 333918.5 | 502460.5 |
| 2 |  2 | 334945.5 | 502144.0 | 335326.5 | 502352.0 |
| 3 |  3 | 335713.5 | 502144.0 | 336094.5 | 502352.0 |
| 4 |  4 | 336417.5 | 502144.0 | 336798.5 | 502416.0 |
...
+---+----+----------+----------+----------+----------+

DataFrame B:

+---+-------------+-------------+
|   | min_matchID | max_matchID |
+---+-------------+-------------+
| 0 |           0 |           1 |
| 1 |           2 |           2 |
| 2 |           3 |           5 |
| 3 |           6 |           7 |
| 4 |           8 |           8 |
...
+---+-------------+-------------+

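For each row in B, which covers the IDs between min_matchID and max_matchID, I would like to:

  • look up in A the x1, y1, x2, y2 of every ID in range(min_matchID, max_matchID+1)
  • and construct a MultiPolygon instance from them (for example with the Python package shapely).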

A brute-force loop is the obvious approach, but it is too slow. Is there a vectorized way to do this?

1 Answer:

Answer 0: (score: 1)

First, you can use Index.repeat to repeat each row of B once per ID in its range, based on min_matchID and max_matchID.

import pandas as pd
import numpy as np
from shapely.geometry import MultiPolygon,box
# generate test data
A = pd.DataFrame({'ID':range(0,10000),'x1':range(10000,20000),'y1': range(50000, 60000)
                 ,'x2': range(10000, 20000), 'y2': range(50000, 60000)})
B = pd.DataFrame({'min_matchID':np.random.randint(0,10000,size=(10000))})
B['max_matchID'] = B['min_matchID'] + np.random.randint(0,10,size=(10000))

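# expand B: one output row per ID in [min_matchID, max_matchID]
B = B.reset_index()  # keep the original row number of B in an 'index' column
# repeat each row label once per ID covered by its range (inclusive)
idx = B.index.repeat(B.max_matchID - B.min_matchID + 1)
B = B.reindex(idx).reset_index(drop=True)
# cumcount gives 0, 1, 2, ... within each repeated row; add it to the range start
B['ID'] =  B['min_matchID'] + idx.to_series().groupby(idx).cumcount().values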
print(B)

       index  min_matchID  max_matchID    ID
0          0         6889         6891  6889
1          0         6889         6891  6890
2          0         6889         6891  6891
3          1         8299         8307  8299
4          1         8299         8307  8300
5          1         8299         8307  8301
6          1         8299         8307  8302
7          1         8299         8307  8303
...      ...          ...          ...   ...
54740   9998         4278         4282  4282
54741   9999         3061         3067  3061
54742   9999         3061         3067  3062
54743   9999         3061         3067  3063
54744   9999         3061         3067  3064
54745   9999         3061         3067  3065
54746   9999         3061         3067  3066
54747   9999         3061         3067  3067
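
As a cross-check of the expansion step, here is a minimal sketch of the same idea written with plain NumPy, run on the first three rows of B from the question. The names `B_small`, `lengths`, `starts`, `offset` and `expanded` are made up for this illustration and are not part of the answer's code.

```python
import numpy as np
import pandas as pd

# First three rows of B from the question (hypothetical small sample).
B_small = pd.DataFrame({"min_matchID": [0, 2, 3], "max_matchID": [1, 2, 5]})

# Number of IDs covered by each row (ranges are inclusive).
lengths = (B_small["max_matchID"] - B_small["min_matchID"] + 1).to_numpy()

# Row number of B, repeated once per covered ID.
row = np.repeat(B_small.index.to_numpy(), lengths)

# Position inside each range: 0, 1, ..., length - 1.
starts = np.cumsum(lengths) - lengths
offset = np.arange(lengths.sum()) - np.repeat(starts, lengths)

expanded = pd.DataFrame({
    "index": row,
    "ID": np.repeat(B_small["min_matchID"].to_numpy(), lengths) + offset,
})
print(expanded)
```

Each row of `expanded` pairs an original row number of B with one ID from that row's range, which is the shape the merge step needs.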

Then you can use pd.merge() to attach the coordinates from A.

result = pd.merge(B,A,on='ID',how='left')
print(result)
       index  min_matchID  max_matchID    ID       x1       y1       x2       y2
0          0         6889         6891  6889  16889.0  56889.0  16889.0  56889.0
1          0         6889         6891  6890  16890.0  56890.0  16890.0  56890.0
2          0         6889         6891  6891  16891.0  56891.0  16891.0  56891.0
3          1         8299         8307  8299  18299.0  58299.0  18299.0  58299.0
4          1         8299         8307  8300  18300.0  58300.0  18300.0  58300.0
5          1         8299         8307  8301  18301.0  58301.0  18301.0  58301.0
6          1         8299         8307  8302  18302.0  58302.0  18302.0  58302.0
7          1         8299         8307  8303  18303.0  58303.0  18303.0  58303.0
...      ...          ...          ...   ...      ...      ...      ...      ...
54740   9998         4278         4282  4282  14282.0  54282.0  14282.0  54282.0
54741   9999         3061         3067  3061  13061.0  53061.0  13061.0  53061.0
54742   9999         3061         3067  3062  13062.0  53062.0  13062.0  53062.0
54743   9999         3061         3067  3063  13063.0  53063.0  13063.0  53063.0
54744   9999         3061         3067  3064  13064.0  53064.0  13064.0  53064.0
54745   9999         3061         3067  3065  13065.0  53065.0  13065.0  53065.0
54746   9999         3061         3067  3066  13066.0  53066.0  13066.0  53066.0
54747   9999         3061         3067  3067  13067.0  53067.0  13067.0  53067.0
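
One thing to watch with how='left': an ID that does not exist in A stays in the result with NaN coordinates. In the generated test data, max_matchID can exceed the largest ID in A, which would explain why the coordinates print as floats. The sketch below, on made-up frames `A_small` and `expanded`, shows one possible way to spot such rows using the `indicator` option of pd.merge; whether to drop or report them depends on your data.

```python
import pandas as pd

# Hypothetical small inputs: A has IDs 0..2, but the expanded ranges also ask for ID 3.
A_small = pd.DataFrame({
    "ID": [0, 1, 2],
    "x1": [0, 10, 20], "y1": [0, 0, 0],
    "x2": [5, 15, 25], "y2": [5, 5, 5],
})
expanded = pd.DataFrame({"index": [0, 0, 1, 1], "ID": [0, 1, 2, 3]})

# indicator=True adds a '_merge' column telling where each row came from.
merged = pd.merge(expanded, A_small, on="ID", how="left", indicator=True)

# IDs requested by B that have no rectangle in A.
missing = merged.loc[merged["_merge"] == "left_only", "ID"].tolist()

# Keep only rows that actually found coordinates.
matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
print(missing)
print(matched)
```

Using how='inner' instead would drop the unmatched rows silently, which may or may not be what you want.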

Finally, you can group by index (the original row number of B) and build one MultiPolygon per group.

result = result.groupby('index').apply(lambda x:MultiPolygon([box(x1,y1,x2,y2) for x1,y1,x2,y2 in zip(x.x1,x.y1,x.x2,x.y2)]))
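
To make the grouping step easier to check by hand, here is a small sketch on a made-up merged frame with three rectangles of size 5 x 5. It iterates over the groups explicitly instead of using apply, which is only a stylistic assumption of this illustration; the names `merged` and `polygons` are hypothetical.

```python
import pandas as pd
from shapely.geometry import MultiPolygon, box

# Hypothetical merge result: rows 0 and 1 belong to B row 0, row 2 to B row 1.
merged = pd.DataFrame({
    "index": [0, 0, 1],
    "x1": [0.0, 10.0, 20.0],
    "y1": [0.0, 0.0, 0.0],
    "x2": [5.0, 15.0, 25.0],
    "y2": [5.0, 5.0, 5.0],
})

cols = ["x1", "y1", "x2", "y2"]

# One MultiPolygon per original row of B; box takes (minx, miny, maxx, maxy).
polygons = {
    key: MultiPolygon([
        box(*coords)
        for coords in group[cols].itertuples(index=False, name=None)
    ])
    for key, group in merged.groupby("index")
}

for key, multi in polygons.items():
    print(key, len(multi.geoms), multi.area)
```

The per-group construction is still a Python-level loop, so the speed-up comes mainly from the vectorized expansion and merge rather than from this last step.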