Based on pandas

Time: 2017-11-07 21:06:11

Tags: python pandas dataframe

I have two large dataframes in pandas, for example:

import pandas as pd
df = pd.DataFrame({'start' : [5, 10, 15, 20], 'stop' : [10, 20, 30, 40]})   
df2 = pd.DataFrame({'id':[6, 7, 8, 12, 13, 17, 19, 25, 38, 39, 40]})

I want to merge them so that start and stop are appended to a third dataframe whenever id falls within range(start, stop):

df3 = pd.DataFrame({'id':[6, 7, 8, 12, 13, 17, 19, 25, 38, 39, 40], 'start':[5, 5, 5, 10, 10, 10, 10, 20, 30, 30, 30], 'stop':[10, 10, 10, 20, 20, 20, 20, 30, 40, 40, 40]})

I tried:

df3['start'] = pd.Series([0 for i in range(0, len(df2['id']))])
df3['stop'] = pd.Series([0 for i in range(0, len(df2['id']))])
for i in range(0, len(df2['id'])):
    if df['start'][i] < df1['id'][i] < df['stop'][i]:
        df['start'][i] = df3['start'][i]
        df['stop'][i] = df3['stop'][i]

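But this gives me an error. Could someone point out where I went wrong and how to get the desired dataframe? Also, is it always necessary to initialize a new variable with pd.Series, as I did above? Thanks!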

1 answer:

Answer 0: (score: 0)

Assuming df is sorted by stop (searchsorted requires the array being searched to be sorted), you can use searchsorted:

df2.join(df.iloc[df.stop.searchsorted(df2.id)].set_index(df2.index))

    id  start  stop
0    6      5    10
1    7      5    10
2    8      5    10
3   12     10    20
4   13     10    20
5   17     10    20
6   19     10    20
7   25     15    30
8   38     20    40
9   39     20    40
10  40     20    40
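
One caveat with the searchsorted lookup: an id larger than every stop maps to position len(df), which df.iloc rejects as out of bounds, and an id smaller than the start of its candidate row is still matched. Below is a minimal sketch of one way to guard against both, assuming unmatched ids should be kept with NaN. The demo frame df2_demo, the names pos, safe and valid, and the inclusive start <= id test are illustrative assumptions, not part of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'start': [5, 10, 15, 20], 'stop': [10, 20, 30, 40]})
# Hypothetical ids: 3 lies below every start, 45 lies above every stop.
df2_demo = pd.DataFrame({'id': [3, 6, 25, 40, 45]})

ids = df2_demo['id'].to_numpy()
starts = df['start'].to_numpy()
stops = df['stop'].to_numpy()

# Position of the first stop >= id (side='left' is the default).
pos = stops.searchsorted(ids)

# Ids above every stop get position len(df); replace it with 0 as a
# placeholder so the start comparison below can index safely.
in_bounds = pos < len(df)
safe = np.where(in_bounds, pos, 0)

# Keep only ids that also reach the start of the candidate row.
valid = in_bounds & (starts[safe] <= ids)

matched = df.iloc[pos[valid]].set_index(df2_demo.index[valid])

# The left join keeps unmatched ids, with NaN for start and stop.
result = df2_demo.join(matched)
print(result)
```

Because side='left' is used, an id equal to a stop value is assigned to that row; passing side='right' to searchsorted would instead assign it to the next row, which is closer to the half-open range(start, stop) in the question.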

Alternatively, we can work with the underlying numpy arrays and apply the same logic:

import numpy as np

stop = df.stop.values
ids = df2.id.values
v = df.values

pd.DataFrame(
    np.column_stack([
        ids, v[stop.searchsorted(ids)]
    ]),
    columns=['id', 'start', 'stop']
)

    id  start  stop
0    6      5    10
1    7      5    10
2    8      5    10
3   12     10    20
4   13     10    20
5   17     10    20
6   19     10    20
7   25     15    30
8   38     20    40
9   39     20    40
10  40     20    40
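
For comparison, the same "first stop at or above id" lookup can also be sketched with pd.merge_asof, which is a different technique from the searchsorted approach above. This is only an illustration under stated assumptions: both frames are sorted on their keys, the key columns share a dtype, and the id list includes 25 so that it lines up with the output shown above.

```python
import pandas as pd

df = pd.DataFrame({'start': [5, 10, 15, 20], 'stop': [10, 20, 30, 40]})
# Assumption: 25 is included so the ids match the output shown above.
df2 = pd.DataFrame({'id': [6, 7, 8, 12, 13, 17, 19, 25, 38, 39, 40]})

# direction='forward' pairs each id with the first row of df whose
# stop is greater than or equal to that id.
out = pd.merge_asof(
    df2.sort_values('id'),
    df.sort_values('stop'),
    left_on='id',
    right_on='stop',
    direction='forward',
)
print(out)
```

Unlike the iloc-based lookup, an id with no stop at or above it does not raise here; its start and stop come back as NaN.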