I have two large DataFrames in pandas, for example:
import pandas as pd
df = pd.DataFrame({'start' : [5, 10, 15, 20], 'stop' : [10, 20, 30, 40]})
df2 = pd.DataFrame({'id':[6, 7, 8, 12, 13, 17, 19, 38, 39, 40]})
I want to merge them so that, whenever an id lies between start and stop, the id and its matching range(start, stop) are appended to a third DataFrame:
df3 = pd.DataFrame({'id':[6, 7, 8, 12, 13, 17, 19, 25, 38, 39, 40], 'start':[5, 5, 5, 10, 10, 10, 10, 20, 30, 30, 30], 'stop':[10, 10, 10, 20, 20, 20, 20, 30, 40, 40, 40]})
I tried:
df3['start'] = pd.Series([0 for i in range(0, len(df2['id']))])
df3['stop'] = pd.Series([0 for i in range(0, len(df2['id']))])
for i in range(0, len(df2['id'])):
    if df['start'][i] < df1['id'][i] < df['stop'][i]:
        df['start'][i] = df3['start'][i]
        df['stop'][i] = df3['stop'][i]
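But this gives me an error. Could someone point out where I went wrong and how to get the desired DataFrame? Also, is it always necessary to initialize a new column with pd.Series, as I did above? Thanks!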
Answer 0 (score: 0)
Assuming df is sorted by stop (searchsorted requires the searched column to be in ascending order), you can use searchsorted:
df2.join(df.iloc[df.stop.searchsorted(df2.id)].set_index(df2.index))
    id  start  stop
0    6      5    10
1    7      5    10
2    8      5    10
3   12     10    20
4   13     10    20
5   17     10    20
6   19     10    20
7   25     15    30
8   38     20    40
9   39     20    40
10  40     20    40
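To make the one-liner easier to follow, here is a minimal sketch of what searchsorted returns for these inputs. It assumes df2 also contains the id 25, as the output above does, and the names pos, result, and inside are hypothetical helpers introduced only for illustration. Note that searchsorted only compares against stop, so the sketch adds a separate check against start.

```python
import pandas as pd

df = pd.DataFrame({'start': [5, 10, 15, 20], 'stop': [10, 20, 30, 40]})
# Assumption: df2 includes id 25, matching the output shown above.
df2 = pd.DataFrame({'id': [6, 7, 8, 12, 13, 17, 19, 25, 38, 39, 40]})

# For each id, find the position of the first row whose stop is >= id
# (side='left' is the default). df['stop'] must be sorted ascending.
pos = df['stop'].to_numpy().searchsorted(df2['id'].to_numpy())

# Select those rows of df and realign them to df2's index before joining.
result = df2.join(df.iloc[pos].set_index(df2.index))

# searchsorted never looks at start, so verify the lower bound separately.
inside = (result['start'] < result['id']) & (result['id'] <= result['stop'])
```

If any id were larger than the last stop, pos would contain len(df) and df.iloc would raise an IndexError, so it may be worth filtering such ids out first.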
Alternatively, we can work with the underlying NumPy arrays and apply the same logic:
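import numpy as np

stop = df.stop.values  # sorted upper bounds
ids = df2.id.values
v = df.values          # rows of (start, stop)
pd.DataFrame(
    np.column_stack([
        # for each id, take the first row whose stop is >= id
        ids, v[stop.searchsorted(ids)]
    ]),
    columns=['id', 'start', 'stop']
)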
    id  start  stop
0    6      5    10
1    7      5    10
2    8      5    10
3   12     10    20
4   13     10    20
5   17     10    20
6   19     10    20
7   25     15    30
8   38     20    40
9   39     20    40
10  40     20    40
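As a possible alternative (not part of the approach above), pd.merge_asof can express the same "first stop at or after id" lookup. This is only a sketch under the same assumptions: df is sorted by stop, df2 is sorted by id and includes 25, and merged is a hypothetical name. The final filter on start is an addition that drops ids falling outside their matched range.

```python
import pandas as pd

df = pd.DataFrame({'start': [5, 10, 15, 20], 'stop': [10, 20, 30, 40]})
# Assumption: df2 includes id 25, matching the output shown above.
df2 = pd.DataFrame({'id': [6, 7, 8, 12, 13, 17, 19, 25, 38, 39, 40]})

# direction='forward' matches each id with the first row whose stop >= id.
# Both key columns must already be sorted in ascending order.
merged = pd.merge_asof(df2, df, left_on='id', right_on='stop',
                       direction='forward')

# Keep only rows where the id is also above start; ids with no match
# get NaN in start/stop and are dropped by this comparison as well.
merged = merged[merged['start'] < merged['id']]
```

Unlike the iloc-based version, an id beyond the last stop does not raise here; it simply ends up unmatched and is filtered out.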