Selecting rows from DataFrame B based on row values in DataFrame A

Date: 2018-06-04 14:05:39

Tags: python pandas

I have two DataFrames. DataFrame A is:

[distance]      [measure]
17442.77000     32.792658
17442.95100     32.792658
17517.49200     37.648482
17518.29600     37.648482
17565.77600     38.287118
17565.88800     38.287118
17596.93700     41.203340
17597.29700     41.203340
17602.16400     41.477979
17602.83900     41.612774
17618.16400     42.479890
17618.71100     42.681591

and DataFrame B is:

[mileage]      [Driver]
17442.8         name1
17517.5         name2
17565.8         name3
17597.2         name4
17602.5         name5
17618.4         name6
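
For reference, a sketch of building these two frames in pandas (the names df_a and df_b are assumed here, chosen to match the answer below):

import pandas as pd

# DataFrame A: distance / measure pairs as listed above
df_a = pd.DataFrame({
    'distance': [17442.770, 17442.951, 17517.492, 17518.296, 17565.776, 17565.888,
                 17596.937, 17597.297, 17602.164, 17602.839, 17618.164, 17618.711],
    'measure':  [32.792658, 32.792658, 37.648482, 37.648482, 38.287118, 38.287118,
                 41.203340, 41.203340, 41.477979, 41.612774, 42.479890, 42.681591],
})

# DataFrame B: mileage / Driver pairs as listed above
df_b = pd.DataFrame({
    'mileage': [17442.8, 17517.5, 17565.8, 17597.2, 17602.5, 17618.4],
    'Driver':  ['name1', 'name2', 'name3', 'name4', 'name5', 'name6'],
})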

For each row of B, I want to find the rows of A whose [distance] values sit immediately on either side of that row's [mileage] value, so that the combined data looks like this:

17442.77000     32.792658    17442.8    name1
17442.95100     32.792658
17517.49200     37.648482    17517.5    name2
17518.29600     37.648482
.
.

That way I can apply the following function with a rolling window of size 3:

def f(x):
    # Intended as linear interpolation: estimate the measure at the middle row's
    # distance from the measure values of the rows on either side of it.
    return df.iloc[0, 1] + (df.iloc[2, 1] - df.iloc[0, 1]) * \
        ((df.iloc[1, 0] - df.iloc[0, 0]) / (df.iloc[2, 0] - df.iloc[0, 0]))

# Roll a window of 3 over the combined frame and keep every third result.
a = df.rolling(window=3, min_periods=1).apply(f)[::3].reset_index(drop=True)

So far I have been concatenating the two DataFrames and sorting the values to produce the triplets above, but this breaks down when two values from A[distance] fall within the distance range of a single mileage value. Any tips/suggestions are much appreciated!
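
For context, a minimal sketch of that concat-and-sort interleaving (a reconstruction, not the exact code from the question), reusing df_a and df_b from above:

# Give B's columns the same names as A's so the rows interleave when sorted.
# When the A rows do not fall neatly on either side of each mileage value,
# the A-B-A triplet pattern that the rolling window relies on is lost.
combined = pd.concat([
    df_a,
    df_b.rename(columns={'mileage': 'distance', 'Driver': 'measure'}),
]).sort_values('distance').reset_index(drop=True)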

1 Answer:

Answer 0 (score: 1):

I think you can use merge_asof with its direction parameter, together with drop_duplicates, along the following lines:

# merge_asof requires both frames to be sorted on their join keys, which is
# already the case for df_a['distance'] and df_b['mileage'].

# Match each A row to the closest mileage at or below its distance, then keep
# only the first A row per mileage (the A row just after each B row).
df_before = pd.merge_asof(df_a, df_b,
                          left_on='distance',
                          right_on='mileage',
                          direction='backward')\
              .drop_duplicates(['mileage', 'Driver'], keep='first')[['distance', 'measure']]

# Match each A row to the closest mileage at or above its distance, then keep
# only the last A row per mileage (the A row just before each B row).
df_after = pd.merge_asof(df_a, df_b,
                         left_on='distance',
                         right_on='mileage',
                         direction='forward')\
             .drop_duplicates(['mileage', 'Driver'], keep='last')[['distance', 'measure']]

# Give B the same column names so its rows slot in between.
df_middle = df_b.rename(columns={'Driver': 'measure', 'mileage': 'distance'})

# Interleave everything by distance; drop_duplicates guards against an A row
# that appears in both asof merges.
pd.concat([df_before, df_middle, df_after]).sort_values('distance').drop_duplicates()

Output:

     distance  measure
0   17442.770  32.7927
0   17442.800    name1
1   17442.951  32.7927
2   17517.492  37.6485
1   17517.500    name2
3   17518.296  37.6485
4   17565.776  38.2871
2   17565.800    name3
5   17565.888  38.2871
6   17596.937  41.2033
3   17597.200    name4
7   17597.297  41.2033
8   17602.164   41.478
4   17602.500    name5
9   17602.839  41.6128
10  17618.164  42.4799
5   17618.400    name6
11  17618.711  42.6816
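
As a follow-up (not part of the original answer): if the end goal is the interpolation that the question's rolling-window function was aiming for, here is a rough sketch that computes it directly with two asof merges of B against A, assuming df_a and df_b as constructed above:

# Closest A row at or below / at or above each mileage value.
below = pd.merge_asof(df_b, df_a, left_on='mileage', right_on='distance',
                      direction='backward')
above = pd.merge_asof(df_b, df_a, left_on='mileage', right_on='distance',
                      direction='forward')

# Same linear interpolation as f() above: y0 + (y2 - y0) * (x1 - x0) / (x2 - x0).
# (An exact match, where below and above are the same row, would divide by zero
# and need special handling.)
frac = (df_b['mileage'] - below['distance']) / (above['distance'] - below['distance'])
result = df_b.assign(interp_measure=below['measure'] + frac * (above['measure'] - below['measure']))
print(result)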