Merging two pandas DataFrames with a complex condition

Time: 2017-07-21 11:31:46

Tags: python pandas dataframe

I want to merge two DataFrames. Consider the following two dfs:

DF1:

id_A,           ts_A,    course,     weight
id1, 2017-04-27 01:35:30, cotton,      3.5
id1, 2017-04-27 01:36:05, cotton,      3.5
id1, 2017-04-27 01:36:55, cotton,      3.5
id1, 2017-04-27 01:37:20, cotton,      3.5
id2, 2017-04-27 02:35:35, cotton blue, 5.0
id2, 2017-04-27 02:36:00, cotton blue, 5.0
id2, 2017-04-27 02:36:35, cotton blue, 5.0
id2, 2017-04-27 02:37:20, cotton blue, 5.0

DF2:

id_B,  ts_B,                 value
id1,   2017-03-27 01:25:40,  100
id1,   2017-03-27 01:25:50,  200
id1,   2017-03-27 01:25:50,  230
id1,   2017-04-27 01:35:40,  240
id1,   2017-04-27 01:35:50,  200
id1,   2017-04-27 01:36:00,  350
id1,   2017-04-27 01:36:10,  400
id1,   2017-04-27 01:36:20,  500
id1,   2017-04-27 01:36:30,  600
id1,   2017-04-27 01:36:40,  700
id1,   2017-04-27 01:36:50,  800
id1,   2017-04-27 01:37:00,  900
id1,   2017-04-27 01:37:10, 1000
id2,   2017-04-27 02:35:40,  1000
id2,   2017-04-27 02:35:50,  2000
id2,   2017-04-27 02:36:00,  4500
id2,   2017-04-27 02:36:10,  3000
id2,   2017-04-27 02:36:20,  6000
id2,   2017-04-27 02:36:30,  5000
id2,   2017-04-27 02:36:40,  5022
id2,   2017-04-27 02:36:50,  5040
id2,   2017-04-27 02:37:00,  3200
id2,   2017-04-27 02:37:10,  9000

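df1 should be merged with df2 so that the following condition holds: taking the time interval between two consecutive rows of df1, I want to merge each df1 row with the mean of all df2 rows that fall within that interval. For example,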

id_A,           ts_A,    course,     weight
id1, 2017-04-27 01:35:30, cotton,      3.5

should be merged with

id_B,  ts_B,                 value
id1,   2017-04-27 01:35:40,  240
id1,   2017-04-27 01:35:50,  200
id1,   2017-04-27 01:36:00,  350

to obtain

id_A,           ts_A,    course,     weight, avgValue
id1, 2017-04-27 01:35:30, cotton,      3.5,    263.3

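I tried approaching the problem from the other side, attaching df1 rows to df2 with merge_asof (which would also keep df2 rows that have no match in df1), but I can't get the right result: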

pd.merge_asof(df2_sorted, df1, left_on='ts_B', right_on='ts_A', left_by='id_B', right_by='id_A', direction='backward')

1 answer:

Answer 0 (score: 1)

I think you need merge_asof, but with reset_index first, so that each row of df1 gets a unique value:

df1 = df1.reset_index(drop=True)
print (df1.index)
RangeIndex(start=0, stop=8, step=1)

df = pd.merge_asof(df2_sorted, 
                   df1.reset_index(), 
                   left_on='ts_B', 
                   right_on='ts_A', 
                   left_by='id_B', 
                   right_by='id_A')
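
To see what the index column contributes, here is a minimal sketch on a hypothetical three-row slice of the data (the names left, right and tagged are illustrative and not part of the original answer). It assumes the timestamp columns are real datetimes and that both frames are sorted by their time keys, which merge_asof requires. Each df2 row is tagged with the position of the closest earlier df1 row that has the same id.

```python
import pandas as pd

# Hypothetical miniature versions of df1 and df2 (a few id1 rows only).
df1 = pd.DataFrame({
    "id_A": ["id1", "id1"],
    "ts_A": pd.to_datetime(["2017-04-27 01:35:30", "2017-04-27 01:36:05"]),
})
df2 = pd.DataFrame({
    "id_B": ["id1", "id1", "id1"],
    "ts_B": pd.to_datetime(["2017-04-27 01:35:40",
                            "2017-04-27 01:36:00",
                            "2017-04-27 01:36:10"]),
    "value": [240, 350, 400],
})

# merge_asof needs both inputs sorted by their time keys.
left = df2.sort_values("ts_B")
# reset_index() turns the row position of df1 into a column named "index".
right = df1.sort_values("ts_A").reset_index(drop=True).reset_index()

# The default direction is "backward": each ts_B is matched to the
# last ts_A that is <= ts_B within the same id.
tagged = pd.merge_asof(left, right,
                       left_on="ts_B", right_on="ts_A",
                       left_by="id_B", right_by="id_A")
print(tagged[["ts_B", "value", "index", "ts_A"]])
```

The first two df2 rows land on df1 row 0 and the third on df1 row 1, so the index column identifies the df1 interval each measurement belongs to.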

Then group by the output columns (don't forget the index column) and aggregate with mean:

df = (df.groupby(['id_A','ts_A', 'course', 'weight', 'index'], as_index=False)['value']
        .mean()
        .drop('index', axis=1))
print (df)
  id_A                ts_A       course  weight        value
0  id1 2017-04-27 01:35:30       cotton     3.5   263.333333
1  id1 2017-04-27 01:36:05       cotton     3.5   600.000000
2  id1 2017-04-27 01:36:55       cotton     3.5   950.000000
3  id2 2017-04-27 02:35:35  cotton blue     5.0  1500.000000
4  id2 2017-04-27 02:36:00  cotton blue     5.0  4625.000000
5  id2 2017-04-27 02:36:35  cotton blue     5.0  5565.500000
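
Note that the result above has six rows: the last df1 row of each id has no df2 rows after it, so it drops out of the groupby. If those rows should be kept, as the question's "merge into df1" wording suggests, one possible variation is to compute the means per index and join them back onto df1. The sketch below rebuilds the sample data from the question; the avgValue column name comes from the question, while matched, avg and result are illustrative names, and the join-back step is an assumption about the desired output rather than part of the original answer.

```python
from io import StringIO
import pandas as pd

df1 = pd.read_csv(StringIO("""id_A,ts_A,course,weight
id1,2017-04-27 01:35:30,cotton,3.5
id1,2017-04-27 01:36:05,cotton,3.5
id1,2017-04-27 01:36:55,cotton,3.5
id1,2017-04-27 01:37:20,cotton,3.5
id2,2017-04-27 02:35:35,cotton blue,5.0
id2,2017-04-27 02:36:00,cotton blue,5.0
id2,2017-04-27 02:36:35,cotton blue,5.0
id2,2017-04-27 02:37:20,cotton blue,5.0
"""), parse_dates=["ts_A"])

df2 = pd.read_csv(StringIO("""id_B,ts_B,value
id1,2017-03-27 01:25:40,100
id1,2017-03-27 01:25:50,200
id1,2017-03-27 01:25:50,230
id1,2017-04-27 01:35:40,240
id1,2017-04-27 01:35:50,200
id1,2017-04-27 01:36:00,350
id1,2017-04-27 01:36:10,400
id1,2017-04-27 01:36:20,500
id1,2017-04-27 01:36:30,600
id1,2017-04-27 01:36:40,700
id1,2017-04-27 01:36:50,800
id1,2017-04-27 01:37:00,900
id1,2017-04-27 01:37:10,1000
id2,2017-04-27 02:35:40,1000
id2,2017-04-27 02:35:50,2000
id2,2017-04-27 02:36:00,4500
id2,2017-04-27 02:36:10,3000
id2,2017-04-27 02:36:20,6000
id2,2017-04-27 02:36:30,5000
id2,2017-04-27 02:36:40,5022
id2,2017-04-27 02:36:50,5040
id2,2017-04-27 02:37:00,3200
id2,2017-04-27 02:37:10,9000
"""), parse_dates=["ts_B"])

# Both frames must be sorted by their time keys for merge_asof.
df1 = df1.sort_values("ts_A").reset_index(drop=True)
df2_sorted = df2.sort_values("ts_B")

# Tag every df2 row with the position of the df1 row that opens its interval.
tagged = pd.merge_asof(df2_sorted, df1.reset_index(),
                       left_on="ts_B", right_on="ts_A",
                       left_by="id_B", right_by="id_A")

# df2 rows that precede every df1 row of their id get NaN; drop them.
matched = tagged.dropna(subset=["index"]).astype({"index": "int64"})

# Mean value per df1 row, joined back so unmatched df1 rows keep NaN.
avg = matched.groupby("index")["value"].mean().rename("avgValue")
result = df1.join(avg)
print(result)
```

One caveat with this approach: the last df1 row of each id has an open-ended interval, so any later df2 rows with the same id would all be attributed to it.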