Averaging data when merging

Time: 2019-07-17 20:19:15

Tags: python-3.x pandas

I have two DataFrames, as shown below:

 result1

     time         browncarbon          blackcarbon
 180.7452   0.506824055392119   0.4693240205237933
 180.748    0.5040641475588111  0.4671092323195378
 180.7508   0.49911820575405846 0.46344714546409305
 180.7535   0.4957944583911674  0.46030629341216533
 180.7563   0.4888745617073804  0.45557451231658985
 180.7591   0.4864626914800723  0.45633142113414893
 180.7619   0.48328511735148877 0.4548510376145042
 180.7646   0.484728828747634   0.4572818652186026
 180.7674   0.4840750981022636  0.45772491443336777
 180.7702   0.4843291425046101  0.4588332952196751

 422 rows x 3 columns

 result2

    start        end      toc 
 180.7452   180.7466    192.0
 180.7438   180.7452    194.0
 180.7424   180.7438    199.0
  180.741   180.7424    208.0
 180.7396   180.741     229.0
 180.7383   180.7396    245.0
 180.7369   180.7383    252.0
 180.7355   180.7369    245.0
 180.7341   180.7355    238.0
 180.7327   180.7341    245.0

 1364 rows x 3 columns

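When several start/end rows fall within one time row, that time row should get a single toc value: the mean of the toc values of those rows. How can I do this? There is a related answer on Stack Overflow: Merging two pandas dataframes with complex conditions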

result3

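import numpy as np
import pandas as pd

result1['rank'] = np.arange(len(result1))  # remember the original row order
# merge_asof requires both frames to be sorted on their keys
result3 = pd.merge_asof(result1.sort_values('time'), result2.sort_values('start'), left_on='time', right_on='start')
result3 = result3.sort_values('rank').drop(['rank', 'start', 'end'], axis=1)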

    time          browncarbon          blackcarbon    toc
180.7452    0.506824055392119   0.4693240205237933
 180.748    0.5040641475588111  0.4671092323195378
180.7508    0.49911820575405846 0.46344714546409305
180.7535    0.4957944583911674  0.46030629341216533
180.7563    0.4888745617073804  0.45557451231658985
180.7591    0.4864626914800723  0.45633142113414893
180.7619    0.48328511735148877 0.4548510376145042
180.7646    0.484728828747634   0.4572818652186026
180.7674    0.4840750981022636  0.45772491443336777
180.7702    0.4843291425046101  0.4588332952196751

422 rows X 4 columns

2 Answers:

Answer 0 (score: 0)

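Use a cross join to build every combination of rows, filter with boolean indexing and Series.between, aggregate with mean, and finally use DataFrame.join to attach the result to the original: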

df = result1.assign(a=1).merge(result2.assign(a=1), on='a', how='outer')

s=df[df['time'].between(df['start'],df['end'])].groupby(result1.columns.tolist())['toc'].mean()
df = result1.join(s, result1.columns.tolist())
print (df)
       time  browncarbon  blackcarbon    toc
0  180.7452     0.506824     0.469324  193.0
1  180.7480     0.504064     0.467109    NaN
2  180.7508     0.499118     0.463447    NaN
3  180.7535     0.495794     0.460306    NaN
4  180.7563     0.488875     0.455575    NaN
5  180.7591     0.486463     0.456331    NaN
6  180.7619     0.483285     0.454851    NaN
7  180.7646     0.484729     0.457282    NaN
8  180.7674     0.484075     0.457725    NaN
9  180.7702     0.484329     0.458833    NaN
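
The same steps can be sketched as a self-contained script. The two small frames below are hypothetical samples copied from the first rows shown in the question, and `how="cross"` (available in pandas 1.2 and later) is swapped in for the dummy key `a`; treat it as an illustration of the approach rather than the answer's exact code.

```python
import pandas as pd

# Hypothetical samples shaped like result1 and result2 from the question.
result1 = pd.DataFrame({
    "time": [180.7452, 180.7480, 180.7508],
    "browncarbon": [0.506824, 0.504064, 0.499118],
    "blackcarbon": [0.469324, 0.467109, 0.463447],
})
result2 = pd.DataFrame({
    "start": [180.7452, 180.7438, 180.7424],
    "end": [180.7466, 180.7452, 180.7438],
    "toc": [192.0, 194.0, 199.0],
})

# Every (result1 row, result2 row) pair; requires pandas 1.2 or later.
pairs = result1.merge(result2, how="cross")

# Keep the pairs where time lies in [start, end], inclusive on both ends.
inside = pairs[pairs["time"].between(pairs["start"], pairs["end"])]

# Average toc per original result1 row, then attach it to result1.
keys = result1.columns.tolist()
s = inside.groupby(keys)["toc"].mean()
out = result1.join(s, on=keys)  # rows with no matching interval get NaN
print(out)
```

The cross join materialises len(result1) * len(result2) rows (422 * 1364 for the frames in the question), which is fine at this size but grows quickly for larger inputs.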

Answer 1 (score: 0)

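jezrael's answer above is good, but I would add that grouping by columns that may contain NaN values will drop those records. I would group by time only and then put the resulting Series into a new DataFrame: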

df_aux = result1.assign(a=1).merge(result2.assign(a=1), on='a', how='outer')
series_aux = df_aux[df_aux['time'].between(df_aux['start'], df_aux['end'])].groupby('time')['toc'].mean()

This returns a pandas Series, which you can then merge with whatever data from result1 you want to keep.
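
A minimal sketch of the NaN point and of the final merge step, assuming small hypothetical frames in which one browncarbon value is deliberately set to NaN. The names `matched`, `by_all`, and `merged` are illustrative and do not appear in the answers.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the NaN in browncarbon is placed there on purpose.
result1 = pd.DataFrame({
    "time": [180.7452, 180.7480],
    "browncarbon": [np.nan, 0.504064],
    "blackcarbon": [0.469324, 0.467109],
})
result2 = pd.DataFrame({
    "start": [180.7452, 180.7438, 180.7466],
    "end": [180.7466, 180.7452, 180.7480],
    "toc": [192.0, 194.0, 200.0],
})

# Cross join through a dummy key, then keep pairs with start <= time <= end.
df_aux = result1.assign(a=1).merge(result2.assign(a=1), on="a").drop(columns="a")
matched = df_aux[df_aux["time"].between(df_aux["start"], df_aux["end"])]

# Grouping by every result1 column loses the row whose browncarbon is NaN,
# because groupby excludes NaN keys by default.
by_all = matched.groupby(result1.columns.tolist())["toc"].mean()

# Grouping by time alone keeps both rows.
series_aux = matched.groupby("time")["toc"].mean()

# Merge the Series back; how="left" keeps result1 rows without any match.
merged = result1.merge(series_aux.reset_index(), on="time", how="left")
print(merged)
```

Grouping by time alone assumes the time values in result1 are unique; if they can repeat, grouping by the index of result1 would be a safer key.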