I have many DataFrames of equal length that share the same datetime index:
Date OPP
0 2008-01-04 0.0
1 2008-02-04 0.0
2 2008-03-04 0.0
3 2008-04-04 0.0
4 2008-05-04 0.0
5 2008-06-04 0.0
6 2008-07-04 393.75
7 2008-08-04 -168.75
8 2008-09-04 -656.25
9 2008-10-04 -1631.25
Date OPP
0 2008-01-04 750.0
1 2008-02-04 0.0
2 2008-03-04 150.0
3 2008-04-04 600.0
4 2008-05-04 0.0
5 2008-06-04 0.0
6 2008-07-04 0.0
7 2008-08-04 -250.0
8 2008-09-04 1000.0
9 2008-10-04 0.0
I need to create a single DataFrame that sums the OPP columns of all these DataFrames. This is easy to do:
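# Build the result as a DataFrame; df1["OPP"] + df2["OPP"] alone is only a Series
df3 = pd.DataFrame({"Date": df1["Date"], "OPP": df1["OPP"] + df2["OPP"]})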
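This works as long as all the DataFrames have the same length and the same dates.
How can I make it work even when those conditions are not met? What if I have another DataFrame like this: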
Date OPP
0 2008-07-04 393.75
1 2008-08-04 -168.75
2 2008-09-04 -656.25
3 2008-10-04 -1631.25
4 2008-11-04 -675.00
5 2008-12-04 0.00
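I could do this manually: find the earliest and the latest date across all the dfs, pad each df with the missing dates and zeros so that they all have equal length, and then simply sum.
But is there a way to do this automatically in Pandas?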
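The manual procedure described above can be sketched as follows. This is only an illustration, assuming Date holds ISO-formatted strings; `df_a` and `df_b` are illustrative names for trimmed excerpts of the sample frames, not names from the question.

```python
import pandas as pd

# Trimmed excerpts of the sample data (assumed names df_a, df_b).
df_a = pd.DataFrame({"Date": ["2008-09-04", "2008-10-04"], "OPP": [1000.0, 0.0]})
df_b = pd.DataFrame({"Date": ["2008-10-04", "2008-11-04"], "OPP": [-1631.25, -675.0]})
frames = [df_a, df_b]

# Index every OPP column by Date so rows are matched by date, not by position.
series = [f.set_index("Date")["OPP"] for f in frames]

# Union of all dates seen in any frame, in sorted order.
all_dates = sorted(set().union(*(s.index for s in series)))

# Pad each series with zeros for the dates it lacks, then add them up.
padded = [s.reindex(all_dates, fill_value=0.0) for s in series]
total = sum(padded).rename("OPP").rename_axis("Date").reset_index()
print(total)
```

After padding, every series has the same index, so plain addition is safe.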
Answer 0: (score: 2)
Following the approach in this answer, we can use functools.reduce for this. All that remains is a sum over axis=1:
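from functools import reduce
import pandas as pd

dfs = [df1, df2, df3]
# Merge each frame onto the running result by Date; how='left' keeps only the dates present in df1
df_final = reduce(lambda left, right: pd.merge(left, right, on='Date', how='left'), dfs)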
Which gives us:
Date OPP_x OPP_y OPP
0 2008-01-04 0.00 750.0 NaN
1 2008-02-04 0.00 0.0 NaN
2 2008-03-04 0.00 150.0 NaN
3 2008-04-04 0.00 600.0 NaN
4 2008-05-04 0.00 0.0 NaN
5 2008-06-04 0.00 0.0 NaN
6 2008-07-04 393.75 0.0 393.75
7 2008-08-04 -168.75 -250.0 -168.75
8 2008-09-04 -656.25 1000.0 -656.25
9 2008-10-04 -1631.25 0.0 -1631.25
Then we sum:
df_final.iloc[:, 1:].sum(axis=1)
0 750.0
1 0.0
2 150.0
3 600.0
4 0.0
5 0.0
6 787.5
7 -587.5
8 -312.5
9 -3262.5
dtype: float64
Or as a new column:
df_final['sum'] = df_final.iloc[:, 1:].sum(axis=1)
Date OPP_x OPP_y OPP sum
0 2008-01-04 0.00 750.0 NaN 750.0
1 2008-02-04 0.00 0.0 NaN 0.0
2 2008-03-04 0.00 150.0 NaN 150.0
3 2008-04-04 0.00 600.0 NaN 600.0
4 2008-05-04 0.00 0.0 NaN 0.0
5 2008-06-04 0.00 0.0 NaN 0.0
6 2008-07-04 393.75 0.0 393.75 787.5
7 2008-08-04 -168.75 -250.0 -168.75 -587.5
8 2008-09-04 -656.25 1000.0 -656.25 -312.5
9 2008-10-04 -1631.25 0.0 -1631.25 -3262.5
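Note that with how='left' the rows for 2008-11-04 and 2008-12-04, which exist only in the third frame, do not appear in the result. A possible variant, sketched here under the assumption that every date from every frame should be kept, uses how='outer' and renames each OPP column before merging so that no suffix clashes arise when many frames are merged. The names `df_a`, `df_b`, `df_c` and the `OPP_<i>` column names are illustrative, not part of the original answer.

```python
from functools import reduce

import pandas as pd

# Trimmed excerpts of the sample frames (assumed names).
df_a = pd.DataFrame({"Date": ["2008-09-04", "2008-10-04"], "OPP": [-656.25, -1631.25]})
df_b = pd.DataFrame({"Date": ["2008-09-04", "2008-10-04"], "OPP": [1000.0, 0.0]})
df_c = pd.DataFrame({"Date": ["2008-10-04", "2008-11-04"], "OPP": [-1631.25, -675.0]})

# Give each OPP column a unique name up front so merge never needs suffixes.
renamed = [
    f.rename(columns={"OPP": f"OPP_{i}"})
    for i, f in enumerate([df_a, df_b, df_c])
]

# how="outer" keeps every date that appears in any frame.
merged = reduce(
    lambda left, right: pd.merge(left, right, on="Date", how="outer"),
    renamed,
)
merged = merged.sort_values("Date").reset_index(drop=True)

# sum(axis=1) skips NaN by default, so missing dates contribute nothing.
merged["sum"] = merged.filter(like="OPP_").sum(axis=1)
print(merged[["Date", "sum"]])
```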
Answer 1: (score: 1)
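Use a list comprehension to turn each frame into a Series indexed by Date (a DatetimeIndex if Date holds datetimes), then join them with concat and sum: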
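dfs = [df1, df2]
# Index each frame by Date and keep only the OPP column as a Series
compr = [x.set_index('Date')['OPP'] for x in dfs]
# Align on the union of dates, sum across frames, and restore Date as a column
df_sum = pd.concat(compr, axis=1).sum(axis=1).reset_index(name='OPP')
print(df_sum)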
Date OPP
0 2008-01-04 750.00
1 2008-02-04 0.00
2 2008-03-04 150.00
3 2008-04-04 600.00
4 2008-05-04 0.00
5 2008-06-04 0.00
6 2008-07-04 393.75
7 2008-08-04 -418.75
8 2008-09-04 343.75
9 2008-10-04 -1631.25
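The same idea also covers frames whose date ranges differ, because concat with axis=1 aligns on the union of the indexes. A minimal sketch, assuming the Date column arrives as strings and should be parsed first; `df_b`, `df_c`, and the helper `opp_by_date` are illustrative names, not part of the original answer:

```python
import pandas as pd

# Trimmed excerpts of the sample frames with only partly overlapping dates.
df_b = pd.DataFrame({"Date": ["2008-09-04", "2008-10-04"], "OPP": [1000.0, 0.0]})
df_c = pd.DataFrame({"Date": ["2008-10-04", "2008-11-04"], "OPP": [-1631.25, -675.0]})


def opp_by_date(frame):
    """Return the OPP column as a Series indexed by parsed dates."""
    return frame.assign(Date=pd.to_datetime(frame["Date"])).set_index("Date")["OPP"]


# Dates missing from a frame become NaN, which sum(axis=1) ignores by default.
wide = pd.concat([opp_by_date(f) for f in (df_b, df_c)], axis=1)
total = wide.sum(axis=1).sort_index().reset_index(name="OPP")
print(total)
```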
Answer 2: (score: 1)
You can simply concat the frames, then groupby Date and sum:
(pd.concat((df1,df2,df3))
.groupby('Date', as_index=False)
.sum()
)
Output for the three sample DataFrames:
Date OPP
0 2008-01-04 750.0
1 2008-02-04 0.0
2 2008-03-04 150.0
3 2008-04-04 600.0
4 2008-05-04 0.0
5 2008-06-04 0.0
6 2008-07-04 787.5
7 2008-08-04 -587.5
8 2008-09-04 -312.5
9 2008-10-04 -3262.5
10 2008-11-04 -675.0
11 2008-12-04 0.0
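If the frames carry columns other than Date and OPP, it may be safer to select OPP explicitly before summing. The sketch below assumes that situation; the `Source` column and the names `df_b`, `df_c` are hypothetical and exist only to show that extra columns are left out of the sum.

```python
import pandas as pd

# Trimmed excerpts of the sample frames; "Source" is a hypothetical extra column.
df_b = pd.DataFrame({
    "Date": ["2008-09-04", "2008-10-04"],
    "OPP": [1000.0, 0.0],
    "Source": ["b", "b"],
})
df_c = pd.DataFrame({
    "Date": ["2008-10-04", "2008-11-04"],
    "OPP": [-1631.25, -675.0],
    "Source": ["c", "c"],
})

# Stack the rows of all frames; ignore_index avoids duplicate row labels.
stacked = pd.concat([df_b, df_c], ignore_index=True)

# Group by Date and sum only OPP, so other columns never enter the aggregation.
total = stacked.groupby("Date", as_index=False)["OPP"].sum()
print(total)
```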