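Summing DataFrames of different total length with overlapping indexes

Time: 2019-07-18 13:09:24

Tags: python pandas dataframe

I have many DataFrames of equal length and with identical datetime indexes: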

    Date        OPP
0   2008-01-04  0.0
1   2008-02-04  0.0
2   2008-03-04  0.0
3   2008-04-04  0.0
4   2008-05-04  0.0
5   2008-06-04  0.0
6   2008-07-04  393.75
7   2008-08-04  -168.75
8   2008-09-04  -656.25
9   2008-10-04  -1631.25


    Date        OPP
0   2008-01-04  750.0
1   2008-02-04  0.0
2   2008-03-04  150.0
3   2008-04-04  600.0
4   2008-05-04  0.0
5   2008-06-04  0.0
6   2008-07-04  0.0
7   2008-08-04  -250.0
8   2008-09-04  1000.0
9   2008-10-04  0.0

I need to create a single DataFrame that sums the OPP columns from all of these DataFrames. This is easy to do:

# Start from the Date column so the result is a DataFrame, not a Series
df3 = df1[["Date"]].copy()
df3["OPP"] = df1["OPP"] + df2["OPP"]

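This works as long as all DataFrames have the same length and the same date index.

How can I make it work even when these conditions are not met? What if I have another DataFrame like this: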

        Date      OPP
0 2008-07-04   393.75
1 2008-08-04  -168.75
2 2008-09-04  -656.25
3 2008-10-04 -1631.25
4 2008-11-04  -675.00
5 2008-12-04     0.00

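I could do this manually: find the df with the lowest start date and the df with the highest start date, then fill every df with all the dates and zeros so that I end up with dfs of equal length... and then simply sum them.

But is there a way to do this automatically in pandas?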
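
For reference, the manual padding described above could be sketched roughly as follows. This is only an illustration under assumptions: `df_a` and `df_b` are hypothetical miniature frames, `Date` is assumed to hold real datetimes, and each date is assumed to appear at most once per frame.

```python
import pandas as pd

# Hypothetical miniature frames with partly overlapping date ranges.
df_a = pd.DataFrame({"Date": pd.to_datetime(["2008-09-04", "2008-10-04"]),
                     "OPP": [-656.25, -1631.25]})
df_b = pd.DataFrame({"Date": pd.to_datetime(["2008-10-04", "2008-11-04"]),
                     "OPP": [-1631.25, -675.0]})
frames = [df_a, df_b]

# One Series per frame, indexed by Date.
indexed = [f.set_index("Date")["OPP"] for f in frames]

# Build the union of every date seen in any frame.
all_dates = indexed[0].index
for s in indexed[1:]:
    all_dates = all_dates.union(s.index)

# Pad each Series to the full date range, using zero for missing dates.
padded = [s.reindex(all_dates, fill_value=0.0) for s in indexed]

# With identical indexes, a plain element-wise sum now works.
total = sum(padded)
result = total.rename_axis("Date").reset_index(name="OPP")
print(result)
```

The padding step is what makes the lengths equal; everything after it is the same simple addition as in the equal-length case.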

3 answers:

Answer 0 (score: 2)

Following the approach in this answer, we can use functools.reduce for this.

All that remains is a sum over axis=1.

from functools import reduce

import pandas as pd

dfs = [df1, df2, df3]

# how='left' keeps only the dates that are present in the first frame (df1)
df_final = reduce(lambda left, right: pd.merge(left, right, on='Date', how='left'), dfs)

Which gives us:

         Date    OPP_x   OPP_y      OPP
0  2008-01-04     0.00   750.0      NaN
1  2008-02-04     0.00     0.0      NaN
2  2008-03-04     0.00   150.0      NaN
3  2008-04-04     0.00   600.0      NaN
4  2008-05-04     0.00     0.0      NaN
5  2008-06-04     0.00     0.0      NaN
6  2008-07-04   393.75     0.0   393.75
7  2008-08-04  -168.75  -250.0  -168.75
8  2008-09-04  -656.25  1000.0  -656.25
9  2008-10-04 -1631.25     0.0 -1631.25

Then we sum:

df_final.iloc[:, 1:].sum(axis=1)

0     750.0
1       0.0
2     150.0
3     600.0
4       0.0
5       0.0
6     787.5
7    -587.5
8    -312.5
9   -3262.5
dtype: float64

Or as a new column:

df_final['sum'] = df_final.iloc[:, 1:].sum(axis=1)

         Date    OPP_x   OPP_y      OPP     sum
0  2008-01-04     0.00   750.0      NaN   750.0
1  2008-02-04     0.00     0.0      NaN     0.0
2  2008-03-04     0.00   150.0      NaN   150.0
3  2008-04-04     0.00   600.0      NaN   600.0
4  2008-05-04     0.00     0.0      NaN     0.0
5  2008-06-04     0.00     0.0      NaN     0.0
6  2008-07-04   393.75     0.0   393.75   787.5
7  2008-08-04  -168.75  -250.0  -168.75  -587.5
8  2008-09-04  -656.25  1000.0  -656.25  -312.5
9  2008-10-04 -1631.25     0.0 -1631.25 -3262.5
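
Note that the left merge keeps only the dates of the first frame, so 2008-11-04 and 2008-12-04 from the third frame do not appear in the result. A possible variation, sketched below with hypothetical miniature frames, uses `how="outer"` instead. Renaming each `OPP` column first is my own addition, so that the `_x`/`_y` suffixes cannot collide when more than three frames are merged.

```python
from functools import reduce

import pandas as pd

# Hypothetical miniature frames standing in for df1, df2 and df3.
df1 = pd.DataFrame({"Date": ["2008-09-04", "2008-10-04"], "OPP": [-656.25, -1631.25]})
df2 = pd.DataFrame({"Date": ["2008-09-04", "2008-10-04"], "OPP": [1000.0, 0.0]})
df3 = pd.DataFrame({"Date": ["2008-10-04", "2008-11-04"], "OPP": [-1631.25, -675.0]})
dfs = [df1, df2, df3]

# Give every OPP column a unique name before merging.
renamed = [df.rename(columns={"OPP": f"OPP_{i}"}) for i, df in enumerate(dfs)]

# how="outer" keeps every date that appears in any of the frames.
merged = reduce(lambda left, right: pd.merge(left, right, on="Date", how="outer"), renamed)
merged = merged.sort_values("Date").reset_index(drop=True)

# sum(axis=1) skips NaN by default, so a missing date contributes nothing.
merged["sum"] = merged.filter(like="OPP_").sum(axis=1)
print(merged[["Date", "sum"]])
```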

Answer 1 (score: 1)

Use a list comprehension to create Series with a DatetimeIndex, then join them together with concat and sum:

dfs = [df1, df2]

compr = [x.set_index('Date')['OPP'] for x in dfs]
df1 = pd.concat(compr, axis=1).sum(axis=1).reset_index(name='OPP')
print(df1)
         Date      OPP
0  2008-01-04   750.00
1  2008-02-04     0.00
2  2008-03-04   150.00
3  2008-04-04   600.00
4  2008-05-04     0.00
5  2008-06-04     0.00
6  2008-07-04   393.75
7  2008-08-04  -418.75
8  2008-09-04   343.75
9  2008-10-04 -1631.25
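
The output above covers only the two equal-length frames. As a rough sketch of how the same idea might extend to a frame with a different date range, the example below uses hypothetical miniature frames (`df_a`, `df_b`, `df_c`) and assumes `Date` holds real datetimes. The `keys` labels are only there to make the intermediate table readable.

```python
import pandas as pd

# Hypothetical miniature frames; df_c starts later and runs longer.
df_a = pd.DataFrame({"Date": pd.to_datetime(["2008-09-04", "2008-10-04"]),
                     "OPP": [-656.25, -1631.25]})
df_b = pd.DataFrame({"Date": pd.to_datetime(["2008-09-04", "2008-10-04"]),
                     "OPP": [1000.0, 0.0]})
df_c = pd.DataFrame({"Date": pd.to_datetime(["2008-10-04", "2008-11-04", "2008-12-04"]),
                     "OPP": [-1631.25, -675.0, 0.0]})
frames = [df_a, df_b, df_c]

# One Series per frame, indexed by Date, so concat can align on the dates.
series = [f.set_index("Date")["OPP"] for f in frames]

# axis=1 aligns on the union of all dates; dates missing from a frame become NaN.
aligned = pd.concat(series, axis=1, keys=["a", "b", "c"]).sort_index()

# sum(axis=1) ignores NaN, which has the same effect as padding with zeros.
total = aligned.sum(axis=1).rename_axis("Date").reset_index(name="OPP")
print(total)
```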

Answer 2 (score: 1)

You can simply concat the frames, then groupby the date and sum:

(pd.concat((df1,df2,df3))
   .groupby('Date', as_index=False)
   .sum()
)

Output for the three example DataFrames:

          Date     OPP
0   2008-01-04   750.0
1   2008-02-04     0.0
2   2008-03-04   150.0
3   2008-04-04   600.0
4   2008-05-04     0.0
5   2008-06-04     0.0
6   2008-07-04   787.5
7   2008-08-04  -587.5
8   2008-09-04  -312.5
9   2008-10-04 -3262.5
10  2008-11-04  -675.0
11  2008-12-04     0.0
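
A self-contained version of this approach might look like the sketch below. The frames are hypothetical miniature stand-ins, and two details are my own assumptions rather than part of the answer: `Date` is converted with `pd.to_datetime` in case it is stored as strings, and only the `OPP` column is summed so that any other columns would be left out.

```python
import pandas as pd

# Hypothetical miniature frames with dates stored as strings.
df_a = pd.DataFrame({"Date": ["2008-09-04", "2008-10-04"], "OPP": [-656.25, -1631.25]})
df_b = pd.DataFrame({"Date": ["2008-09-04", "2008-10-04"], "OPP": [1000.0, 0.0]})
df_c = pd.DataFrame({"Date": ["2008-10-04", "2008-11-04"], "OPP": [-1631.25, -675.0]})

# Stack all rows vertically; frames of different length need no padding here.
stacked = pd.concat([df_a, df_b, df_c], ignore_index=True)

# Convert to datetimes so the group keys sort chronologically.
stacked["Date"] = pd.to_datetime(stacked["Date"])

# Rows that share a date collapse into one row holding their sum.
total = stacked.groupby("Date", as_index=False)["OPP"].sum()
print(total)
```

Because the rows are stacked rather than aligned side by side, no NaN values are ever created, and a date that occurs in only one frame simply keeps its own value.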