Pandas: subtract DataFrames where the columns match

Time: 2020-07-28 22:36:17

Tags: python-3.x pandas dataframe

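I have two DataFrames, and I want to compute the difference between counter1 and counter2, ideally stored in a new column such as "diff".

This is what I have tried so far: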

import pandas as pd
import numpy as np


file = ('data.csv')

df = pd.read_csv(file)
df = df[['Release', 'Created Date', 'Finished Date']]


x = df.groupby(['Release', 'Created Date'])['Created Date'].size().to_frame('size1')
y = df.groupby(['Release', 'Finished Date'])['Finished Date'].count().to_frame('size2')

x['counter1'] = x.groupby('Release').size1.cumsum().to_frame().sort_values('Created Date')
y['counter2'] = y.groupby('Release').size2.cumsum().to_frame().sort_values('Finished Date')
print(x)
print(y)

Output for x:
                                            size1  counter1
Release                       Created Date                 
Sony                          2020-07-09        1         1
                              2020-07-14        1         2
Sega                          2020-06-30        1         1
                              2020-07-09        1         2
                              2020-07-13        1         3
                              2020-07-14        1         4
                              2020-07-15        2         6
                              2020-07-17        2         8
                              2020-07-21        1         9
Nintendo                      2020-06-29        1         1
                              2020-07-01        2         3
                              2020-07-06        1         4


Output for y:

                                             size2  counter2
Release                       Finished Date                 
Sony                          2020-07-17         1         1
                              2020-07-20         1         2
Sony                          2020-07-03         1         1
                              2020-07-13         1         2
                              2020-07-17         1         3
                              2020-07-20         1         4
                              2020-07-23         3         7
                              2020-07-24         1         8
                              2020-07-28         1         9
Nintendo                      2020-07-09         1         1
                              2020-07-10         1         2
                              2020-07-15         1         3

This was my attempt, but the result is very confusing and definitely not correct:

t = x['counter1'] - y['counter2']

I had to remove the output because the post contained too much code compared to text, but the output was strange in any case.

Edit:

print(df)

output:

    Release   Created Date  Finished Date
0   Sony      2020-07-21    2020-07-23
1   Sony      2020-07-17    2020-07-28
2   Sony      2020-07-17    2020-07-23
3   Sony      2020-07-15    2020-07-17
4   Sony      2020-07-15    2020-07-24
..  ...       ...           ...
76  Sony      2020-06-02    2020-06-04
77  Sega      2020-06-01    2020-06-12
79  Sega      2020-06-01    2020-07-22
80  Sony      2020-06-01    2020-06-16
81  Nintendo  2020-06-01    2020-07-16

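The goal is to create a dataset for a timeline chart with dates on the x-axis. When a release is created, the value on the y-axis should go up, and when it is finished, the value on the y-axis should go down.

Maybe I am overcomplicating this.

Update:

The help I received from a community member got me to my goal, thank you very much. Now I would like to build on it and create a multi-timeline chart that shows several releases in the same graph.

Here is the working solution for a single timeline chart.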


# This is how I managed to get it working for a single release, but it will become a problem later when I want all the releases.
df = df[df['Release'].str.contains("Sony")]


deposits = pd.Series(df.groupby('Created').size())
withdrawals = pd.Series(df.groupby('Finished').size())
balance = pd.DataFrame({'net_movements': deposits.sub(withdrawals, fill_value=0)})

balance = balance.assign(active=balance.net_movements.cumsum())
balance = balance.rename(columns={"active": "Sony"})

print(balance)

Output:

            net_movements  Sony
2020-06-01            3.0   3.0
2020-06-02            2.0   5.0
2020-06-03            2.0   7.0
2020-06-04           -1.0   6.0
2020-06-05            0.0   6.0
2020-06-08            1.0   7.0

We can drop net_movements to get the final format:

balance = balance.drop(['net_movements'], axis=1)
print(balance)

             Sony
2020-06-01   3.0
2020-06-02   5.0
2020-06-03   7.0
2020-06-04   6.0
2020-06-05   6.0
2020-06-08   7.0

This solves my problem of showing a single release. Now I want to build on it and show all releases in the same chart.

Here is my attempt:


deposits = pd.Series(df.groupby(['Release', 'Created']).size())
print(deposits)

output: (shortened down)

Release                        Created   
Sega                           2020-06-01    1
                               2020-06-04    1
                               2020-07-14    1
Nintendo                       2020-06-01    3
                               2020-06-02    2
                               2020-06-03    2

withdrawals = pd.Series(df.groupby(['Release', 'Finished']).size())
print(withdrawals)

Release                        Finished  
Sony                           2020-06-12    1
                               2020-06-16    2
                               2020-06-18    1
Nintendo                       2020-06-04    1
                               2020-06-05    1
                               2020-06-16    2

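This is where it gets complicated. Not only are the columns all over the place, but the active column does not reset when it reaches a new release; it just keeps ticking across releases.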

balance = balance.assign(active=balance.net_movements.cumsum())
print(balance)

Output:

                                                         net_movements  active
Release                       Created    Finished                         
Sony                          2020-06-01 2020-06-12              1       1
                                         2020-06-16              2       3
                                         2020-06-18              0       3
Nintendo                      2020-06-04 2020-06-12             -1       2
                                         2020-06-16              1       3
                                         2020-06-18              0       3

Desired format (with dummy values):

             Sony     Nintendo
2020-06-01   3.0           4.0
2020-06-02   5.0           5.0
2020-06-03   7.0           2.0
2020-06-04   6.0           4.0
2020-06-05   6.0           4.0
2020-06-08   7.0           7.0

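It is hard to ask the right question with as little information as possible, and this question ended up rather long, but I hope I have explained my goal and my problem well.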

1 Answer:

Answer 0 (score: 2)

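Basically, you are looking for the count of active releases at any given point in time. I would start by creating a timeline without any data, and then treat "Created Date" and "Finished Date" as deposits to and withdrawals from a balance account.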

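# Assumes 'Created Date' and 'Finished Date' have already been parsed as datetimes,
# e.g. pd.read_csv(file, parse_dates=['Created Date', 'Finished Date']).
timeline = pd.date_range(df['Created Date'].min(), df['Finished Date'].max(), freq='D')
deposits = pd.Series(df.groupby('Created Date').size())
withdrawals = pd.Series(df.groupby('Finished Date').size())
# Net change in the number of active releases on each date that has any activity.
balance = pd.DataFrame({'net_movements': deposits.sub(withdrawals, fill_value=0)})
# Add the days without activity, then accumulate the running total.
balance = balance.reindex(timeline, fill_value=0)
balance = balance.assign(active=balance.net_movements.cumsum())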
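
The same deposits/withdrawals idea can probably be extended to one column per release, which is the wide format asked for in the update. The following is only a minimal sketch, not part of the original answer: the sample rows are made up for illustration, and it assumes the column names 'Release', 'Created Date' and 'Finished Date' with both date columns already parsed as datetimes.

```python
import pandas as pd

# Made-up sample rows, for illustration only (not the asker's data).
df = pd.DataFrame({
    'Release': ['Sony', 'Sony', 'Sony', 'Nintendo', 'Nintendo'],
    'Created Date': pd.to_datetime(
        ['2020-06-01', '2020-06-01', '2020-06-03', '2020-06-02', '2020-06-04']),
    'Finished Date': pd.to_datetime(
        ['2020-06-03', '2020-06-05', '2020-06-05', '2020-06-04', '2020-06-05']),
})

# Empty daily timeline spanning the whole period.
timeline = pd.date_range(df['Created Date'].min(), df['Finished Date'].max(), freq='D')

# Count events per (date, release), then pivot the releases into columns.
deposits = df.groupby(['Created Date', 'Release']).size().unstack('Release', fill_value=0)
withdrawals = df.groupby(['Finished Date', 'Release']).size().unstack('Release', fill_value=0)

# Subtract with alignment on both date and release. fill_value=0 covers cells
# present on one side only; fillna(0) covers cells missing on both sides.
net_movements = deposits.sub(withdrawals, fill_value=0).fillna(0)

# Add the days without activity, then take a running total per release column.
active = net_movements.reindex(timeline, fill_value=0).cumsum()
print(active)
```

Because the cumulative sum runs down each column separately, the count for one release does not carry over into the next. Calling active.plot() on the result should then draw one line per release on a shared date axis, assuming matplotlib is installed.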