I have two DataFrames and I want to subtract counter1 from counter2, ideally by adding a column such as "diff".
This is what I have tried so far:
import pandas as pd
import numpy as np
file = ('data.csv')
df = pd.read_csv(file)
df = df[['Release', 'Created Date', 'Finished Date']]
x = df.groupby(['Release', 'Created Date'])['Created Date'].size().to_frame('size1')
y = df.groupby(['Release', 'Finished Date'])['Finished Date'].count().to_frame('size2')
x['counter1'] = x.groupby('Release').size1.cumsum().to_frame().sort_values('Created Date')
y['counter2'] = y.groupby('Release').size2.cumsum().to_frame().sort_values('Finished Date')
print(x)
print(y)
Output for x:
                        size1  counter1
Release  Created Date
Sony     2020-07-09         1         1
         2020-07-14         1         2
Sega     2020-06-30         1         1
         2020-07-09         1         2
         2020-07-13         1         3
         2020-07-14         1         4
         2020-07-15         2         6
         2020-07-17         2         8
         2020-07-21         1         9
Nintendo 2020-06-29         1         1
         2020-07-01         2         3
         2020-07-06         1         4
Output for y:
                         size2  counter2
Release  Finished Date
Sony     2020-07-17          1         1
         2020-07-20          1         2
Sony     2020-07-03          1         1
         2020-07-13          1         2
         2020-07-17          1         3
         2020-07-20          1         4
         2020-07-23          3         7
         2020-07-24          1         8
         2020-07-28          1         9
Nintendo 2020-07-09          1         1
         2020-07-10          1         2
         2020-07-15          1         3
This is my attempt, but the result is very confusing and certainly not correct:
t = x['counter1'] - y['counter2']
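I had to trim the output because the post contained too much code compared to text, but what is left of the output still looks strange.
Edit: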
print(df)
output:
     Release Created Date Finished Date
0       Sony   2020-07-21    2020-07-23
1       Sony   2020-07-17    2020-07-28
2       Sony   2020-07-17    2020-07-23
3       Sony   2020-07-15    2020-07-17
4       Sony   2020-07-15    2020-07-24
..       ...          ...           ...
76      Sony   2020-06-02    2020-06-04
77      Sega   2020-06-01    2020-06-12
79      Sega   2020-06-01    2020-07-22
80      Sony   2020-06-01    2020-06-16
81  Nintendo   2020-06-01    2020-07-16
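The goal is to build a dataset for a timeline chart with the date on the x-axis: when a release is created the line should go up on the y-axis, and when it is finished the line should go down.
Maybe I am overcomplicating this.
Update:
The help I received from a community member got me to my goal, thank you very much. Now I would like to build on that and create a multi-timeline chart that shows several releases in the same plot.
Here is the working solution for a single timeline chart.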
# This is how I managed to get it working for a single release, but it will become a problem later when I want all the releases.
df = df[df['Release'].str.contains("Sony")]
deposits = pd.Series(df.groupby('Created').size())
withdrawals = pd.Series(df.groupby('Finished').size())
balance = pd.DataFrame({'net_movements': deposits.sub(withdrawals, fill_value=0)})
balance = balance.assign(active=balance.net_movements.cumsum())
balance = balance.rename(columns={"active": "Sony"})
print(balance)
Output:
            net_movements  Sony
2020-06-01            3.0   3.0
2020-06-02            2.0   5.0
2020-06-03            2.0   7.0
2020-06-04           -1.0   6.0
2020-06-05            0.0   6.0
2020-06-08            1.0   7.0
We can drop net_movements to get the final format:
balance = balance.drop(['net_movements'], axis=1)
print(balance)
            Sony
2020-06-01   3.0
2020-06-02   5.0
2020-06-03   7.0
2020-06-04   6.0
2020-06-05   6.0
2020-06-08   7.0
This solves my problem of displaying a single release. Now I would like to build on it and show all releases in the same chart.
This is my attempt:
deposits = pd.Series(df.groupby(['Release', 'Created']).size())
print(deposits)
Output (shortened):
Release   Created
Sega      2020-06-01    1
          2020-06-04    1
          2020-07-14    1
Nintendo  2020-06-01    3
          2020-06-02    2
          2020-06-03    2
withdrawals = pd.Series(df.groupby(['Release', 'Finished']).size())
print(withdrawals)
Release   Finished
Sony      2020-06-12    1
          2020-06-16    2
          2020-06-18    1
Nintendo  2020-06-04    1
          2020-06-05    1
          2020-06-16    2
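Now this is where it gets complicated. Not only are the columns all over the place, but the active column does not reset when it reaches a new release; it just keeps counting across releases.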
balance = balance.assign(active=balance.net_movements.cumsum())
print(balance)
Output:
                                 net_movements  active
Release   Created     Finished
Sony      2020-06-01  2020-06-12             1       1
                      2020-06-16             2       3
                      2020-06-18             0       3
Nintendo  2020-06-04  2020-06-12            -1       2
                      2020-06-16             1       3
                      2020-06-18             0       3
Desired format (with dummy values):
            Sony  Nintendo
2020-06-01   3.0       4.0
2020-06-02   5.0       5.0
2020-06-03   7.0       2.0
2020-06-04   6.0       4.0
2020-06-05   6.0       4.0
2020-06-08   7.0       7.0
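It is hard to ask the right question with as little information as possible, and this one has ended up a bit long, but I hope I have explained my goal and my problem well.
Answer 0 (score: 2)
Basically, you are looking for the count of active releases at any given point in time. I would start by creating a timeline without any data, and then treat "Created Date" and "Finished Date" as deposits and withdrawals on a balance account.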
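# Parse the dates so they line up with the DatetimeIndex built below
df['Created Date'] = pd.to_datetime(df['Created Date'])
df['Finished Date'] = pd.to_datetime(df['Finished Date'])
# Empty daily timeline from the first creation to the last completion
timeline = pd.date_range(df['Created Date'].min(), df['Finished Date'].max(), freq='D')
deposits = pd.Series(df.groupby('Created Date').size())
withdrawals = pd.Series(df.groupby('Finished Date').size())
balance = pd.DataFrame({'net_movements': deposits.sub(withdrawals, fill_value=0)})
balance = balance.reindex(timeline, fill_value=0)
# Running total of net movements = number of releases active on each day
balance = balance.assign(active=balance.net_movements.cumsum())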
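For the follow-up about showing several releases in one chart, the same deposit/withdrawal idea can probably be applied to every release at once by counting per (date, release) pair instead of per date. The following is only a sketch under a few assumptions: the helper name active_per_release is made up for illustration, the column names default to the ones from the first snippet (pass 'Created' / 'Finished' if your frame uses those), and pd.crosstab is swapped in for groupby(...).size() so the result comes out with one column per release.

```python
import pandas as pd


def active_per_release(df, release_col="Release",
                       created_col="Created Date",
                       finished_col="Finished Date"):
    """Hypothetical helper: active count per day, one column per release."""
    created = pd.to_datetime(df[created_col])
    finished = pd.to_datetime(df[finished_col])

    # Count creations and completions per (date, release), already in wide form
    deposits = pd.crosstab(created, df[release_col])
    withdrawals = pd.crosstab(finished, df[release_col])

    # Net movement per day and release; cells missing on one side count as 0
    net = deposits.sub(withdrawals, fill_value=0).fillna(0)

    # Add the days without any movement, then accumulate each column separately
    timeline = pd.date_range(created.min(), finished.max(), freq="D")
    net = net.reindex(timeline, fill_value=0)
    return net.cumsum()


# Made-up sample rows, only to show the shape of the result
sample = pd.DataFrame({
    "Release": ["Sony", "Sony", "Nintendo", "Sony"],
    "Created Date": ["2020-06-01", "2020-06-01", "2020-06-02", "2020-06-03"],
    "Finished Date": ["2020-06-03", "2020-06-05", "2020-06-04", "2020-06-05"],
})
print(active_per_release(sample))
```

Because the cumulative sum runs down each column independently, the count restarts for every release instead of carrying over from one release to the next. With matplotlib installed, calling .plot() on the returned frame should draw one line per release.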