I have a sample dataframe like this:
import pandas as pd
df = pd.DataFrame({"id": [0]*5 + [1]*5,
"time": ['2015-01-01', '2015-01-03', '2015-01-04', '2015-01-08', '2015-01-10', '2015-02-02', '2015-02-04', '2015-02-06', '2015-02-11', '2015-02-13'],
'hit': [0,3,8,2,5, 6,12,0,7,3]})
df.time = df.time.astype('datetime64[ns]')
df = df[['id', 'time', 'hit']]
df
which outputs:
id time hit
0 0 2015-01-01 0
1 0 2015-01-03 3
2 0 2015-01-04 8
3 0 2015-01-08 2
4 0 2015-01-10 5
5 1 2015-02-02 6
6 1 2015-02-04 12
7 1 2015-02-06 0
8 1 2015-02-11 7
9 1 2015-02-13 3
Then I do a groupby on time with a daily frequency:
df.groupby(['id', pd.Grouper(key='time', freq='1D')]).hit.sum().to_frame()
Result:
hit
id time
0 2015-01-01 0
2015-01-03 3
2015-01-04 8
2015-01-08 2
2015-01-10 5
1 2015-02-02 6
2015-02-04 12
2015-02-06 0
2015-02-11 7
2015-02-13 3
However, I want to keep a row for every day, even when the hit value is 0, and for each id count the days since that id's first day. My desired output:
hit day_since
id time
0 2015-01-01 0 1
2015-01-02 0 2
2015-01-03 3 3
2015-01-04 8 4
2015-01-05 0 5
2015-01-06 0 6
2015-01-07 0 7
1 2015-02-02 6 1
2015-02-03 0 2
2015-02-04 12 3
2015-02-05 0 4
2015-02-06 0 5
2015-02-07 0 6
2015-02-08 0 7
cumcount on its own doesn't work, because it simply numbers the existing rows within each group; what I want instead is the consecutive calendar-day difference from each group's first date, roughly like the sketch below.
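To make the intent concrete, the kind of per-group date arithmetic I have in mind looks roughly like this (just a sketch on the original df, before any missing days are filled in; day_since is simply the name I'd like for the new column):

# days elapsed since each id's first date, counted from 1 (sketch only)
first_day = df.groupby('id')['time'].transform('min')
df['day_since'] = (df['time'] - first_day).dt.days + 1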
Does anyone have any ideas?
Answer 0 (score: 1)
After the groupby:
# start again from the grouped frame built in the question
df = df.groupby(['id', pd.Grouper(key='time', freq='1D')]).hit.sum().to_frame()
# move 'id' back to a column and keep 'time' as the index, so we can resample
df = df.reset_index(level=0)

# container for the resulting dataframe
dfs = pd.DataFrame()

for i in df.id.unique():
    # take one id's hit series and upsample it to daily frequency
    chunk = df.loc[df.id == i, 'hit']
    chunk = chunk.resample('1D').asfreq()
    # back to a dataframe; the newly created days are NaN, so fill them with 0
    chunk = chunk.to_frame('hit').reset_index().fillna(0)
    chunk['hit'] = chunk['hit'].astype(int)
    chunk['id'] = i
    # days are now consecutive within the id, so the row number is the day count
    chunk['day_since'] = chunk.groupby('id').cumcount() + 1
    # accumulate the per-id dataframes vertically, one by one
    dfs = pd.concat([dfs, chunk], axis=0, ignore_index=True)

dfs = dfs.set_index(['id', 'time'])
You will get:
hit day_since
id time
0 2015-01-01 0 1
2015-01-02 0 2
2015-01-03 3 3
2015-01-04 8 4
2015-01-05 0 5
2015-01-06 0 6
2015-01-07 0 7
2015-01-08 2 8
2015-01-09 0 9
2015-01-10 5 10
1 2015-02-02 6 1
2015-02-03 0 2
2015-02-04 12 3
2015-02-05 0 4
2015-02-06 0 5
2015-02-07 0 6
2015-02-08 0 7
2015-02-09 0 8
2015-02-10 0 9
2015-02-11 7 10
2015-02-12 0 11
2015-02-13 3 12
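For what it's worth, a more compact variant of the same idea should work on reasonably recent pandas versions by chaining groupby with resample; treat it as a sketch (starting again from the original df in the question) rather than a drop-in replacement for the loop above:

# upsample each id to daily frequency; sum() turns the empty days into 0
out = (df.set_index('time')
         .groupby('id')['hit']
         .resample('1D')
         .sum()
         .to_frame())
# days are consecutive per id, so the row number within the id is the day count
out['day_since'] = out.groupby(level='id').cumcount() + 1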