I have a DataFrame like this:
Account timestamp no_of_transactions transaction_value
A 2016-07-26 13:43:29 2 50
B 2016-07-27 14:44:29 3 40
A 2016-07-26 13:33:29 1 15
A 2016-07-27 13:56:29 4 30
B 2016-07-26 14:33:29 7 80
C 2016-07-27 13:23:29 5 10
C 2016-07-27 13:06:29 3 10
A 2016-07-26 14:43:29 4 22
B 2016-07-27 13:43:29 1 11
I want to compute the velocity of no_of_transactions and transaction_value, i.e. the total no_of_transactions and transaction_value per account per hour.
For example, account A's minimum timestamp is 2016-07-26 13:33:29. I want to sum the values falling between 2016-07-26 13:33:29 and 2016-07-26 14:33:29, then find the next available minimum timestamp (2016-07-26 14:43:29 in this case) and compute the same sums over the next 1-hour window, and so on for every account.
After getting the sums over the 1-hour windows, how do I assign those values back to the actual DataFrame as two new columns? For example, the actual df would look like this:
A 2016-07-26 13:43:29 2 50 5 116
B 2016-07-27 14:44:29 3 40 3 40
A 2016-07-26 13:33:29 1 15 5 116
A 2016-07-27 13:56:29 4 30 4 30
That is, the window totals are simply attached to the actual rows whose timestamps fall in that window.
How can I do this efficiently, so it doesn't take a long time to run?
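For reproducing the answer below, the sample can be loaded into a frame like this (a setup sketch; the data is copied verbatim from the question):

```python
import io

import pandas as pd

# Sample data from the question, as CSV text.
data = """Account,timestamp,no_of_transactions,transaction_value
A,2016-07-26 13:43:29,2,50
B,2016-07-27 14:44:29,3,40
A,2016-07-26 13:33:29,1,15
A,2016-07-27 13:56:29,4,30
B,2016-07-26 14:33:29,7,80
C,2016-07-27 13:23:29,5,10
C,2016-07-27 13:06:29,3,10
A,2016-07-26 14:43:29,4,22
B,2016-07-27 13:43:29,1,11"""

# parse_dates makes the timestamp column datetime64, which the
# groupby/Grouper solutions below require.
df = pd.read_csv(io.StringIO(data), parse_dates=['timestamp'])
```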
Answer 0 (score: 4)
I think you need groupby with sum, combined with pandas.Grouper:
df1 = df.groupby(['Account', pd.Grouper(key='timestamp', freq='H')]).sum().reset_index()
print (df1)
Account timestamp no_of_transactions transaction_value
0 A 2016-07-26 13:00:00 3 65
1 A 2016-07-26 14:00:00 4 22
2 A 2016-07-27 13:00:00 4 30
3 B 2016-07-26 14:00:00 7 80
4 B 2016-07-27 13:00:00 1 11
5 B 2016-07-27 14:00:00 3 40
6 C 2016-07-27 13:00:00 8 20
Another solution with hour precision, using floor:
df1 = df.groupby(['Account', df['timestamp'].dt.floor('h')]).sum().reset_index()
print (df1)
Account timestamp no_of_transactions transaction_value
0 A 2016-07-26 13:00:00 3 65
1 A 2016-07-26 14:00:00 4 22
2 A 2016-07-27 13:00:00 4 30
3 B 2016-07-26 14:00:00 7 80
4 B 2016-07-27 13:00:00 1 11
5 B 2016-07-27 14:00:00 3 40
6 C 2016-07-27 13:00:00 8 20
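As a side note (not part of the original answer), if clock-hour bins are acceptable, GroupBy.transform can broadcast those sums straight onto the original rows without a reset_index/merge round trip. For this particular data the results happen to match the anchored-window output shown at the end, though in general clock-hour bins differ from windows anchored at each account's first timestamp:

```python
import pandas as pd

# Sample frame reconstructed from the question.
df = pd.DataFrame({
    'Account': ['A', 'B', 'A', 'A', 'B', 'C', 'C', 'A', 'B'],
    'timestamp': pd.to_datetime([
        '2016-07-26 13:43:29', '2016-07-27 14:44:29', '2016-07-26 13:33:29',
        '2016-07-27 13:56:29', '2016-07-26 14:33:29', '2016-07-27 13:23:29',
        '2016-07-27 13:06:29', '2016-07-26 14:43:29', '2016-07-27 13:43:29']),
    'no_of_transactions': [2, 3, 1, 4, 7, 5, 3, 4, 1],
    'transaction_value': [50, 40, 15, 30, 80, 10, 10, 22, 11],
})

# Group by account and clock hour, then broadcast each group's sums
# back to every row of that group with transform.
g = df.groupby(['Account', df['timestamp'].dt.floor('h')])
sums = g[['no_of_transactions', 'transaction_value']].transform('sum')
df[['sum_no_of_transactions', 'sum_transaction_value']] = sums
```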
Edit per comment (2-hour windows):
df2 = df.groupby(['Account', pd.Grouper(key='timestamp', freq='2H')]).sum().reset_index()
print (df2)
Account timestamp no_of_transactions transaction_value
0 A 2016-07-26 12:00:00 3 65
1 A 2016-07-26 14:00:00 4 22
2 A 2016-07-27 12:00:00 4 30
3 B 2016-07-26 14:00:00 7 80
4 B 2016-07-27 12:00:00 1 11
5 B 2016-07-27 14:00:00 3 40
6 C 2016-07-27 12:00:00 8 20
Edit:
First create a DataFrame of hourly window-start datetimes per account:
def f(x):
    # depending on the data, adding the trailing 1h window may be unnecessary
    rng = pd.date_range(x.index.min(), x.index.max() + pd.Timedelta(1, unit='h'), freq='h')
    return pd.Series(rng)
df2 = (df.set_index('timestamp')
.groupby('Account')
.apply(f)
.reset_index(level=1, drop=True)
.reset_index(name='timestamp1'))
print (df2)
Account timestamp1
0 A 2016-07-26 13:33:29
1 A 2016-07-26 14:33:29
2 A 2016-07-26 15:33:29
3 A 2016-07-26 16:33:29
4 A 2016-07-26 17:33:29
5 A 2016-07-26 18:33:29
6 A 2016-07-26 19:33:29
7 A 2016-07-26 20:33:29
8 A 2016-07-26 21:33:29
9 A 2016-07-26 22:33:29
10 A 2016-07-26 23:33:29
11 A 2016-07-27 00:33:29
12 A 2016-07-27 01:33:29
13 A 2016-07-27 02:33:29
14 A 2016-07-27 03:33:29
15 A 2016-07-27 04:33:29
16 A 2016-07-27 05:33:29
17 A 2016-07-27 06:33:29
18 A 2016-07-27 07:33:29
19 A 2016-07-27 08:33:29
20 A 2016-07-27 09:33:29
21 A 2016-07-27 10:33:29
22 A 2016-07-27 11:33:29
23 A 2016-07-27 12:33:29
24 A 2016-07-27 13:33:29
25 A 2016-07-27 14:33:29
26 B 2016-07-26 14:33:29
...
Then attach them to the original df with merge_asof:
df_ = df.sort_values('timestamp')
df2 = df2.sort_values('timestamp1')
df3 = pd.merge_asof(df_, df2, by='Account', left_on='timestamp', right_on='timestamp1')
print (df3)
Account timestamp no_of_transactions transaction_value \
0 A 2016-07-26 13:33:29 1 15
1 A 2016-07-26 13:43:29 2 50
2 B 2016-07-26 14:33:29 7 80
3 A 2016-07-26 14:43:29 4 22
4 C 2016-07-27 13:06:29 3 10
5 C 2016-07-27 13:23:29 5 10
6 B 2016-07-27 13:43:29 1 11
7 A 2016-07-27 13:56:29 4 30
8 B 2016-07-27 14:44:29 3 40
timestamp1
0 2016-07-26 13:33:29
1 2016-07-26 13:33:29
2 2016-07-26 14:33:29
3 2016-07-26 14:33:29
4 2016-07-27 13:06:29
5 2016-07-27 13:06:29
6 2016-07-27 13:33:29
7 2016-07-27 13:33:29
8 2016-07-27 14:33:29
Aggregate with sum:
df4 = df3.groupby(['Account','timestamp1'], as_index=False).sum()
print (df4)
Account timestamp1 no_of_transactions transaction_value
0 A 2016-07-26 13:33:29 3 65
1 A 2016-07-26 14:33:29 4 22
2 A 2016-07-27 13:33:29 4 30
3 B 2016-07-26 14:33:29 7 80
4 B 2016-07-27 13:33:29 1 11
5 B 2016-07-27 14:33:29 3 40
6 C 2016-07-27 13:06:29 8 20
If you want to add the columns to the original DataFrame, first use GroupBy.transform, then merge with a left join back to the original:
df5 = df3.join(df3.groupby(['Account','timestamp1']).transform('sum').add_prefix('sum_'))
print (df5)
Account timestamp no_of_transactions transaction_value \
0 A 2016-07-26 13:33:29 1 15
1 A 2016-07-26 13:43:29 2 50
2 B 2016-07-26 14:33:29 7 80
3 A 2016-07-26 14:43:29 4 22
4 C 2016-07-27 13:06:29 3 10
5 C 2016-07-27 13:23:29 5 10
6 B 2016-07-27 13:43:29 1 11
7 A 2016-07-27 13:56:29 4 30
8 B 2016-07-27 14:44:29 3 40
timestamp1 sum_no_of_transactions sum_transaction_value
0 2016-07-26 13:33:29 3 65
1 2016-07-26 13:33:29 3 65
2 2016-07-26 14:33:29 7 80
3 2016-07-26 14:33:29 4 22
4 2016-07-27 13:06:29 8 20
5 2016-07-27 13:06:29 8 20
6 2016-07-27 13:33:29 1 11
7 2016-07-27 13:33:29 4 30
8 2016-07-27 14:33:29 3 40
cols = ['Account','timestamp','sum_no_of_transactions','sum_transaction_value']
df = df.merge(df5[cols], on=['Account','timestamp'], how='left')
print (df)
Account timestamp no_of_transactions transaction_value \
0 A 2016-07-26 13:43:29 2 50
1 B 2016-07-27 14:44:29 3 40
2 A 2016-07-26 13:33:29 1 15
3 A 2016-07-27 13:56:29 4 30
4 B 2016-07-26 14:33:29 7 80
5 C 2016-07-27 13:23:29 5 10
6 C 2016-07-27 13:06:29 3 10
7 A 2016-07-26 14:43:29 4 22
8 B 2016-07-27 13:43:29 1 11
sum_no_of_transactions sum_transaction_value
0 3 65
1 3 40
2 3 65
3 4 30
4 7 80
5 8 20
6 8 20
7 4 22
8 1 11
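The whole merge_asof pipeline above can be condensed into one self-contained script (a sketch under the same assumptions as the answer; the per-account window starts are built with a groupby-apply on the timestamp column instead of set_index):

```python
import pandas as pd

# Sample frame reconstructed from the question.
df = pd.DataFrame({
    'Account': ['A', 'B', 'A', 'A', 'B', 'C', 'C', 'A', 'B'],
    'timestamp': pd.to_datetime([
        '2016-07-26 13:43:29', '2016-07-27 14:44:29', '2016-07-26 13:33:29',
        '2016-07-27 13:56:29', '2016-07-26 14:33:29', '2016-07-27 13:23:29',
        '2016-07-27 13:06:29', '2016-07-26 14:43:29', '2016-07-27 13:43:29']),
    'no_of_transactions': [2, 3, 1, 4, 7, 5, 3, 4, 1],
    'transaction_value': [50, 40, 15, 30, 80, 10, 10, 22, 11],
})

# One window start per hour, anchored at each account's first timestamp.
starts = (df.groupby('Account')['timestamp']
            .apply(lambda s: pd.Series(pd.date_range(
                s.min(), s.max() + pd.Timedelta(1, unit='h'), freq='h')))
            .reset_index(level=1, drop=True)
            .reset_index(name='timestamp1'))

# Assign each row the latest window start at or before its timestamp.
df3 = pd.merge_asof(df.sort_values('timestamp'),
                    starts.sort_values('timestamp1'),
                    by='Account', left_on='timestamp', right_on='timestamp1')

# Broadcast the per-window sums back onto every row.
sums = (df3.groupby(['Account', 'timestamp1'])
           [['no_of_transactions', 'transaction_value']]
           .transform('sum')
           .add_prefix('sum_'))
out = df3.join(sums)
```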