Question

I have a dataframe like this:

Account     timestamp          no_of_transactions      transaction_value
   A     2016-07-26 13:43:29           2                      50
   B     2016-07-27 14:44:29           3                      40
   A     2016-07-26 13:33:29           1                      15
   A     2016-07-27 13:56:29           4                      30
   B     2016-07-26 14:33:29           7                      80
   C     2016-07-27 13:23:29           5                      10
   C     2016-07-27 13:06:29           3                      10
   A     2016-07-26 14:43:29           4                      22
   B     2016-07-27 13:43:29           1                      11

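I want to calculate the velocity of no_of_transactions and transaction_value, i.e. the sum of no_of_transactions and of transaction_value per account per hour.

For example, the minimum timestamp for account A is 2016-07-26 13:33:29. I want the sums over the window from 2016-07-26 13:33:29 to 2016-07-26 14:33:29. Then find the next available minimum timestamp (in this case 2016-07-26 14:43:29) and compute the same sums over the following hour, and so on for each account.

After getting the sums for each 1-hour window, how can I assign them back to the actual dataframe as two new columns? For example, the actual df would look like this: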

A        2016-07-26 13:43:29      2          50            5            116
B        2016-07-27 14:44:29      3          40            3            40 
A        2016-07-26 13:33:29      1          15            5            116
A        2016-07-27 13:56:29      4          30            4            30

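That is, each row simply gets the aggregated values of the window its timestamp falls in.

How can I do this efficiently, so that it does not take a long time to run?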

Answer 1

I think you need groupby and sum together with pandas.Grouper:

df1 = df.groupby(['Account', pd.Grouper(key='timestamp', freq='h')]).sum().reset_index()
print (df1)
  Account           timestamp  no_of_transactions  transaction_value
0       A 2016-07-26 13:00:00                   3                 65
1       A 2016-07-26 14:00:00                   4                 22
2       A 2016-07-27 13:00:00                   4                 30
3       B 2016-07-26 14:00:00                   7                 80
4       B 2016-07-27 13:00:00                   1                 11
5       B 2016-07-27 14:00:00                   3                 40
6       C 2016-07-27 13:00:00                   8                 20
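
For anyone who wants to reproduce this, here is a minimal self-contained sketch. The DataFrame is rebuilt by hand from the table in the question, the value columns are selected explicitly before summing, and the name `hourly` is only illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Account": ["A", "B", "A", "A", "B", "C", "C", "A", "B"],
    "timestamp": pd.to_datetime([
        "2016-07-26 13:43:29", "2016-07-27 14:44:29", "2016-07-26 13:33:29",
        "2016-07-27 13:56:29", "2016-07-26 14:33:29", "2016-07-27 13:23:29",
        "2016-07-27 13:06:29", "2016-07-26 14:43:29", "2016-07-27 13:43:29",
    ]),
    "no_of_transactions": [2, 3, 1, 4, 7, 5, 3, 4, 1],
    "transaction_value": [50, 40, 15, 30, 80, 10, 10, 22, 11],
})

# Group by account and by clock hour of the timestamp, then sum the value columns
hourly = (
    df.groupby(["Account", pd.Grouper(key="timestamp", freq="h")])[
        ["no_of_transactions", "transaction_value"]
    ]
    .sum()
    .reset_index()
)
print(hourly)
```

Note that these bins start on the clock hour (13:00:00, 14:00:00, ...), not on each account's first timestamp.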

Another solution, using floor to truncate the timestamps to hourly precision:

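# select the value columns explicitly: pandas >= 2.0 no longer silently drops
# the original datetime column when summing
df1 = (df.groupby(['Account', df['timestamp'].dt.floor('h')])
         [['no_of_transactions', 'transaction_value']]
         .sum()
         .reset_index())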
print (df1)
  Account           timestamp  no_of_transactions  transaction_value
0       A 2016-07-26 13:00:00                   3                 65
1       A 2016-07-26 14:00:00                   4                 22
2       A 2016-07-27 13:00:00                   4                 30
3       B 2016-07-26 14:00:00                   7                 80
4       B 2016-07-27 13:00:00                   1                 11
5       B 2016-07-27 14:00:00                   3                 40
6       C 2016-07-27 13:00:00                   8                 20
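
To see what the group key looks like, here is a small sketch of `dt.floor` on its own, using three of account A's timestamps from the question (the variable names are illustrative):

```python
import pandas as pd

ts = pd.Series(pd.to_datetime([
    "2016-07-26 13:43:29",
    "2016-07-26 13:33:29",
    "2016-07-26 14:43:29",
]))

# Truncate each timestamp down to the start of its clock hour
floored = ts.dt.floor("h")
print(floored)
# 0   2016-07-26 13:00:00
# 1   2016-07-26 13:00:00
# 2   2016-07-26 14:00:00
```

Rows that share the same floored value end up in the same group.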

Edit, following a comment:

df2 = df.groupby(['Account', pd.Grouper(key='timestamp', freq='2h')]).sum().reset_index()
print (df2)
  Account           timestamp  no_of_transactions  transaction_value
0       A 2016-07-26 12:00:00                   3                 65
1       A 2016-07-26 14:00:00                   4                 22
2       A 2016-07-27 12:00:00                   4                 30
3       B 2016-07-26 14:00:00                   7                 80
4       B 2016-07-27 12:00:00                   1                 11
5       B 2016-07-27 14:00:00                   3                 40
6       C 2016-07-27 12:00:00                   8                 20
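
The 12:00:00 labels above may look surprising for data stamped 13:xx. The 2-hour bins are aligned to even hours counted from midnight, so 13:xx belongs to the 12:00 to 14:00 bin. A quick sketch using `dt.floor("2h")` as a stand-in to illustrate that alignment (it is not what the Grouper calls internally, just the same arithmetic for this data):

```python
import pandas as pd

ts = pd.Series(pd.to_datetime([
    "2016-07-26 13:43:29",
    "2016-07-26 14:43:29",
    "2016-07-27 13:56:29",
]))

# Round down to the nearest multiple of two hours
two_hour_bins = ts.dt.floor("2h")
print(two_hour_bins)
# 0   2016-07-26 12:00:00
# 1   2016-07-26 14:00:00
# 2   2016-07-27 12:00:00
```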

Edit:

First, create a DataFrame of hourly datetimes for each account, starting at that account's minimum timestamp:

def f(x):
    # depending on the data, adding the extra final hour may not be necessary
    rng = pd.date_range(x.index.min(), x.index.max() + pd.Timedelta(1, unit='h'), freq='h')
    return pd.Series(rng)

df2 = (df.set_index('timestamp')
        .groupby('Account')
        .apply(f)
        .reset_index(level=1, drop=True)
        .reset_index(name='timestamp1'))
print (df2)
   Account          timestamp1
0        A 2016-07-26 13:33:29
1        A 2016-07-26 14:33:29
2        A 2016-07-26 15:33:29
3        A 2016-07-26 16:33:29
4        A 2016-07-26 17:33:29
5        A 2016-07-26 18:33:29
6        A 2016-07-26 19:33:29
7        A 2016-07-26 20:33:29
8        A 2016-07-26 21:33:29
9        A 2016-07-26 22:33:29
10       A 2016-07-26 23:33:29
11       A 2016-07-27 00:33:29
12       A 2016-07-27 01:33:29
13       A 2016-07-27 02:33:29
14       A 2016-07-27 03:33:29
15       A 2016-07-27 04:33:29
16       A 2016-07-27 05:33:29
17       A 2016-07-27 06:33:29
18       A 2016-07-27 07:33:29
19       A 2016-07-27 08:33:29
20       A 2016-07-27 09:33:29
21       A 2016-07-27 10:33:29
22       A 2016-07-27 11:33:29
23       A 2016-07-27 12:33:29
24       A 2016-07-27 13:33:29
25       A 2016-07-27 14:33:29
26       B 2016-07-26 14:33:29
...
...
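
The key point is that `pd.date_range` with an hourly frequency starts counting from whatever start value it is given, so the edges keep the :33:29 offset. A minimal sketch with hand-picked start and end values (not the full per-account range):

```python
import pandas as pd

start = pd.Timestamp("2016-07-26 13:33:29")
end = pd.Timestamp("2016-07-26 14:43:29")

# Hourly edges anchored at `start`; one extra hour is added so the last
# timestamp is covered
edges = pd.date_range(start, end + pd.Timedelta(1, unit="h"), freq="h")
print(list(edges))
# [Timestamp('2016-07-26 13:33:29'), Timestamp('2016-07-26 14:33:29'),
#  Timestamp('2016-07-26 15:33:29')]
```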

Then attach these window starts to the original df with merge_asof (both frames must be sorted by their time columns):

df_ = df.sort_values('timestamp')
df2 = df2.sort_values('timestamp1')
df3 = pd.merge_asof(df_, df2, by='Account', left_on='timestamp', right_on='timestamp1')
print (df3)
  Account           timestamp  no_of_transactions  transaction_value  \
0       A 2016-07-26 13:33:29                   1                 15   
1       A 2016-07-26 13:43:29                   2                 50   
2       B 2016-07-26 14:33:29                   7                 80   
3       A 2016-07-26 14:43:29                   4                 22   
4       C 2016-07-27 13:06:29                   3                 10   
5       C 2016-07-27 13:23:29                   5                 10   
6       B 2016-07-27 13:43:29                   1                 11   
7       A 2016-07-27 13:56:29                   4                 30   
8       B 2016-07-27 14:44:29                   3                 40   

           timestamp1  
0 2016-07-26 13:33:29  
1 2016-07-26 13:33:29  
2 2016-07-26 14:33:29  
3 2016-07-26 14:33:29  
4 2016-07-27 13:06:29  
5 2016-07-27 13:06:29  
6 2016-07-27 13:33:29  
7 2016-07-27 13:33:29  
8 2016-07-27 14:33:29
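
In case `merge_asof` is unfamiliar: by default it matches each left row with the last right row whose key is less than or equal to the left key, within the same `by` group. A small sketch with only account A's first three rows (the frames `events` and `edges` are made up for illustration):

```python
import pandas as pd

events = pd.DataFrame({
    "Account": ["A", "A", "A"],
    "timestamp": pd.to_datetime([
        "2016-07-26 13:33:29", "2016-07-26 13:43:29", "2016-07-26 14:43:29",
    ]),
})
edges = pd.DataFrame({
    "Account": ["A", "A"],
    "timestamp1": pd.to_datetime(["2016-07-26 13:33:29", "2016-07-26 14:33:29"]),
})

# Backward search (the default): each event gets the latest edge at or before it
matched = pd.merge_asof(events, edges, by="Account",
                        left_on="timestamp", right_on="timestamp1")
print(matched)
```

The first two events map to the 13:33:29 edge and the third to 14:33:29.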

Aggregate with sum:

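# select the value columns explicitly so the datetime column 'timestamp' is not summed
df4 = (df3.groupby(['Account','timestamp1'], as_index=False)
          [['no_of_transactions','transaction_value']].sum())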
print (df4)
  Account          timestamp1  no_of_transactions  transaction_value
0       A 2016-07-26 13:33:29                   3                 65
1       A 2016-07-26 14:33:29                   4                 22
2       A 2016-07-27 13:33:29                   4                 30
3       B 2016-07-26 14:33:29                   7                 80
4       B 2016-07-27 13:33:29                   1                 11
5       B 2016-07-27 14:33:29                   3                 40
6       C 2016-07-27 13:06:29                   8                 20

If you want to add the sums as new columns to the original DataFrame, first use GroupBy.transform, then left-join the result back to the original with merge:

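# select the value columns explicitly so the datetime column 'timestamp' is not summed
df5 = df3.join(df3.groupby(['Account','timestamp1'])
                  [['no_of_transactions','transaction_value']]
                  .transform('sum')
                  .add_prefix('sum_'))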
print (df5)
  Account           timestamp  no_of_transactions  transaction_value  \
0       A 2016-07-26 13:33:29                   1                 15   
1       A 2016-07-26 13:43:29                   2                 50   
2       B 2016-07-26 14:33:29                   7                 80   
3       A 2016-07-26 14:43:29                   4                 22   
4       C 2016-07-27 13:06:29                   3                 10   
5       C 2016-07-27 13:23:29                   5                 10   
6       B 2016-07-27 13:43:29                   1                 11   
7       A 2016-07-27 13:56:29                   4                 30   
8       B 2016-07-27 14:44:29                   3                 40   

           timestamp1  sum_no_of_transactions  sum_transaction_value  
0 2016-07-26 13:33:29                       3                     65  
1 2016-07-26 13:33:29                       3                     65  
2 2016-07-26 14:33:29                       7                     80  
3 2016-07-26 14:33:29                       4                     22  
4 2016-07-27 13:06:29                       8                     20  
5 2016-07-27 13:06:29                       8                     20  
6 2016-07-27 13:33:29                       1                     11  
7 2016-07-27 13:33:29                       4                     30  
8 2016-07-27 14:33:29                       3                     40
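
The difference from the aggregation step is that `transform('sum')` returns one value per original row instead of one per group, which is why it can be joined straight back on the index. A small sketch on a hand-built three-row frame (the values are account A's first three rows; the names are illustrative):

```python
import pandas as pd

df3 = pd.DataFrame({
    "Account": ["A", "A", "A"],
    "timestamp1": pd.to_datetime([
        "2016-07-26 13:33:29", "2016-07-26 13:33:29", "2016-07-26 14:33:29",
    ]),
    "no_of_transactions": [1, 2, 4],
    "transaction_value": [15, 50, 22],
})

# One sum per row, broadcast within each (Account, timestamp1) group
sums = (
    df3.groupby(["Account", "timestamp1"])[["no_of_transactions", "transaction_value"]]
       .transform("sum")
       .add_prefix("sum_")
)
out = df3.join(sums)
print(out)
```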

cols = ['Account','timestamp','sum_no_of_transactions','sum_transaction_value']
df = df.merge(df5[cols], on=['Account','timestamp'], how='left')
print (df)
  Account           timestamp  no_of_transactions  transaction_value  \
0       A 2016-07-26 13:43:29                   2                 50   
1       B 2016-07-27 14:44:29                   3                 40   
2       A 2016-07-26 13:33:29                   1                 15   
3       A 2016-07-27 13:56:29                   4                 30   
4       B 2016-07-26 14:33:29                   7                 80   
5       C 2016-07-27 13:23:29                   5                 10   
6       C 2016-07-27 13:06:29                   3                 10   
7       A 2016-07-26 14:43:29                   4                 22   
8       B 2016-07-27 13:43:29                   1                 11   

   sum_no_of_transactions  sum_transaction_value  
0                       3                     65  
1                       3                     40  
2                       3                     65  
3                       4                     30  
4                       7                     80  
5                       8                     20  
6                       8                     20  
7                       4                     22  
8                       1                     11
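
One caveat: the hourly grid above is fixed at each account's first timestamp, whereas the question describes restarting the window at the next available timestamp (for account A, the second window would start at 14:43:29 rather than 14:33:29). For this sample data the sums come out the same, but in general they can differ. Below is a rough sketch of the restarting variant. It is my own illustration, not part of the approach above: `window_starts` is a hypothetical helper, windows are assumed to be half-open `[start, start + 1h)`, and the plain Python loop may be slow on very large frames.

```python
import pandas as pd

df = pd.DataFrame({
    "Account": ["A", "B", "A", "A", "B", "C", "C", "A", "B"],
    "timestamp": pd.to_datetime([
        "2016-07-26 13:43:29", "2016-07-27 14:44:29", "2016-07-26 13:33:29",
        "2016-07-27 13:56:29", "2016-07-26 14:33:29", "2016-07-27 13:23:29",
        "2016-07-27 13:06:29", "2016-07-26 14:43:29", "2016-07-27 13:43:29",
    ]),
    "no_of_transactions": [2, 3, 1, 4, 7, 5, 3, 4, 1],
    "transaction_value": [50, 40, 15, 30, 80, 10, 10, 22, 11],
})

def window_starts(ts, width):
    """Hypothetical helper: for sorted timestamps, return the start of the
    window each one falls in, opening a new window at the first timestamp
    that lies outside the current one."""
    starts, current = [], None
    for t in ts:
        if current is None or t >= current + width:
            current = t  # this timestamp opens a new window
        starts.append(current)
    return pd.Series(starts, index=ts.index)

width = pd.Timedelta(hours=1)
value_cols = ["no_of_transactions", "transaction_value"]

# Compute window starts per account on time-sorted rows
ordered = df.sort_values(["Account", "timestamp"])
parts = [window_starts(g["timestamp"], width) for _, g in ordered.groupby("Account")]

# Assignment aligns on the original index, so row order is preserved
result = df.copy()
result["window_start"] = pd.concat(parts)
sums = (
    result.groupby(["Account", "window_start"])[value_cols]
          .transform("sum")
          .add_prefix("sum_")
)
result = result.join(sums)
print(result)
```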
