Time difference between a row and its previous/next row for the same customer in a pandas DataFrame

Time: 2019-05-30 05:11:28

Tags: python pandas diff difference datediff

I have a DataFrame:

In [1]: import pandas as pd;import numpy as np                                              

In [2]: df = pd.DataFrame( 
   ...: [ 
   ...:     ['A', '2019-05-10 23:59:59', 'NOT_WORKING'], 
   ...:     ['A', '2019-05-11 00:05:00', 'WORKING'], 
   ...:     ['B', '2019-05-13 07:55:00', 'NOT_WORKING'], 
   ...:     ['B', '2019-05-15 07:57:00', 'WORKING'], 
   ...:     ['B', '2019-05-16 08:03:00', 'NOT_WORKING'], 
   ...: ], columns=['cust', 'event_date', 'status']) 
   ...: df.event_date = pd.to_datetime(df.event_date)                    

In [3]: df.loc[1, 'test'] = 'Y' 
   ...: df.loc[3, 'test'] = 'Y'                                          

In [4]: df                                                               
Out[4]: 
  cust          event_date       status test
0    A 2019-05-10 23:59:59  NOT_WORKING  NaN
1    A 2019-05-11 00:05:00      WORKING    Y
2    B 2019-05-13 07:55:00  NOT_WORKING  NaN
3    B 2019-05-15 07:57:00      WORKING    Y
4    B 2019-05-16 08:03:00  NOT_WORKING  NaN

I need to find the time difference between each test row (test == 'Y') and the previous/next row for the same customer.

This is how I currently do it:

In [5]: df.loc[:, 'prev_time'] = df.event_date.shift(1) 
   ...: df.loc[:, 'prev_cust'] = df.cust.shift(1) 
   ...: df.loc[:, 'next_time'] = df.event_date.shift(-1) 
   ...: df.loc[:, 'next_cust'] = df.cust.shift(-1) 
   ...: df                                                               
Out[5]: 
  cust          event_date  ...           next_time next_cust
0    A 2019-05-10 23:59:59  ... 2019-05-11 00:05:00         A
1    A 2019-05-11 00:05:00  ... 2019-05-13 07:55:00         B
2    B 2019-05-13 07:55:00  ... 2019-05-15 07:57:00         B
3    B 2019-05-15 07:57:00  ... 2019-05-16 08:03:00         B
4    B 2019-05-16 08:03:00  ...                 NaT       NaN

[5 rows x 8 columns]

In [9]: df = (df.loc[df.test=='Y', :]
   ...:       .assign(time_to_prev=lambda row: row.event_date - row.prev_time)
   ...:       .assign(time_to_next=lambda row: row.next_time - row.event_date))
   ...: df.loc[df.cust != df.prev_cust, 'time_to_prev'] = np.nan 
   ...: df.loc[df.cust != df.next_cust, 'time_to_next'] = np.nan 
   ...: df = df.drop(columns=['prev_time', 'prev_cust', 'next_time', 'next_cust']) 
   ...: df                                                               
Out[9]: 
  cust          event_date   status test    time_to_prev    time_to_next
1    A 2019-05-11 00:05:00  WORKING    Y 0 days 00:05:01             NaT
3    B 2019-05-15 07:57:00  WORKING    Y 2 days 00:02:00 1 days 00:06:00

It works, but I'm looking for a more elegant solution that uses groupby, diff, ... How can I do that?

2 Answers:

Answer 0: (score: 1)

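First make sure the rows are sorted by 'cust' and 'event_date', then group by customer and take the difference between consecutive rows within each group. The first row of each customer has no predecessor, so its difference is NaT.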

df = df.sort_values(['cust', 'event_date'])
df.groupby('cust')['event_date'].diff()


       event_date
0             NaT
1 0 days 00:05:01
2             NaT
3 2 days 00:02:00
4 1 days 00:06:00
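
The output above only covers the gap to the previous row. As a minimal sketch (not part of the original answer), the gap to the next row can be obtained the same way by passing `periods=-1` to `diff`, which computes current minus next within each group, and negating the result. The variable names `grouped`, `time_to_prev` and `time_to_next` are illustrative.

```python
import pandas as pd

# Same sample data as in the question, reduced to the two relevant columns.
df = pd.DataFrame(
    [
        ['A', '2019-05-10 23:59:59'],
        ['A', '2019-05-11 00:05:00'],
        ['B', '2019-05-13 07:55:00'],
        ['B', '2019-05-15 07:57:00'],
        ['B', '2019-05-16 08:03:00'],
    ], columns=['cust', 'event_date'])
df['event_date'] = pd.to_datetime(df['event_date'])

# diff() compares neighbouring rows, so the order within each customer matters.
df = df.sort_values(['cust', 'event_date'])

grouped = df.groupby('cust')['event_date']
time_to_prev = grouped.diff()      # current row minus previous row
time_to_next = -grouped.diff(-1)   # diff(-1) is current minus next, so negate

print(pd.DataFrame({'time_to_prev': time_to_prev,
                    'time_to_next': time_to_next}))
```

Rows at a customer boundary get NaT on the side where no neighbour exists for that customer, so no extra masking by `cust` is needed.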

Answer 1: (score: 1)

Use DataFrameGroupBy.diff for the time_to_prev column, then DataFrameGroupBy.shift for the time_to_next column, and finally keep only the Y rows with boolean indexing:

# if the rows are not already sorted by customer and datetime:
# df = df.sort_values(['cust', 'event_date'])
df['time_to_prev'] = df.groupby('cust')['event_date'].diff()
df['time_to_next'] = df.groupby('cust')['time_to_prev'].shift(-1)

df = df[df.test=='Y'].copy()
print (df)          
  cust          event_date   status test    time_to_prev    time_to_next
1    A 2019-05-11 00:05:00  WORKING    Y 0 days 00:05:01             NaT
3    B 2019-05-15 07:57:00  WORKING    Y 2 days 00:02:00 1 days 00:06:00
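
The shift works because, within one customer, a row's time_to_next is simply the following row's time_to_prev. The sketch below wraps those steps in a helper and runs it on deliberately shuffled input to show why the sorting step matters. The function name `add_neighbor_gaps` and the shuffled frame `raw` are hypothetical and do not come from the original answer.

```python
import pandas as pd

def add_neighbor_gaps(frame):
    """Hypothetical helper: add per-customer time_to_prev / time_to_next."""
    # Sort first: diff() and shift() only look at adjacent rows in each group.
    out = frame.sort_values(['cust', 'event_date']).copy()
    out['time_to_prev'] = out.groupby('cust')['event_date'].diff()
    # The next row's time_to_prev is this row's time_to_next.
    out['time_to_next'] = out.groupby('cust')['time_to_prev'].shift(-1)
    return out

# The question's events, in shuffled order (assumed input for illustration).
raw = pd.DataFrame({
    'cust': ['B', 'A', 'B', 'A', 'B'],
    'event_date': pd.to_datetime([
        '2019-05-16 08:03:00',
        '2019-05-11 00:05:00',
        '2019-05-15 07:57:00',
        '2019-05-10 23:59:59',
        '2019-05-13 07:55:00',
    ]),
    'test': [None, 'Y', 'Y', None, None],
})

result = add_neighbor_gaps(raw)
tested = result[result['test'] == 'Y']   # keep only the test rows
print(tested)
```

The original index labels are preserved by the sort, so the filtered rows can still be matched back to the input frame.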