My dataframe looks like this:
*------------------------------------------------------------*
| started act_id from_state to_state|
*------------------------------------------------------------*
|2019-11-06 05:49:39.571392 2 CREATED ENABLED |
|2019-11-25 22:20:59.150339 2 ENABLED DISABLED |
|2019-11-26 10:22:36.571392 2 DISABLED ENABLED |
|2019-11-14 14:57:02.571392 3 CREATED ENABLED |
|2019-12-06 16:03:44.255603 3 ENABLED DISABLED |
|2019-12-12 12:50:48.571392 3 DISABLED ENABLED |
*------------------------------------------------------------*
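For each act_id, I want to compute the number of days that the act_id stayed in each to_state. In other words, how long does an act_id remain ENABLED or DISABLED before its state changes, for example from ENABLED to DISABLED?
Here is my code: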
import pandas as pd
import numpy as np
df = pd.read_csv('transitions.csv', index_col=0)
df['started'] = pd.to_datetime(df['started'])
df['total_time'] = 0
df['total_time'] = df.groupby(['act_id', 'from_state', 'to_state'])['started'].diff()/np.timedelta64(1, 'D')
df
But the new total_time column shows NaN in every row instead of the number of days:
*------------------------------------------------------------------------------*
| started act_id from_state to_state total_time |
*------------------------------------------------------------------------------*
|2019-11-06 05:49:39.571392 2 CREATED ENABLED NaN |
|2019-11-25 22:20:59.150339 2 ENABLED DISABLED NaN |
|2019-11-26 10:22:36.571392 2 DISABLED ENABLED NaN |
|2019-11-14 14:57:02.571392 3 CREATED ENABLED NaN |
|2019-12-06 16:03:44.255603 3 ENABLED DISABLED NaN |
|2019-12-12 12:50:48.571392 3 DISABLED ENABLED NaN |
*------------------------------------------------------------------------------*
My expected output is:
*------------------------------------------------------------------------------*
| started act_id from_state to_state total_time |
*------------------------------------------------------------------------------*
|2019-11-06 05:49:39.571392 2 CREATED ENABLED 0 |
|2019-11-25 22:20:59.150339 2 ENABLED DISABLED 19 |
|2019-11-26 10:22:36.571392 2 DISABLED ENABLED 1 |
|2019-11-14 14:57:02.571392 3 CREATED ENABLED 0 |
|2019-12-06 16:03:44.255603 3 ENABLED DISABLED 22 |
|2019-12-12 12:50:48.571392 3 DISABLED ENABLED 6 |
*------------------------------------------------------------------------------*
Where did I go wrong?
Answer 0 (score: 1)
I think the problem is that when you group by all 3 columns, each group contains only one row, so the difference is always NaT.
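The point about one-row groups can be checked directly. The sketch below rebuilds the six sample rows from the question inline (an assumption, in place of the transitions.csv file) and shows that every three-column group has size 1, so diff() has nothing to subtract.

```python
import pandas as pd

# Sample rows copied from the question (built inline instead of read from CSV).
df = pd.DataFrame({
    "started": pd.to_datetime([
        "2019-11-06 05:49:39.571392",
        "2019-11-25 22:20:59.150339",
        "2019-11-26 10:22:36.571392",
        "2019-11-14 14:57:02.571392",
        "2019-12-06 16:03:44.255603",
        "2019-12-12 12:50:48.571392",
    ]),
    "act_id": [2, 2, 2, 3, 3, 3],
    "from_state": ["CREATED", "ENABLED", "DISABLED"] * 2,
    "to_state": ["ENABLED", "DISABLED", "ENABLED"] * 2,
})

keys = ["act_id", "from_state", "to_state"]

# Every (act_id, from_state, to_state) combination occurs exactly once ...
group_sizes = df.groupby(keys).size()
print(group_sizes)

# ... so within each group there is no previous row, and diff() is NaT everywhere.
three_key_diff = df.groupby(keys)["started"].diff()
print(three_key_diff)
```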
However, if you group by act_id alone:
df['started'] = pd.to_datetime(df['started'])
df['total_time'] = (df.groupby('act_id')['started'].diff()/np.timedelta64(1, 'D')).fillna(0)
print (df)
started act_id from_state to_state total_time
0 2019-11-06 05:49:39.571392 2 CREATED ENABLED 0.000000
1 2019-11-25 22:20:59.150339 2 ENABLED DISABLED 19.688421
2 2019-11-26 10:22:36.571392 2 DISABLED ENABLED 0.501128
3 2019-11-14 14:57:02.571392 3 CREATED ENABLED 0.000000
4 2019-12-06 16:03:44.255603 3 ENABLED DISABLED 22.046316
5 2019-12-12 12:50:48.571392 3 DISABLED ENABLED 5.866022
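For anyone who wants to reproduce this without the CSV file, here is a self-contained sketch using the sample rows from the question. The extra total_days column is an assumption about how whole-day values might be derived. It rounds down, so it will not match every integer in the question's expected output, where the rounding rule is not stated.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question (built inline instead of read from CSV).
df = pd.DataFrame({
    "started": pd.to_datetime([
        "2019-11-06 05:49:39.571392",
        "2019-11-25 22:20:59.150339",
        "2019-11-26 10:22:36.571392",
        "2019-11-14 14:57:02.571392",
        "2019-12-06 16:03:44.255603",
        "2019-12-12 12:50:48.571392",
    ]),
    "act_id": [2, 2, 2, 3, 3, 3],
    "from_state": ["CREATED", "ENABLED", "DISABLED"] * 2,
    "to_state": ["ENABLED", "DISABLED", "ENABLED"] * 2,
})

# Time elapsed since the previous transition of the same act_id.
elapsed = df.groupby("act_id")["started"].diff()

# Fractional days; the first row of each act_id has no predecessor, so fill with 0.
df["total_time"] = (elapsed / np.timedelta64(1, "D")).fillna(0)

# Whole days, rounded down (hypothetical column, one possible convention).
df["total_days"] = elapsed.dt.days.fillna(0).astype(int)

print(df)
```

Note that each value is the time elapsed since the previous row of the same act_id, which is the time spent in that row's from_state.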
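If you also need to check that each row's from_state matches the to_state of the previous row for the same act_id, shift to_state within each act_id group, replace the first (missing) value with from_state, and compare the result with from_state. The mask is True where the two are equal; pass that mask to the last line of the code above.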
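The consistency check described here could look roughly like the following. This is one interpretation of the description, not the answer's original code, and the mask name m is hypothetical. Rows excluded by the mask end up as NaN in total_time.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question (built inline instead of read from CSV).
df = pd.DataFrame({
    "started": pd.to_datetime([
        "2019-11-06 05:49:39.571392",
        "2019-11-25 22:20:59.150339",
        "2019-11-26 10:22:36.571392",
        "2019-11-14 14:57:02.571392",
        "2019-12-06 16:03:44.255603",
        "2019-12-12 12:50:48.571392",
    ]),
    "act_id": [2, 2, 2, 3, 3, 3],
    "from_state": ["CREATED", "ENABLED", "DISABLED"] * 2,
    "to_state": ["ENABLED", "DISABLED", "ENABLED"] * 2,
})

# Previous row's to_state within the same act_id; the first row of each
# group has no predecessor, so fall back to its own from_state.
prev_to_state = df.groupby("act_id")["to_state"].shift().fillna(df["from_state"])

# True where the chain is consistent: previous to_state == current from_state.
m = prev_to_state.eq(df["from_state"])

# Apply the mask before taking per-act_id differences (assumed reading of
# "pass the mask to the last line of code").
df["total_time"] = (
    df[m].groupby("act_id")["started"].diff() / np.timedelta64(1, "D")
).fillna(0)

print(m)
print(df)
```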
Answer 1 (score: 1)
df['started'] = pd.to_datetime(df['started'])
counts = df.groupby(['act_id', 'from_state', 'to_state']).count()
df = df.merge(counts, how='outer', on=['act_id', 'from_state', 'to_state'])
You may need to rename the dataframe's columns after the merge. I hope this gives you your answer.
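To see what this merge produces, here is a small sketch on the question's sample rows. It is a variant rather than this answer's exact code: it names the count column up front (n_rows is a hypothetical name) instead of renaming after the merge. The merged column is a row count per group, not a duration.

```python
import pandas as pd

# Sample rows copied from the question (built inline instead of read from CSV).
df = pd.DataFrame({
    "started": pd.to_datetime([
        "2019-11-06 05:49:39.571392",
        "2019-11-25 22:20:59.150339",
        "2019-11-26 10:22:36.571392",
        "2019-11-14 14:57:02.571392",
        "2019-12-06 16:03:44.255603",
        "2019-12-12 12:50:48.571392",
    ]),
    "act_id": [2, 2, 2, 3, 3, 3],
    "from_state": ["CREATED", "ENABLED", "DISABLED"] * 2,
    "to_state": ["ENABLED", "DISABLED", "ENABLED"] * 2,
})

keys = ["act_id", "from_state", "to_state"]

# Rows per (act_id, from_state, to_state) group, as an ordinary column.
counts = (
    df.groupby(keys)["started"].count()
      .rename("n_rows")      # hypothetical column name
      .reset_index()
)

# Attach the per-group count to every original row.
merged = df.merge(counts, how="outer", on=keys)
print(merged)
```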