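I have a DataFrame in which each row holds information about an event and whether it succeeded. I want to calculate the difference between unsuccessful events. I know how to calculate the difference between fields, but not when a filter is applied.
My DataFrame has the following structure: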
Timestamp Status
0 2012-01-01 OK
1 2012-01-02 OK
2 2012-01-03 FAIL
3 2012-01-05 OK
4 2012-01-06 OK
5 2012-01-07 FAIL
What I want is to calculate, for each row, the time until the next failure, so something like this:
Timestamp Status Days_until_next_fail
0 2012-01-01 OK 2
1 2012-01-02 OK 1
2 2012-01-03 FAIL 0
3 2012-01-05 OK 2
4 2012-01-06 OK 1
5 2012-01-07 FAIL 0
I tried:
df['days_until_next_failure'] = df.Timestamp - df[(df.Status == '1')].Timestamp(+1)
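But it returns NaT, and I could not find anything in the documentation about applying a filter together with shift. One option would be to iterate over the DataFrame starting from the end, but that seems rather inefficient.
Answer 0 (score: 1)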
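A solution for when the Timestamp column is sorted and contains every day of the month (it counts rows rather than subtracting dates):
First reverse the rows, then use cumsum on the FAIL mask to label groups of rows, group the OK rows by that Series with groupby, and number them with cumcount. The FAIL rows end up as NaN, so fill them with 0 using fillna and convert the output column to integers with astype: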
#reverse ordering (copy so that the new column can be assigned safely)
df = df[::-1].copy()
print((df.Status == 'FAIL').astype(int).cumsum())
5 1
4 1
3 1
2 2
1 2
0 2
Name: Status, dtype: int32
#filter the OK rows and number them within each group
df['Days_until_next_fail'] = (df[df.Status == 'OK']
                              .groupby((df.Status == 'FAIL').astype(int).cumsum())
                              .cumcount() + 1)
#replace NaN by 0, convert values to integer
df['Days_until_next_fail'] = df['Days_until_next_fail'].fillna(0).astype(int)
#ordering to original
df.sort_index(inplace=True)
print(df)
Timestamp Status Days_until_next_fail
0 2012-01-01 OK 2
1 2012-01-02 OK 1
2 2012-01-03 FAIL 0
3 2012-01-05 OK 2
4 2012-01-06 OK 1
5 2012-01-07 FAIL 0
A more general solution (all dates must be sorted):
print(df)
Timestamp Status
0 2011-12-28 OK
1 2012-01-02 OK
2 2012-01-03 FAIL
3 2012-01-05 OK
4 2012-01-06 OK
5 2012-01-07 FAIL
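#reverse ordering (copy so that the new column can be assigned safely)
df = df[::-1].copy()
#after reversing, the first row of each group is the failure itself
df['days_until_next_failure'] = (df.groupby((df.Status == 'FAIL').astype(int).cumsum())
                                   .apply(lambda x: x.Timestamp.iloc[0] - x.Timestamp)
                                   .reset_index(level=0, drop=True))
print(df.sort_index())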
Timestamp Status days_until_next_failure
0 2011-12-28 OK 6 days
1 2012-01-02 OK 1 days
2 2012-01-03 FAIL 0 days
3 2012-01-05 OK 2 days
4 2012-01-06 OK 1 days
5 2012-01-07 FAIL 0 days
If you need to convert the timedelta column to int:
import numpy as np
df['fail_days'] = (df.groupby((df.Status == 'FAIL').astype(int).cumsum())
                     .apply(lambda x: ((x.Timestamp.iloc[0] - x.Timestamp) / np.timedelta64(1, 'D'))
                                      .astype(int))
                     .reset_index(level=0, drop=True))
print(df.sort_index())
Timestamp Status fail_days
0 2011-12-28 OK 6
1 2012-01-02 OK 1
2 2012-01-03 FAIL 0
3 2012-01-05 OK 2
4 2012-01-06 OK 1
5 2012-01-07 FAIL 0
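As a possible alternative that is not part of the original answer, the same result can probably be reached without reversing the frame: keep the timestamp only on FAIL rows, back-fill it, and subtract. This is a minimal sketch that rebuilds the sample data from above and assumes Timestamp is sorted and already a datetime column.

```python
import pandas as pd

# Sample data copied from the more general example above
df = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2011-12-28", "2012-01-02", "2012-01-03",
        "2012-01-05", "2012-01-06", "2012-01-07",
    ]),
    "Status": ["OK", "OK", "FAIL", "OK", "OK", "FAIL"],
})

# Keep the timestamp only on FAIL rows (NaT elsewhere), then back-fill so
# every row carries the timestamp of the next failure at or after it.
next_fail = df["Timestamp"].where(df["Status"] == "FAIL").bfill()

# Rows after the last failure would have no next failure and stay missing.
df["days_until_next_failure"] = (next_fail - df["Timestamp"]).dt.days

print(df)
```

Because it subtracts real dates, this variant does not need one row per day.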
Answer 1 (score: 1)
The following gives the days since the last failure, rather than the days until the next one:
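import numpy as np
# Boolean NumPy array: True where the event failed
is_fail = (df.Status != 'OK').to_numpy()
cumulative_fails = is_fail.cumsum()
# Positions of the failures
fail_idx, = is_fail.nonzero()
# Counts rows, which equals days only when there is one row per day
days_since_last_fail = np.arange(len(is_fail))
days_since_last_fail[fail_idx[0]:] -= fail_idx[cumulative_fails[fail_idx[0]:] - 1]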
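If you want the "until next failure" version, you can adapt this yourself, or perhaps simply reverse the original array at the start and reverse the result at the end.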
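The reversal idea could look roughly like the sketch below. The helper names rows_since_last_fail and rows_until_next_fail are made up for illustration, and the functions count rows rather than calendar days, so they only match days when there is one row per day.

```python
import numpy as np

def rows_since_last_fail(is_fail):
    """Rows elapsed since the most recent True (hypothetical helper)."""
    is_fail = np.asarray(is_fail, dtype=bool)
    out = np.arange(len(is_fail), dtype=np.intp)
    fail_idx, = is_fail.nonzero()
    if len(fail_idx) == 0:
        # No failure at all: nothing to measure against, return positions
        return out
    first = fail_idx[0]
    cumulative_fails = is_fail.cumsum()
    # From the first failure on, subtract the position of the latest failure
    out[first:] -= fail_idx[cumulative_fails[first:] - 1]
    return out

def rows_until_next_fail(is_fail):
    """Reverse the input, reuse the function above, reverse the result."""
    is_fail = np.asarray(is_fail, dtype=bool)
    return rows_since_last_fail(is_fail[::-1])[::-1]

# Status column from the question: OK, OK, FAIL, OK, OK, FAIL
status = np.array(["OK", "OK", "FAIL", "OK", "OK", "FAIL"])
result = rows_until_next_fail(status != "OK")
print(result)
```

Rows that come after the last failure have no meaningful value here, so they would need separate handling.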