I am trying to read a log and calculate the duration of a particular workflow. The DataFrame containing the log looks like this:
Timestamp Workflow Status
20:31:52 ABC Started
...
...
20:32:50 ABC Completed
To calculate the duration, I am using the following code:
start_time = log_text[(log_text['Workflow']=='ABC') & (log_text['Status']=='Started')]['Timestamp']
compl_time = log_text[(log_text['Workflow']=='ABC') & (log_text['Status']=='Completed')]['Timestamp']
duration = compl_time - start_time
The result I get is:
1 NaT
72 NaT
Name: Timestamp, dtype: timedelta64[ns]
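I think the time difference is not calculated correctly because the two results have different indexes. Of course, I can get the correct answer by explicitly using the index of each row: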
duration = compl_time.loc[72] - start_time[1]
But this seems like an inelegant way of doing it. Is there a better way to achieve the same thing?
Answer 0 (score: 0)
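You are right: the problem is the different indexes, so the two outputs cannot be aligned and you get missing values (NaT).
The simplest fix is to use .values to convert one output to a numpy array, but this requires both Series to have the same length (here both have length == 1). For selecting with boolean indexing, it is better to use loc: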
mask = log_text['Workflow']=='ABC'
start_time = log_text.loc[mask & (log_text['Status']=='Started'), 'Timestamp']
compl_time = log_text.loc[mask & (log_text['Status']=='Completed'),'Timestamp']
print (len(start_time))
1
print (len(compl_time))
1
duration = compl_time - start_time.values
print (duration)
1 00:00:58
Name: Timestamp, dtype: timedelta64[ns]
duration = compl_time.values - start_time.values
print (pd.to_timedelta(duration))
TimedeltaIndex(['00:00:58'], dtype='timedelta64[ns]', freq=None)
print (pd.Series(pd.to_timedelta(duration)))
0 00:00:58
dtype: timedelta64[ns]
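To make the alignment behaviour concrete, here is a minimal self-contained sketch. The sample DataFrame is hypothetical: it only mirrors the two rows from the question, with index labels 1 and 72 assumed from the output shown there. It reproduces the NaT result and then shows two ways around it. The reset_index variant is an alternative not used above, and to_numpy() is used in place of .values for the same purpose.

```python
import pandas as pd

# Hypothetical log data mirroring the question; the index labels 1 and 72
# are assumed from the NaT output shown there.
log_text = pd.DataFrame(
    {
        "Timestamp": pd.to_datetime(["20:31:52", "20:32:50"], format="%H:%M:%S"),
        "Workflow": ["ABC", "ABC"],
        "Status": ["Started", "Completed"],
    },
    index=[1, 72],
)

mask = log_text["Workflow"] == "ABC"
start_time = log_text.loc[mask & (log_text["Status"] == "Started"), "Timestamp"]
compl_time = log_text.loc[mask & (log_text["Status"] == "Completed"), "Timestamp"]

# Series arithmetic aligns on index labels: 1 and 72 never match,
# so the result has both labels and every value is NaT.
misaligned = compl_time - start_time

# Converting one operand to a numpy array drops its index,
# so the subtraction becomes positional (lengths must match).
by_values = compl_time - start_time.to_numpy()

# Alternatively, reset both indexes so the labels line up at 0.
by_reset = compl_time.reset_index(drop=True) - start_time.reset_index(drop=True)

print(misaligned)
print(by_values)
print(by_reset)
```

Both workarounds are positional, so they only make sense when each Started row is known to pair with the Completed row at the same position.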