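Pandas: DataFrame calculation between rows

Time: 2017-02-28 20:40:51

Tags: python pandas datetime indexing timedelta

I am trying to read a log and calculate the duration of a certain workflow. The DataFrame containing the log looks like this: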

Timestamp    Workflow    Status
20:31:52     ABC         Started
...
...
20:32:50     ABC         Completed

To calculate the duration, I am using the following code:

start_time = log_text[(log_text['Workflow']=='ABC') & (log_text['Status']=='Started')]['Timestamp']
compl_time = log_text[(log_text['Workflow']=='ABC') & (log_text['Status']=='Completed')]['Timestamp']
duration = compl_time - start_time

The answer I get is:

1    NaT
72   NaT
Name: Timestamp, dtype: timedelta64[ns]

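I think the time difference is not calculated correctly because the indexes differ. Of course, I can get the right answer by explicitly using the index of each row: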

duration = compl_time.loc[72] - start_time[1]

But this seems like an inelegant way of doing things. Is there a better way to achieve the same result?

1 Answer:

Answer 0: (score: 0)

You are right: the problem is the differing indexes, so the two Series cannot be aligned and you get NaT.
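To make the alignment behaviour concrete, here is a minimal sketch. The index labels 1 and 72 mirror the question; the full dates are an assumption added only so the timestamps parse as datetimes.

```python
import pandas as pd

# Two one-element Series whose index labels differ (1 vs. 72), as in the question.
# The date part is made up for illustration; only the times come from the log.
start = pd.Series(pd.to_datetime(["2017-02-28 20:31:52"]), index=[1])
done = pd.Series(pd.to_datetime(["2017-02-28 20:32:50"]), index=[72])

# Subtraction aligns on index labels first. No label is shared,
# so every position in the result is missing (NaT).
misaligned = done - start

# Dropping the labels gives both Series the default index 0, so they line up.
aligned = done.reset_index(drop=True) - start.reset_index(drop=True)

print(misaligned)
print(aligned)
```

Resetting the index is one alternative to converting a side to a NumPy array; both simply remove the labels that prevent the match.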

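The simplest fix is to use values to convert one side to a NumPy array, which drops the index so nothing has to be aligned. This requires both Series to have the same length (here both have length == 1). For the boolean indexing itself, it is best to use loc: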

mask = log_text['Workflow']=='ABC'
start_time = log_text.loc[mask & (log_text['Status']=='Started'), 'Timestamp']
compl_time = log_text.loc[mask & (log_text['Status']=='Completed'),'Timestamp']

print (len(start_time))
1
print (len(compl_time))
1
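If exactly one match per status is expected, the length check can be folded into a small helper that returns a scalar, so no index is involved at all. This is only a sketch: `single_timestamp` is a hypothetical name, and the sample DataFrame (including the date part of the timestamps) is made up to match the columns shown in the question.

```python
import pandas as pd

def single_timestamp(df, workflow, status):
    """Return the one Timestamp matching workflow and status, or raise ValueError."""
    matches = df.loc[(df["Workflow"] == workflow) & (df["Status"] == status), "Timestamp"]
    if len(matches) != 1:
        raise ValueError(f"expected exactly 1 row, found {len(matches)}")
    # .iloc[0] selects by position, so the index label (1, 72, ...) is irrelevant.
    return matches.iloc[0]

# Hypothetical sample log; column names follow the table in the question.
log_text = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2017-02-28 20:31:52", "2017-02-28 20:32:50"]),
    "Workflow": ["ABC", "ABC"],
    "Status": ["Started", "Completed"],
})

# Subtracting two scalar Timestamps yields a single Timedelta.
duration = (single_timestamp(log_text, "ABC", "Completed")
            - single_timestamp(log_text, "ABC", "Started"))
print(duration)
```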

duration = compl_time - start_time.values

print (duration)
1   00:00:58
Name: Timestamp, dtype: timedelta64[ns]

duration = compl_time.values - start_time.values

print (pd.to_timedelta(duration))
TimedeltaIndex(['00:00:58'], dtype='timedelta64[ns]', freq=None)

print (pd.Series(pd.to_timedelta(duration)))
0   00:00:58
dtype: timedelta64[ns]
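If durations are needed for several workflows at once, one possible alternative is to reshape the log so that the start and completion times share an index. The sketch below is not part of the original answer. It assumes each (Workflow, Status) pair appears exactly once, and the second workflow "XYZ" and all dates are invented for illustration.

```python
import pandas as pd

# Hypothetical sample log; column names follow the table in the question.
log_text = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2017-02-28 20:31:52", "2017-02-28 20:32:10",
        "2017-02-28 20:32:50", "2017-02-28 20:33:40",
    ]),
    "Workflow": ["ABC", "XYZ", "ABC", "XYZ"],
    "Status": ["Started", "Started", "Completed", "Completed"],
})

# One row per workflow, one column per status.
# pivot raises an error if a (Workflow, Status) pair is duplicated.
times = log_text.pivot(index="Workflow", columns="Status", values="Timestamp")

# Both columns share the Workflow index, so the subtraction aligns row by row.
durations = times["Completed"] - times["Started"]
print(durations)
```

Because the result is indexed by workflow name, `durations["ABC"]` gives the duration for a single workflow without referring to row labels.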