I am trying to compute the time difference between rows, subject to several conditions:
df['open_time'] = pd.to_datetime(df['open_time'], errors='coerce')
df['Time_diff'] = pd.to_datetime(df['Time_diff'], errors='coerce')
for i in range(1, len(df)):
    if df.loc[i, 'JOB_ID'] == df.loc[i-1, 'JOB_ID'] and df.loc[i, 'STATION_IDX'] > df.loc[i-1, 'STATION_IDX']:
        df['Time_diff'] = df.loc[i, 'open_time'] - df.loc[i-1, 'open_time']
open_time is just a plain HH:mm:ss time of day at which the operation was performed, nothing more.
The original dataset is:
JOB_ID DDMMYY STATION_IDX open_time
121663240 04-02-19 25 5:02:19
121663240 04-02-19 26 5:04:00
121663240 04-02-19 27 5:04:42
121651974 04-02-19 25 6:08:15
121651974 04-02-19 27 6:10:28
I don't understand why I keep getting "NaT" for every row of Time_diff:
JOB_ID Time_diff
0 121663240 NaT
1 121663240 NaT
2 121663240 NaT
3 121651974 NaT
4 121651974 NaT
5 121682840 NaT
6 121682840 NaT
I can't seem to find an answer on Google that fits this kind of row-to-row calculation.
The result I expect for the dataset above is:
JOB_ID     ddmmyy      25 to 26  26 to 27  25 to 27
121663240  04-02-2019  101       42        143
121651974  04-02-2019  NaN       NaN       133
Answer 0 (score: 0)
So you want the new column (Time_diff) to contain the difference in open_time between consecutive rows, but only when both rows have the same JOB_ID and STATION_IDX increases.
In pandas you should try to avoid explicit loops, because pandas is not optimized for them. What you probably want here is:
# first build a mask for rows with the same JOB_ID as the previous row and a larger STATION_IDX
mask = (df.JOB_ID==df.JOB_ID.shift())&(df.STATION_IDX>df.STATION_IDX.shift())
# then compute the difference
df.loc[mask,'Time_diff'] = df.loc[mask, 'open_time'] - df.shift().loc[mask, 'open_time']
With your sample data, this gives:
      JOB_ID    DDMMYY  STATION_IDX            open_time Time_diff
0  121663240  04-02-19           25  2019-07-01 05:02:19       NaT
1  121663240  04-02-19           26  2019-07-01 05:04:00  00:01:41
2  121663240  04-02-19           27  2019-07-01 05:04:42  00:00:42
3  121651974  04-02-19           25  2019-07-01 06:08:15       NaT
4  121651974  04-02-19           27  2019-07-01 06:10:28  00:02:13
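To get from there to the wide layout asked for in the question (one row per job, with "25 to 26", "26 to 27" and "25 to 27" in seconds), one possible approach is to put each station in its own column and subtract the columns. The following is only a sketch under some assumptions: the sample frame is rebuilt by hand, each job visits a station at most once, and the names `wide` and `result` are made up for illustration.

```python
import pandas as pd

# Sample data from the question, rebuilt by hand for a self-contained example.
df = pd.DataFrame({
    "JOB_ID": [121663240, 121663240, 121663240, 121651974, 121651974],
    "DDMMYY": ["04-02-19"] * 5,
    "STATION_IDX": [25, 26, 27, 25, 27],
    "open_time": ["5:02:19", "5:04:00", "5:04:42", "6:08:15", "6:10:28"],
})

# Parse the time of day; the date part is a placeholder and cancels out
# when two values are subtracted.
df["open_time"] = pd.to_datetime(df["open_time"], format="%H:%M:%S")

# One row per job, one column per station. A station that a job skipped
# becomes NaT. Assumes (JOB_ID, DDMMYY, STATION_IDX) is unique.
wide = (
    df.set_index(["JOB_ID", "DDMMYY", "STATION_IDX"])["open_time"]
      .unstack("STATION_IDX")
)

# Column differences in seconds; NaT propagates to NaN.
result = pd.DataFrame({
    "25 to 26": (wide[26] - wide[25]).dt.total_seconds(),
    "26 to 27": (wide[27] - wide[26]).dt.total_seconds(),
    "25 to 27": (wide[27] - wide[25]).dt.total_seconds(),
}).reset_index()

print(result)
```

Note that `unstack` sorts the index, so the jobs come out ordered by JOB_ID rather than in their original order. For job 121651974, which has no row for station 26, the "25 to 26" and "26 to 27" columns are NaN, as in the expected result.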