I have a DataFrame that looks like this:
ID DATE Remark
A 2020-06-22 16:10:00 P
A 2020-06-22 11:00:00 F
A 2020-06-22 10:50:00 P
B 2020-06-22 15:15:00 P
B 2020-06-22 15:10:00 F
A 2020-06-22 10:40:00 F
B 2020-06-22 15:00:00 F
I want something like this:
ID DATE Duration Remark
A 2020-06-22 11:10:00 null P
A 2020-06-22 11:00:00 05:10:00 F
A 2020-06-22 10:50:00 null P
A 2020-06-22 10:40:00 00:10:00 F
B 2020-06-22 15:15:00 null P
B 2020-06-22 15:10:00 00:05:00 F
B 2020-06-22 15:00:00 00:10:00 F
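The dtype of DATE is datetime64, and the data is already sorted in descending order.
The Duration for rows with Remark P will always be null or 0.
I think I need to write something like df.groupby('ID')['DATE']...., but how should I code it?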
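Answer 0: (score: 1)
Use .groupby with .cumcount() to number the rows within each group, and filter on > 0 to skip the first row of each group. Then use .shift to compare each row with the previous one and get the time difference:
Input: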
ID DATE Duration Remark
0 A 2020-06-22 11:10:00 null P
1 A 2020-06-22 11:00:00 05:10:00 F
2 A 2020-06-22 10:50:00 null P
3 A 2020-06-22 10:40:00 00:10:00 F
4 B 2020-06-22 15:15:00 null P
5 B 2020-06-22 15:10:00 00:05:00 F
6 B 2020-06-22 15:00:00 00:10:00 F
Code:
import pandas as pd
# DATE is assumed to already be datetime64; if it is not, uncomment the next line.
# df['DATE'] = pd.to_datetime(df['DATE'])
# Previous row's DATE minus this row's DATE, kept only for rows that are not the first of their ID group.
df['Duration'] = (df['DATE'].shift() - df['DATE']).where(df.groupby('ID').cumcount() > 0)
Output:
ID DATE Duration Remark
0 A 2020-06-22 11:10:00 NaT P
1 A 2020-06-22 11:00:00 00:10:00 F
2 A 2020-06-22 10:50:00 00:10:00 P
3 A 2020-06-22 10:40:00 00:10:00 F
4 B 2020-06-22 15:15:00 NaT P
5 B 2020-06-22 15:10:00 00:05:00 F
6 B 2020-06-22 15:00:00 00:10:00 F
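The output above still shows a Duration for the P row at index 2, while the question asks for P rows to stay null. One possible way to cover that, sketched here on the sample data from the question, is to take the previous DATE within each ID and then blank out the P rows with .mask. The explicit sort_values step is an assumption added because the sample rows are not grouped by ID.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": ["A", "A", "A", "B", "B", "A", "B"],
    "DATE": pd.to_datetime([
        "2020-06-22 16:10:00", "2020-06-22 11:00:00", "2020-06-22 10:50:00",
        "2020-06-22 15:15:00", "2020-06-22 15:10:00", "2020-06-22 10:40:00",
        "2020-06-22 15:00:00",
    ]),
    "Remark": ["P", "F", "P", "P", "F", "F", "F"],
})

# Assumption: order by ID, then newest first, so each group is contiguous.
df = df.sort_values(["ID", "DATE"], ascending=[True, False]).reset_index(drop=True)

# DATE of the previous (later) row within the same ID; NaT for the first row of each group.
prev_date = df.groupby("ID")["DATE"].shift()

# Time since the previous row, set to NaT wherever Remark is P.
df["Duration"] = (prev_date - df["DATE"]).mask(df["Remark"] == "P")
print(df)
```

Because the shift happens inside the groupby, no cumcount filter is needed: the first row of each ID has no previous row and gets NaT automatically.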
Answer 1: (score: 0)
import numpy as np
import pandas as pd

def random_dates(start, end, n=10):
    # generate n random timestamps between start and end (second resolution)
    start_u = start.value // 10**9
    end_u = end.value // 10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

start = pd.to_datetime('2015-01-01')
end = pd.to_datetime('2018-01-01')
dates = random_dates(start, end)
remark = ['P', 'F', 'P', 'F', 'P', 'F', 'P', 'F', 'P', 'F']
ids = ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'D']
df = pd.DataFrame(zip(ids, dates, remark), columns=["ID", "DATE", "REMARK"])  # create the DataFrame
# returns the difference between consecutive rows within each ID; format the result as needed
df.groupby("ID")["DATE"].diff()
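One detail worth checking with groupby(...).diff(): it computes the current row minus the previous row, so on data sorted in descending order, as in the question, the result is negative. A minimal sketch, using a small hand-made frame rather than the random dates above, that negates the result to get positive durations:

```python
import pandas as pd

# Small illustrative frame, sorted newest first within each ID.
df = pd.DataFrame({
    "ID": ["A", "A", "B", "B"],
    "DATE": pd.to_datetime([
        "2020-06-22 11:10:00", "2020-06-22 11:00:00",
        "2020-06-22 15:15:00", "2020-06-22 15:10:00",
    ]),
})

# diff() is current minus previous, so descending data gives negative timedeltas.
raw = df.groupby("ID")["DATE"].diff()

# Negate to get the elapsed time as a positive duration; NaT stays NaT.
duration = -raw
print(pd.DataFrame({"raw": raw, "duration": duration}))
```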
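Answer 2: (score: 0)
This may only work for the sample data, but you can sort the data chronologically, compute the differences, and restore the original order by sorting on the index (sort_index). The result is then shifted by one more row.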
df['DATE'] = pd.to_datetime(df['DATE'])
df.sort_values('DATE',inplace=True)
df['Duration'] = df['DATE']-df['DATE'].shift()
df.sort_index(inplace=True)
df['Duration'] = df['Duration'].shift()
df
ID DATE Remark Duration
0 A 2020-06-22 11:10:00 P NaT
1 A 2020-06-22 11:00:00 F 00:10:00
2 B 2020-06-22 15:15:00 P NaT
3 B 2020-06-22 15:10:00 F 00:05:00
4 B 2020-06-22 15:00:00 F 00:10:00
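To illustrate why a global sort may only work for the sample data: once the timestamps of different IDs interleave, a plain diff mixes rows from different IDs, while a grouped diff does not. A small sketch on made-up timestamps (not from the question):

```python
import pandas as pd

# Hypothetical data in which the timestamps of A and B interleave.
df = pd.DataFrame({
    "ID": ["A", "B", "A"],
    "DATE": pd.to_datetime([
        "2020-06-22 10:00:00", "2020-06-22 10:05:00", "2020-06-22 10:10:00",
    ]),
})

# Plain diff compares each row with whichever row precedes it, regardless of ID.
global_diff = df["DATE"].diff()

# Grouped diff only compares rows that share the same ID.
grouped_diff = df.groupby("ID")["DATE"].diff()
print(pd.DataFrame({"ID": df["ID"], "global": global_diff, "grouped": grouped_diff}))
```

Here the B row gets a 5-minute difference against an A row in the global version, whereas the grouped version leaves it as NaT and reports 10 minutes between the two A rows.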