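So far I have managed to write a program that finds full or partial time overlaps (see the rows that share the same number in "group_overl").
Two cases arise: a full overlap or a partial overlap.
For clarity, here is an example:
My df: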
alias begin end duration group_overl
0 M4 2019-10-21 07:39:26.356716 2019-10-21 07:42:02.574268 156.218 1
1 M4 2019-10-21 07:40:03.235327 2019-10-21 07:42:02.222821 118.987 1
2 M4 2019-10-21 07:42:52.299657 2019-10-21 07:43:19.834114 27.534 2
3 M4 2019-10-21 07:44:09.936458 2019-10-21 07:44:37.143862 27.207 3
4 M4 2019-10-21 07:45:27.488518 2019-10-21 07:45:54.122312 26.634 4
5 M4 2019-10-21 07:57:27.564887 2019-10-21 08:26:00.413448 1712.849 11
6 M4 2019-10-21 07:58:06.209659 2019-10-21 08:27:00.413448 1734.204 11
Expected result:
alias begin end duration
0 M4 2019-10-21 07:39:26.356716 2019-10-21 07:42:02.574268 156.218
2 M4 2019-10-21 07:42:52.299657 2019-10-21 07:43:19.834114 27.534
3 M4 2019-10-21 07:44:09.936458 2019-10-21 07:44:37.143862 27.207
4 M4 2019-10-21 07:45:27.488518 2019-10-21 07:45:54.122312 26.634
5 M4 2019-10-21 07:57:27.564887 2019-10-21 08:26:00.413448 1712.849
6 M4 2019-10-21 08:26:00.413448 2019-10-21 08:27:00.413448 60
I have tried several approaches but could not get this to work. Thank you for your time!
Answer 0 (score: 0)
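Because this approach uses shift(), it assumes the DataFrame is already sorted by the begin column, as in your example. It sounds like you do not need to group by alias:
Use shift() to build the two conditions you mentioned. For the first condition (a row that lies entirely inside the previous row's interval), filter the matching rows out. For the second (a row that starts inside the previous row's interval but ends after it), use where() to overwrite the duration.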
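import pandas as pd

# Parse the timestamp columns so they can be compared and subtracted.
df['begin'] = pd.to_datetime(df['begin'])
df['end'] = pd.to_datetime(df['end'])
# c1: the row lies entirely inside the previous row's interval.
c1 = (df['begin'].between(df['begin'].shift(), df['end'].shift())
      & df['end'].between(df['begin'].shift(), df['end'].shift()))
# c2: the row starts inside the previous row's interval but ends after it.
c2 = (df['begin'].between(df['begin'].shift(), df['end'].shift())
      & df['end'].gt(df['end'].shift()))
# Drop fully contained rows; copy() avoids a SettingWithCopyWarning below.
df = df[~c1].copy()
# For partial overlaps, keep only the time past the previous row's end.
# total_seconds() keeps fractions and days, unlike the .dt.seconds component.
df['duration'] = df['duration'].where(~c2, (df['end'] - df['end'].shift()).dt.total_seconds())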
df
Out[1]:
alias begin end duration \
0 M4 2019-10-21 07:39:26.356716 2019-10-21 07:42:02.574268 156.218
2 M4 2019-10-21 07:42:52.299657 2019-10-21 07:43:19.834114 27.534
3 M4 2019-10-21 07:44:09.936458 2019-10-21 07:44:37.143862 27.207
4 M4 2019-10-21 07:45:27.488518 2019-10-21 07:45:54.122312 26.634
5 M4 2019-10-21 07:57:27.564887 2019-10-21 08:26:00.413448 1712.849
6 M4 2019-10-21 07:58:06.209659 2019-10-21 08:27:00.413448 60.000
group_overl
0 1
2 2
3 3
4 4
5 11
6 11
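Note that the output above fixes the duration of row 6 but leaves its begin unchanged, whereas the expected result in the question moves begin to the previous row's end. The following is a minimal, self-contained sketch of one way to do both. It is not part of the original answer: the helper name trim_overlaps is hypothetical, and it assumes the rows are sorted by begin and that each row overlaps at most its immediate predecessor.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "alias": ["M4"] * 7,
    "begin": pd.to_datetime([
        "2019-10-21 07:39:26.356716", "2019-10-21 07:40:03.235327",
        "2019-10-21 07:42:52.299657", "2019-10-21 07:44:09.936458",
        "2019-10-21 07:45:27.488518", "2019-10-21 07:57:27.564887",
        "2019-10-21 07:58:06.209659",
    ]),
    "end": pd.to_datetime([
        "2019-10-21 07:42:02.574268", "2019-10-21 07:42:02.222821",
        "2019-10-21 07:43:19.834114", "2019-10-21 07:44:37.143862",
        "2019-10-21 07:45:54.122312", "2019-10-21 08:26:00.413448",
        "2019-10-21 08:27:00.413448",
    ]),
    "duration": [156.218, 118.987, 27.534, 27.207, 26.634, 1712.849, 1734.204],
})


def trim_overlaps(frame):
    """Hypothetical helper: drop contained rows, trim partial overlaps."""
    prev_begin = frame["begin"].shift()
    prev_end = frame["end"].shift()
    starts_inside = frame["begin"].between(prev_begin, prev_end)
    contained = starts_inside & frame["end"].le(prev_end)  # full overlap
    partial = starts_inside & frame["end"].gt(prev_end)    # partial overlap

    out = frame[~contained].copy()
    partial = partial.loc[out.index]       # align the mask with the kept rows
    kept_prev_end = out["end"].shift()     # end of the previous kept row

    # Move begin to the previous end, then recompute only those durations.
    out.loc[partial, "begin"] = kept_prev_end[partial]
    new_duration = (out["end"] - out["begin"]).dt.total_seconds()
    out.loc[partial, "duration"] = new_duration[partial]
    return out


result = trim_overlaps(df)
print(result)
```

Only the partially overlapping rows have their duration recomputed, so the rounded durations of the untouched rows stay exactly as they were in the input.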
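If you want these conditions to apply only within each group, add a third condition requiring that a row and the row before it share the same alias. Make sure you first sort with
df = df.sort_values(['alias', 'begin', 'end'])
and then: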
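df['begin'] = pd.to_datetime(df['begin'])
df['end'] = pd.to_datetime(df['end'])
# c1: the row lies entirely inside the previous row's interval.
c1 = (df['begin'].between(df['begin'].shift(), df['end'].shift())
      & df['end'].between(df['begin'].shift(), df['end'].shift()))
# c2: the row starts inside the previous row's interval but ends after it.
c2 = (df['begin'].between(df['begin'].shift(), df['end'].shift())
      & df['end'].gt(df['end'].shift()))
# c3: the previous row belongs to the same alias.
c3 = df['alias'] == df['alias'].shift()
# Drop fully contained rows; copy() avoids a SettingWithCopyWarning below.
df = df[~(c1 & c3)].copy()
# total_seconds() keeps fractions and days, unlike the .dt.seconds component.
df['duration'] = df['duration'].where(~(c2 & c3), (df['end'] - df['end'].shift()).dt.total_seconds())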
df
Out[2]:
alias begin end duration \
0 M4 2019-10-21 07:39:26.356716 2019-10-21 07:42:02.574268 156.218
2 M4 2019-10-21 07:42:52.299657 2019-10-21 07:43:19.834114 27.534
3 M4 2019-10-21 07:44:09.936458 2019-10-21 07:44:37.143862 27.207
4 M4 2019-10-21 07:45:27.488518 2019-10-21 07:45:54.122312 26.634
5 M4 2019-10-21 07:57:27.564887 2019-10-21 08:26:00.413448 1712.849
6 M4 2019-10-21 07:58:06.209659 2019-10-21 08:27:00.413448 60.000
group_overl
0 1
2 2
3 3
4 4
5 11
6 11
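As an alternative to the explicit c3 check, the previous row's values can be taken with groupby('alias').shift(), so the first row of each alias never sees a row from another alias. The sketch below is an assumption about how that could look, not the answer's own code: the two-alias frame is made-up illustration data and the name drop_and_trim is hypothetical. Like the code above, it only adjusts duration and leaves begin as it is.

```python
import pandas as pd

# Made-up data: row 1 lies inside row 0 (same alias); row 2 starts a new alias.
df = pd.DataFrame({
    "alias": ["M4", "M4", "M5", "M5"],
    "begin": pd.to_datetime([
        "2019-10-21 07:00:00", "2019-10-21 07:00:30",
        "2019-10-21 07:00:45", "2019-10-21 07:01:00",
    ]),
    "end": pd.to_datetime([
        "2019-10-21 07:01:00", "2019-10-21 07:00:50",
        "2019-10-21 07:02:00", "2019-10-21 07:03:00",
    ]),
})
df["duration"] = (df["end"] - df["begin"]).dt.total_seconds()


def drop_and_trim(frame):
    """Hypothetical helper: per-alias version of the shift() approach."""
    frame = frame.sort_values(["alias", "begin", "end"])
    # Previous begin/end within the same alias; NaT for each alias's first row.
    prev = frame.groupby("alias")[["begin", "end"]].shift()
    starts_inside = frame["begin"].between(prev["begin"], prev["end"])
    contained = starts_inside & frame["end"].le(prev["end"])
    partial = starts_inside & frame["end"].gt(prev["end"])

    out = frame[~contained].copy()
    # End of the previous kept row within the same alias.
    kept_prev_end = out.groupby("alias")["end"].shift()
    new_duration = (out["end"] - kept_prev_end).dt.total_seconds()
    out["duration"] = out["duration"].where(~partial.loc[out.index], new_duration)
    return out


result = drop_and_trim(df)
print(result)
```

Comparisons against NaT evaluate to False, which is why the first row of each alias is never flagged as overlapping.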