I am working on a problem similar to the one [here][1]. I have a dataframe with two datetime columns, and I need to identify the overlaps.

    import pandas as pd
    from datetime import datetime

    df = pd.DataFrame(columns=['id', 'from', 'to'], index=range(5),
                      data=[[878, '2006-01-01', '2007-10-01'],
                            [878, '2007-10-02', '2008-12-01'],
                            [878, '2008-12-02', '2010-04-03'],
                            [879, '2010-04-04', '2199-05-11'],
                            [879, '2016-05-12', '2199-12-31']])
    df['from'] = pd.to_datetime(df['from'])
    df['to'] = pd.to_datetime(df['to'])

The following works well for flagging overlaps as a boolean variable:

    df['overlap'] = (df.groupby('id')
                       .apply(lambda x: (x['to'].shift() - x['from']) > pd.Timedelta(seconds=0))
                       .reset_index(level=0, drop=True))

which correctly returns:

    [49]:
        id       from         to  overlap
    0  878 2006-01-01 2007-10-01    False
    1  878 2007-10-02 2008-12-01    False
    2  878 2008-12-02 2010-04-03    False
    3  879 2010-04-04 2199-05-11    False
    4  879 2016-05-12 2199-12-31     True

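For reference, the same flag can apparently be computed without `apply` by shifting `to` within each `id`. This is only a minimal sketch: it assumes the rows are already sorted by `from` within each `id` and that only consecutive rows need to be compared. The name `prev_to` is introduced here for illustration.

```python
import pandas as pd

# Same sample data as above, built directly with datetime columns.
df = pd.DataFrame({
    'id': [878, 878, 878, 879, 879],
    'from': pd.to_datetime(['2006-01-01', '2007-10-02', '2008-12-02',
                            '2010-04-04', '2016-05-12']),
    'to': pd.to_datetime(['2007-10-01', '2008-12-01', '2010-04-03',
                          '2199-05-11', '2199-12-31']),
})

# 'to' of the previous row within the same id (NaT for the first row of each id).
prev_to = df.groupby('id')['to'].shift()

# A row overlaps its predecessor when the previous interval ends after this one starts.
# Comparisons against NaT evaluate to False, so first rows are never flagged.
df['overlap'] = prev_to > df['from']
```

Because `groupby(...).shift()` keeps the original index, the result lines up with `df` without any `reset_index` step.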
I would now like to extend the solution by also keeping the start and the end of the overlap whenever one occurs. I tried to have a `pd.Series` returned in the `apply`:

    df.groupby('id').apply(lambda x:
        pd.Series([x['to'].shift() - x['from'] > pd.Timedelta(seconds=0),
                   x['from'],
                   x['to'].shift()],
                  index=['is_overlap', 'start_overlap', 'end_overlap']))

But the resulting dataframe has a completely different shape (no longer 5 rows). What I would like is simply:

    [49]:
        id       from         to  is_overlap start_overlap end_overlap
    0  878 2006-01-01 2007-10-01       False           NaT         NaT
    1  878 2007-10-02 2008-12-01       False           NaT         NaT
    2  878 2008-12-02 2010-04-03       False           NaT         NaT
    3  879 2010-04-04 2199-05-11       False           NaT         NaT
    4  879 2016-05-12 2199-12-31        True    2016-05-12  2199-05-11

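
One way this shape might be obtained is to have the function passed to `apply` return a `pd.DataFrame` indexed like the group rather than a `pd.Series` of Series, so that each input row maps to exactly one output row. The following is a sketch under assumptions, not a definitive answer: the helper `overlap_columns` is a hypothetical name, rows are assumed sorted by `from` within each `id`, and `end_overlap` is taken as the previous row's `to`, as in the attempt above.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [878, 878, 878, 879, 879],
    'from': pd.to_datetime(['2006-01-01', '2007-10-02', '2008-12-02',
                            '2010-04-04', '2016-05-12']),
    'to': pd.to_datetime(['2007-10-01', '2008-12-01', '2010-04-03',
                          '2199-05-11', '2199-12-31']),
})

def overlap_columns(group):
    """Hypothetical helper: one output row per input row of the group."""
    prev_to = group['to'].shift()          # previous row's end date within the group
    is_overlap = prev_to > group['from']   # NaT compares as False
    return pd.DataFrame({
        'is_overlap': is_overlap,
        # Keep the bounds only where an overlap was flagged; NaT elsewhere.
        'start_overlap': group['from'].where(is_overlap),
        'end_overlap': prev_to.where(is_overlap),
    })

# group_keys=False keeps the original row index instead of prepending 'id' to it.
extra = df.groupby('id', group_keys=False).apply(overlap_columns)

# Align the new columns with the original rows by index.
result = df.join(extra)
```

If the end of the actual intersection is wanted rather than the previous row's `to`, the smaller of the two `to` values would presumably be the right bound; for this sample both give 2199-05-11.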
[1]: https://stackoverflow.com/questions/42462218/find-date-range-overlap-in-python