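I am trying to find a more efficient way to detect overlapping date ranges (start/end dates provided per row) in a dataframe, grouped by a specific column (id).
The dataframe is sorted on the 'from' column.
I think there must be a way to avoid the "double" apply that I am using now...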
import pandas as pd
from datetime import datetime
df = pd.DataFrame(columns=['id', 'from', 'to'], index=range(5),
                  data=[[878, '2006-01-01', '2007-10-01'],
                        [878, '2007-10-02', '2008-12-01'],
                        [878, '2008-12-02', '2010-04-03'],
                        [879, '2010-04-04', '2199-05-11'],
                        [879, '2016-05-12', '2199-12-31']])
df['from'] = pd.to_datetime(df['from'])
df['to'] = pd.to_datetime(df['to'])
    id       from         to
0  878 2006-01-01 2007-10-01
1  878 2007-10-02 2008-12-01
2  878 2008-12-02 2010-04-03
3  879 2010-04-04 2199-05-11
4  879 2016-05-12 2199-12-31
I use "apply" to loop over all the groups, and within each group I use "apply" again on every row:
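def check_date_by_id(df):
    # Bring the previous row's range alongside the current row
    df['prevFrom'] = df['from'].shift()
    df['prevTo'] = df['to'].shift()

    def check_date_by_row(x):
        # The first row of a group has no previous range to compare against
        if pd.isnull(x.prevFrom) or pd.isnull(x.prevTo):
            x['overlap'] = False
            return x
        latest_start = max(x['from'], x.prevFrom)
        earliest_end = min(x['to'], x.prevTo)
        # Inclusive check: ranges that share a single day count as overlapping
        x['overlap'] = int((earliest_end - latest_start).days) + 1 > 0
        return x

    return df.apply(check_date_by_row, axis=1).drop(['prevFrom', 'prevTo'], axis=1)

# group_keys=False keeps the original row index on recent pandas versions
df.groupby('id', group_keys=False).apply(check_date_by_id)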
    id       from         to  overlap
0  878 2006-01-01 2007-10-01    False
1  878 2007-10-02 2008-12-01    False
2  878 2008-12-02 2010-04-03    False
3  879 2010-04-04 2199-05-11    False
4  879 2016-05-12 2199-12-31     True
My code was inspired by the following links:
Answer 0 (score: 6)
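You can shift the 'to' column and subtract the datetimes directly.
from datetime import timedelta
df['overlap'] = (df['to'].shift() - df['from']) > timedelta(0)
Applying this while grouping by 'id' may look like
df['overlap'] = (df.groupby('id')
                   .apply(lambda x: (x['to'].shift() - x['from']) > timedelta(0))
                   .reset_index(level=0, drop=True))
Demo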
>>> df
    id       from         to
0  878 2006-01-01 2007-10-01
1  878 2007-10-02 2008-12-01
2  878 2008-12-02 2010-04-03
3  879 2010-04-04 2199-05-11
4  879 2016-05-12 2199-12-31
>>> df['overlap'] = (df.groupby('id')
...                    .apply(lambda x: (x['to'].shift() - x['from']) > timedelta(0))
...                    .reset_index(level=0, drop=True))
>>> df
    id       from         to  overlap
0  878 2006-01-01 2007-10-01    False
1  878 2007-10-02 2008-12-01    False
2  878 2008-12-02 2010-04-03    False
3  879 2010-04-04 2199-05-11    False
4  879 2016-05-12 2199-12-31     True
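The per-group apply can probably be avoided altogether, since groupby(...).shift() already shifts 'to' within each 'id' without calling a Python function per group. The following is only a sketch, assuming the frame is sorted by 'from' within each 'id'; the helper name flag_overlap_with_previous is made up for illustration. It uses >= so that a range starting on the very day the previous one ends counts as an overlap, as in the question's inclusive check; switch to > for the strict comparison used in this answer.

```python
import pandas as pd

def flag_overlap_with_previous(df):
    """Return a boolean Series: True where a row starts on or before
    the end of the previous row that has the same id."""
    # Previous row's 'to' within each id; NaT for the first row of each id
    prev_to = df.groupby('id')['to'].shift()
    # Comparisons against NaT evaluate to False, so first rows are never flagged
    return prev_to >= df['from']

df = pd.DataFrame({
    'id': [878, 878, 878, 879, 879],
    'from': pd.to_datetime(['2006-01-01', '2007-10-02', '2008-12-02',
                            '2010-04-04', '2016-05-12']),
    'to': pd.to_datetime(['2007-10-01', '2008-12-01', '2010-04-03',
                          '2199-05-11', '2199-12-31']),
})
df['overlap'] = flag_overlap_with_previous(df)
print(df)
```

On the sample data this flags only the last row, the same result as the demo above.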
Answer 1 (score: 1)
Another solution. It could be rewritten to take advantage of Interval.overlaps in pandas 0.24 and later.
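# Note: this code expects the range columns to be named 'start_date' and 'end_date'
def overlapping_groups(group):
    # Return the group's id if any row overlaps another row; otherwise None
    if len(group) > 1:
        for index, row in group.iterrows():
            for index2, row2 in group.drop(index).iterrows():
                int1 = pd.Interval(row2['start_date'], row2['end_date'], closed='both')
                # An endpoint falling inside another row's range means overlap
                if row['start_date'] in int1:
                    return row['id']
                if row['end_date'] in int1:
                    return row['id']

gcols = ['id']
group_output = df.groupby(gcols, group_keys=False).apply(overlapping_groups)
# Groups without overlap produced None/NaN; keep only the ids that were returned
ids_with_overlap = set(group_output[~group_output.isnull()].reset_index(drop=True))
df[df['id'].isin(ids_with_overlap)]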
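The rewrite with Interval.overlaps mentioned in this answer might look roughly like the sketch below. It is not the answerer's code: the function name ids_with_overlaps and its start/end parameters are invented for illustration, and it uses the question's 'from'/'to' column names. Intervals are closed on both sides, so two ranges sharing a single day are treated as overlapping.

```python
import pandas as pd

def ids_with_overlaps(df, start='from', end='to'):
    """Return the set of ids that have at least one pair of overlapping ranges."""
    overlapping = set()
    for key, group in df.groupby('id'):
        intervals = [pd.Interval(s, e, closed='both')
                     for s, e in zip(group[start], group[end])]
        # Compare every unordered pair once; any() stops at the first overlap
        if any(a.overlaps(b)
               for i, a in enumerate(intervals)
               for b in intervals[i + 1:]):
            overlapping.add(key)
    return overlapping

df = pd.DataFrame({
    'id': [878, 878, 878, 879, 879],
    'from': pd.to_datetime(['2006-01-01', '2007-10-02', '2008-12-02',
                            '2010-04-04', '2016-05-12']),
    'to': pd.to_datetime(['2007-10-01', '2008-12-01', '2010-04-03',
                          '2199-05-11', '2199-12-31']),
})
result = ids_with_overlaps(df)
print(df[df['id'].isin(result)])
```

The pairwise comparison is still quadratic in the number of rows per id, so this mainly improves readability over the nested iterrows loops rather than speed.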
Answer 2 (score: 1)
You can compare each 'from' time with the previous 'to' time:
df['to'].shift() > df['from']
Output:
0 False
1 False
2 False
3 False
4 True
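One caveat with comparing against only the immediately preceding row: if an earlier range is long enough to cover several later ones, rows further down can overlap it without overlapping their direct predecessor. A possible way to handle that, sketched here under the assumption that rows are sorted by 'from' within each 'id', is to compare each 'from' with the running maximum of all earlier 'to' values in the same id. The sample data and the variable names below are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1],
    'from': pd.to_datetime(['2020-01-01', '2020-02-01', '2020-03-01']),
    'to': pd.to_datetime(['2020-12-31', '2020-02-10', '2020-03-10']),
}).sort_values(['id', 'from'])

# Comparison with the immediately preceding row of the same id only
prev_only = df.groupby('id')['to'].shift() > df['from']

# Latest end date seen so far within each id ...
running_max_to = df.groupby('id')['to'].cummax()
# ... shifted by one row so the current row's own 'to' is excluded
prev_max_to = running_max_to.groupby(df['id']).shift()
any_earlier = prev_max_to > df['from']

print(pd.DataFrame({'prev_only': prev_only, 'any_earlier': any_earlier}))
```

Here the third row lies inside the first row's range, which only the running-maximum comparison detects.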
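Answer 3 (score: 0)
You can sort by the 'from' column and then use a rolling apply, which is quite efficient, to check whether each row overlaps the previous 'to' value.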
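import numpy as np
# Represent the timestamps as integer nanoseconds so rolling() can work on them
df['from'] = pd.DatetimeIndex(df['from']).astype(np.int64)
df['to'] = pd.DatetimeIndex(df['to']).astype(np.int64)
sdf = df.sort_values(by='from')
# raw=True passes each window as a NumPy array, so r[0] and r[1] are positional
sdf[["from", "to"]].stack().rolling(window=2).apply(lambda r: 1 if r[1] >= r[0] else 0, raw=True).unstack()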
Now the overlapping periods are the rows where from=0.0, meaning the row starts before the previous row's 'to':
   from   to
0   NaN  1.0
1   1.0  1.0
2   1.0  1.0
3   1.0  1.0
4   0.0  1.0