我需要在一个大groupby
查询上运行一个函数,该查询检查两个子组是否有任何重叠日期。以下是单个组tmp
的示例:
ID num start stop subGroup
0 21 10 2006-10-10 2008-10-03 1
1 21 46 2006-10-10 2100-01-01 2
2 21 5 1997-11-25 1998-09-29 1
3 21 42 1998-09-29 2100-01-01 2
4 21 3 1997-01-07 1997-11-25 1
5 21 6 2006-10-10 2008-10-03 1
6 21 47 1998-09-29 2006-10-10 2
7 21 4 1997-01-07 1998-09-29 1
我写的这个函数看起来像这样:
def hasOverlap(tmp):
d2_starts = tmp[tmp['subGroup']==2]['start']
d2_stops = tmp[tmp['subGroup']==2]['stop']
return tmp[tmp['subGroup']==1].apply(lambda row_d1:
(
#Check for part nested D2 in D1
((d2_starts >= row_d1['start']) &
(d2_starts < row_d1['stop']) ) |
((d2_stops >= row_d1['start']) &
(d2_stops < row_d1['stop']) ) |
#Check for fully nested D1 in D2
((d2_stops >= row_d1['stop']) &
(d2_starts <= row_d1['start']) )
).any()
,axis = 1
).any()
问题是这个代码有很多冗余,当我运行查询时:
groups.agg(hasOverlap)
终止需要一段不合理的时间。
是否有任何性能修复(例如使用内置函数或set_index)我可以做些什么来加快速度?
答案 0 :(得分:0)
您是否只想回归&#34; True&#34;或&#34;错误&#34;基于重叠的存在?如果是这样,我只需获取每个子组的日期列表,然后使用pandas isin
方法检查它们是否重叠。
您可以尝试这样的事情:
#split subgroups into separate DF's
group1 = groups[groups.subgroup==1]
group2 = groups[groups.subgroup==2]
#check if any of the start dates from group 2 are in group 1
if len(group1[group1.start.isin(list(group2.start))]) >0:
print "Group1 overlaps group2"
#check if any of the start dates from group 1 are in group 2
if len(group2[group2.start.isin(list(group1.start))]) >0:
print "Group2 overlaps group1"