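Pandas DataFrame: for each row, return the count of other rows with overlapping dates

Time: 2019-10-08 19:54:56

Tags: python pandas dataframe

I have a DataFrame containing projects, start dates, and end dates. For each row, I want to return the number of other projects that were in progress when that project started. How do I nest a loop when using df.apply()? I tried using a for loop, but my DataFrame is large and it takes too long.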

import datetime as dt
import pandas as pd

data = {'project': ['A', 'B', 'C'],
        'pr_start_date': [dt.datetime(2018, 9, 1), dt.datetime(2019, 4, 1), dt.datetime(2019, 6, 8)],
        'pr_end_date': [dt.datetime(2019, 6, 15), dt.datetime(2019, 12, 1), dt.datetime(2019, 8, 1)]}

df = pd.DataFrame(data)

def cons_overlap(start):
    # Count the projects whose (start, end) interval strictly contains `start`.
    overlaps = 0
    for i in df.index:
        other_start = df.loc[i, 'pr_start_date']
        other_end = df.loc[i, 'pr_end_date']
        if other_start < start < other_end:
            overlaps += 1

    return overlaps

df['overlap'] = df.apply(lambda row: cons_overlap(row['pr_start_date']), axis=1)

This is the output I am looking for:

  project pr_start_date pr_end_date  overlap
0       A    2018-09-01  2019-06-15        0
1       B    2019-04-01  2019-12-01        1
2       C    2019-06-08  2019-08-01        2

3 Answers:

Answer 0 (score: 3)

I suggest taking advantage of NumPy broadcasting:

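# ends[i, j] is True when project j starts before project i ends
ends = df.pr_start_date.values < df.pr_end_date.values[:, None]
# starts[i, j] is True when project j starts after project i starts
starts = df.pr_start_date.values > df.pr_start_date.values[:, None]
# Summing down each column counts the projects already running when project j starts
df['overlap'] = (ends & starts).sum(0)
print(df)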

Output

  project pr_start_date pr_end_date  overlap
0       A    2018-09-01  2019-06-15        0
1       B    2019-04-01  2019-12-01        1
2       C    2019-06-08  2019-08-01        2

Both starts and ends are 3x3 matrices that are True wherever the condition holds:

# ends   
[[ True  True  True]  
 [ True  True  True]
 [ True  True  True]]

# starts
[[False  True  True]
 [False False  True]
 [False False False]]

The logical & then takes the element-wise intersection of the two matrices, and sum(0) adds up each column.
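
To make the broadcasting step concrete, here is a minimal self-contained sketch that wraps the same comparison in a function. The name `count_running_at_start` is hypothetical and does not come from the answer. Note that this approach builds n x n boolean matrices, so memory use grows quadratically with the number of rows.

```python
import datetime as dt

import pandas as pd


def count_running_at_start(frame):
    """Hypothetical helper: count, per row, the rows whose interval strictly contains its start."""
    starts = frame["pr_start_date"].to_numpy()
    ends = frame["pr_end_date"].to_numpy()
    # Rows index the "other" project i, columns index the current project j.
    started_before = starts[:, None] < starts  # start_i < start_j
    not_finished = ends[:, None] > starts      # end_i > start_j
    # Sum down each column: how many projects i were running when j started.
    return (started_before & not_finished).sum(axis=0)


df = pd.DataFrame({
    "project": ["A", "B", "C"],
    "pr_start_date": [dt.datetime(2018, 9, 1), dt.datetime(2019, 4, 1), dt.datetime(2019, 6, 8)],
    "pr_end_date": [dt.datetime(2019, 6, 15), dt.datetime(2019, 12, 1), dt.datetime(2019, 8, 1)],
})
df["overlap"] = count_running_at_start(df)
print(df["overlap"].tolist())  # [0, 1, 2]
```

A row never counts itself here, because its own start is not strictly greater than itself.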

Answer 1 (score: 2)

This should be faster than a for loop:

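Answer 2 (score: 0)

I assume the rows are sorted by start date, and then check which of the previously started projects have not yet finished. df.index.get_loc(r.name) yields the position of the row being processed.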

df["overlap"] = df.apply(lambda r: df.loc[:df.index.get_loc(r.name), "pr_end_date"].gt(r["pr_start_date"]).sum() - 1, axis=1)
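
For very large frames, the sorting idea can be pushed further. The following is only a sketch and is not taken from any of the answers: it swaps in `numpy.searchsorted` on separately sorted start and end dates, which avoids both the row-wise `apply` and an n x n matrix. It assumes every project ends strictly after it starts, so that the number of projects running at a given start date equals the number of starts strictly before it minus the number of ends at or before it. The name `count_running_sorted` is hypothetical.

```python
import datetime as dt

import numpy as np
import pandas as pd


def count_running_sorted(frame):
    """Hypothetical helper; assumes pr_end_date > pr_start_date in every row."""
    starts = frame["pr_start_date"].to_numpy()
    sorted_starts = np.sort(starts)
    sorted_ends = np.sort(frame["pr_end_date"].to_numpy())
    # Number of projects that started strictly before each start date.
    started = np.searchsorted(sorted_starts, starts, side="left")
    # Number of projects that had already ended at or before each start date.
    finished = np.searchsorted(sorted_ends, starts, side="right")
    return started - finished


df = pd.DataFrame({
    "project": ["A", "B", "C"],
    "pr_start_date": [dt.datetime(2018, 9, 1), dt.datetime(2019, 4, 1), dt.datetime(2019, 6, 8)],
    "pr_end_date": [dt.datetime(2019, 6, 15), dt.datetime(2019, 12, 1), dt.datetime(2019, 8, 1)],
})
df["overlap"] = count_running_sorted(df)
print(df["overlap"].tolist())  # [0, 1, 2]
```

Because the two arrays are sorted inside the function, the input rows do not need to be ordered by start date, and the result does not depend on the DataFrame's index.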