Rolling-window count of date intervals in pandas

Time: 2017-08-06 22:08:16

Tags: python pandas numpy optimization

I have a history of projects along with their planned start and end dates:

id   planned_start planned_end
1    2017-09-12    2017-09-13
2    2017-09-12    2017-09-14
3    2017-09-12    2017-09-13
4    2017-09-13    2017-09-13
5    2017-09-12    2017-09-12
6    2017-09-12    2017-09-20
7    2017-09-14    2017-09-15
8    2017-09-14    2017-09-20

I want to count, for each project's start date, the number of concurrent projects. Here is my logic:

for project_id in df['id']:
    start_date = df[df['id'] == project_id]['planned_start'].values[0]
    concurrent_projects = df[(df['planned_start'] <= start_date) & (df['planned_end'] >= start_date)]
    df.ix[df['id'] == project_id, 'concurrent_projects'] = concurrent_projects.shape[0]

which produces:

   id planned_start planned_end  concurrent_projects
0   1    2017-09-12  2017-09-13                  5.0
1   2    2017-09-12  2017-09-14                  5.0
2   3    2017-09-12  2017-09-13                  5.0
3   4    2017-09-13  2017-09-13                  5.0
4   5    2017-09-12  2017-09-12                  5.0
5   6    2017-09-12  2017-09-20                  5.0
6   7    2017-09-14  2017-09-15                  4.0
7   8    2017-09-14  2017-09-20                  4.0

However, I know how suboptimal the for loop above is, and time is at a premium. In reality, I have more than 500,000 projects to do this math for. Can someone offer some advice on how to speed it up? I know there must be a pure pandas, or even a pure numpy, solution that kills what I have above.

3 Answers:

Answer 0 (score: 2)

Here is a vectorized way... but it will blow up memory. I'm still working on a better vectorized approach. I have the concept; I'm just working out the details while I eat dinner.

extern "C" {
#include <gifti_io.h>
}

int main(const int argc, const char *argv[]) {
  // rest as before ...

Apologies, I don't have time to explain, but I didn't want to leave you hanging.

s = df.planned_start.values
e = df.planned_end.values

# s_[i, j]: project j starts on or after project i starts
s_ = s >= s[:, None]
# e_[i, j]: project j starts on or before project i ends
e_ = s <= e[:, None]

# (e_ & s_)[i, j]: project j's start date falls inside project i's interval;
# summing over i counts, for each project j, how many projects are active at its start
df.assign(concurrent_projects=(e_ & s_).sum(0))

   id planned_start planned_end  concurrent_projects
0   1    2017-09-12  2017-09-13                    5
1   2    2017-09-12  2017-09-14                    5
2   3    2017-09-12  2017-09-13                    5
3   4    2017-09-13  2017-09-13                    5
4   5    2017-09-12  2017-09-12                    5
5   6    2017-09-12  2017-09-20                    5
6   7    2017-09-14  2017-09-15                    4
7   8    2017-09-14  2017-09-20                    4
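
The two n x n boolean masks above are what blow up memory: at 500,000 rows, each mask would be on the order of 250 GB. Below is a minimal sketch (not the answerer's promised follow-up) of a chunked variant of the same broadcast comparison; the function name and chunk size are made up for illustration, and it assumes the column names from the question.

import numpy as np

def concurrent_counts_chunked(df, chunk=10_000):
    s = df['planned_start'].values
    e = df['planned_end'].values
    out = np.empty(len(df), dtype=np.int64)
    for i in range(0, len(df), chunk):
        block = s[i:i + chunk]                    # start dates to evaluate in this block
        started = s[None, :] <= block[:, None]    # project j started on or before the block date
        not_done = e[None, :] >= block[:, None]   # project j has not yet ended at the block date
        out[i:i + chunk] = (started & not_done).sum(axis=1)
    return out

df['concurrent_projects'] = concurrent_counts_chunked(df)

This keeps peak memory at roughly chunk * n booleans per mask instead of n * n.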

Answer 1 (score: 2)

Here is my solution using crosstab, essentially doing the calculation with a matrix of counts (the input DataFrame is df2):

df=pd.crosstab(df2.planned_end,df2.planned_start,margins=True)
df=pd.concat([df,pd.DataFrame(columns=list(set(df.index)- set(df.columns)))]).fillna(0)
df2['concurrent_projects']=df2.planned_start.map(df.loc['All',:].cumsum()-df.All.cumsum().shift().fillna(0))



df2
Out[112]: 
   id planned_start planned_end  concurrent_projects
0   1    2017-09-12  2017-09-13                  5.0
1   2    2017-09-12  2017-09-14                  5.0
2   3    2017-09-12  2017-09-13                  5.0
3   4    2017-09-13  2017-09-13                  5.0
4   5    2017-09-12  2017-09-12                  5.0
5   6    2017-09-12  2017-09-20                  5.0
6   7    2017-09-14  2017-09-15                  4.0
7   8    2017-09-14  2017-09-20                  4.0
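
The crosstab answer rests on this identity: the number of projects concurrent with a date d equals the number of projects that started on or before d, minus the number that already ended before d (valid as long as planned_start <= planned_end on every row). As a hedged sketch, not part of the original answer, the same idea can be expressed with np.searchsorted on sorted date arrays, which avoids building the crosstab and runs in O(n log n); it assumes the column names from the question:

import numpy as np

starts = np.sort(df2['planned_start'].values)   # all start dates, sorted
ends = np.sort(df2['planned_end'].values)       # all end dates, sorted

d = df2['planned_start'].values
started = np.searchsorted(starts, d, side='right')  # projects with planned_start <= d
ended = np.searchsorted(ends, d, side='left')       # projects with planned_end < d
df2['concurrent_projects'] = started - ended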

Answer 2 (score: 1)

Using apply gives roughly a 3x speedup.

Current approach:

%%timeit
def concurrent_count_using_loop():
    for project_id in df['id']:
        start_date = df[df['id'] == project_id]['planned_start'].values[0]
        concurrent_projects = df[(df['planned_start'] <= start_date) & (df['planned_end'] >= start_date)]
        df.ix[df['id'] == project_id, 'concurrent_projects'] = concurrent_projects.shape[0]

concurrent_count_using_loop()

# 10 loops, best of 3: 21.4 ms per loop

Using apply():

%%timeit
def concurrent_count(project):
    valid_start = df.planned_start <= project["planned_start"]
    valid_end = df.planned_end >= project["planned_start"]
    return (valid_start & valid_end).sum()

df["concurrent_projects"] = df.apply(concurrent_count, axis=1)

# 100 loops, best of 3: 6.94 ms per loop
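
Both the loop and the apply version still perform the full comparison once per row, so they remain O(n^2) in the number of projects. One further judgment call, not from the answer itself: when many projects share the same planned_start, the count only needs to be computed once per unique start date and then mapped back to the rows. A sketch under that assumption:

# Count once per distinct planned_start, then broadcast the result to every row.
counts = {
    d: ((df['planned_start'] <= d) & (df['planned_end'] >= d)).sum()
    for d in df['planned_start'].drop_duplicates()
}
df['concurrent_projects'] = df['planned_start'].map(counts)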