Pandas: speeding up a multi-row calculation with groupby

Date: 2018-06-19 18:46:14

Tags: python pandas dataframe pandas-groupby

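I am trying to do a calculation for every row of a dataframe that depends on several other rows.

My current solution takes 20 hours to process 200,000 rows, so it is very inefficient. I am hoping that groupby or some other pandas method can help me here.

For example, my data looks like this (you can ignore the dates for now):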

id group start_date end_date   three_yrs_ago_date days_missing
01 23    2005-01-01 2006-01-01 2002-01-01           1
02 23    2006-01-06 2007-01-06 2003-01-06           6
03 23    2007-01-15 2008-01-15 2004-01-15           9
07 17    2014-01-01 2015-02-01 2011-01-01           2
07 23    2015-01-01 2016-02-01 2012-01-01           4

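The goal is to group everything by its group number and then, for each row, sum the days_missing of all other rows in that group that occurred within the last 3 years. That is, rows whose start_date is on or after the current row's three_yrs_ago_date and on or before the current row's end_date.

That's a mouthful, but it boils down to three criteria. So if this were the whole dataset, we would get the following result (date columns removed):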

id group days_missing days_missing_in_last_three_years            
01 23    1            1    # no change: no prior years
02 23    6            7 
03 23    9            16  
07 17    2            2    # no change: only member of its group
07 23    4            4    # no change: other group members more than 3 years ago
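
To make the three criteria concrete, here is a minimal sketch that rebuilds the sample data and applies them literally, row by row. The helper name window_sum is hypothetical, and the sketch is only meant as a slow reference for checking results, not as a fast solution. Note that each row's own start_date falls inside its own window, which is why its own days_missing ends up in the total.

```python
import pandas as pd

# Sample data copied from the table above; the date columns are parsed as datetimes.
df = pd.DataFrame({
    "id": [1, 2, 3, 7, 7],
    "group": [23, 23, 23, 17, 23],
    "start_date": pd.to_datetime(
        ["2005-01-01", "2006-01-06", "2007-01-15", "2014-01-01", "2015-01-01"]),
    "end_date": pd.to_datetime(
        ["2006-01-01", "2007-01-06", "2008-01-15", "2015-02-01", "2016-02-01"]),
    "three_yrs_ago_date": pd.to_datetime(
        ["2002-01-01", "2003-01-06", "2004-01-15", "2011-01-01", "2012-01-01"]),
    "days_missing": [1, 6, 9, 2, 4],
})


def window_sum(row, frame):
    """Hypothetical helper: sum days_missing over the rows meeting all three criteria."""
    same_group = frame["group"] == row["group"]
    not_too_old = frame["start_date"] >= row["three_yrs_ago_date"]
    not_after_end = frame["start_date"] <= row["end_date"]
    return int(frame.loc[same_group & not_too_old & not_after_end, "days_missing"].sum())


# One windowed total per row, in the original row order.
expected = [window_sum(row, df) for _, row in df.iterrows()]
print(expected)  # [1, 7, 16, 2, 4]
```

The printed totals match the days_missing_in_last_three_years column in the table above.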

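I'll show you the code I currently have, but it is slow.

I go through the dataframe row by row, create a temporary dataframe containing all members of the group, and then narrow those members down to the ones that meet the date criteria. It isn't pretty: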

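from tqdm import tqdm  # progress bar

days = []
for index, row in tqdm(df.iterrows(), total=len(df)):
    # moderately slow (~2 hours):
    temp = df[df['group'] == row['group']]                        # same group only
    temp = temp[temp['start_date'] >= row['three_yrs_ago_date']]  # started within the last three years
    temp = temp[temp['end_date'] <= row['start_date']]            # ended on or before this row's start
    add = temp['days_missing'].sum() + row['days_missing']        # earlier rows plus the current row
    days.append(add)
df['days_missing_in_last_three_years'] = days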

I tried two other approaches, but neither worked:

# very slow (~3 hours):
cov.append(df[(df['group'] == row['group']) & (df['start_date'] >= row['three_yrs_ago_date']) & (df['end_date'] <= row['start_date'])]['days_missing'].sum()+row['days_missing'])

# doesn't work - incorrect use of groupby
df['test'] = df[(df.groupby(['group'])['start_date'] >= df.groupby(['group'])['three_yrs_ago_date']) & (df.groupby(['group'])['end_date'] <= df.groupby(['group'])['start_date'])]['days_missing'].sum()

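Is there a better way to do this than breaking the data into smaller temporary dataframes and computing on those?

2 Answers:

Answer 0 (score: 1)

Here is a solution that is probably faster than your approach. Use a for loop over df.groupby('group'), then call apply on each grouped dataframe df_g. For each row, the between method selects the part of df_g whose start_date lies between that row's two dates: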

for name, df_g in df.groupby('group'):
    df.loc[df_g.index,'test'] = df_g.apply(lambda row: (df_g['days_missing'][df_g['start_date']
                                                           .between(row['three_yrs_ago_date'], row['end_date'])].sum()),1)
df['test'] = df['test'].astype(int) #to get integer

The result is as expected:

   id  group start_date   end_date three_yrs_ago_date  days_missing  test
0   1     23 2005-01-01 2006-01-01         2002-01-01             1     1
1   2     23 2006-01-06 2007-01-06         2003-01-06             6     7
2   3     23 2007-01-15 2008-01-15         2004-01-15             9    16
3   7     17 2014-01-01 2015-02-01         2011-01-01             2     2
4   7     23 2015-01-01 2016-02-01         2012-01-01             4     4

Edit: a faster way using numpy functions:

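import numpy as np
for name, df_g in df.groupby('group'):
    # use plain NumPy arrays: recent pandas versions do not accept Series in ufunc.outer
    start = df_g['start_date'].to_numpy()
    # m_g[i, j] is True when row j's start_date lies inside row i's window
    m_g = ( np.less_equal.outer(df_g['three_yrs_ago_date'].to_numpy(), start)
            & np.greater_equal.outer(df_g['end_date'].to_numpy(), start) )
    df.loc[df_g.index,'test'] = np.dot(m_g, df_g['days_missing'].to_numpy())
df['test'] = df['test'].astype(int) #to get integer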
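
The same pairwise mask can also be written with NumPy broadcasting rather than ufunc.outer. The following is a minimal, self-contained sketch on the sample data from the question; the helper name group_window_sums is hypothetical, and the approach is a restatement of the idea above rather than a benchmarked improvement.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 7, 7],
    "group": [23, 23, 23, 17, 23],
    "start_date": pd.to_datetime(
        ["2005-01-01", "2006-01-06", "2007-01-15", "2014-01-01", "2015-01-01"]),
    "end_date": pd.to_datetime(
        ["2006-01-01", "2007-01-06", "2008-01-15", "2015-02-01", "2016-02-01"]),
    "three_yrs_ago_date": pd.to_datetime(
        ["2002-01-01", "2003-01-06", "2004-01-15", "2011-01-01", "2012-01-01"]),
    "days_missing": [1, 6, 9, 2, 4],
})


def group_window_sums(g):
    """Hypothetical helper: windowed sum of days_missing for every row of one group."""
    start = g["start_date"].to_numpy()
    lo = g["three_yrs_ago_date"].to_numpy()
    hi = g["end_date"].to_numpy()
    # mask[i, j] is True when row j's start_date falls inside row i's window.
    mask = (lo[:, None] <= start[None, :]) & (start[None, :] <= hi[:, None])
    # A matrix-vector product sums days_missing over the selected rows.
    totals = mask.astype(np.int64) @ g["days_missing"].to_numpy()
    return pd.Series(totals, index=g.index)


parts = [group_window_sums(g) for _, g in df.groupby("group")]
df["test"] = pd.concat(parts).reindex(df.index).astype(int)
print(df["test"].tolist())  # [1, 7, 16, 2, 4]
```

One caveat with either form: the mask has one entry per pair of rows in a group, so memory grows quadratically with group size and very large groups may need to be processed in chunks.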

Answer 1 (score: 1)

Here is an attempt using .groupby, .loc and .transform:

import numpy as np

conditions = (
    (df['start_date'] >= df['three_yrs_ago_date'])
    & (df['end_date'] <= df['start_date'])
)
df['test'] = np.nan # initialize column, otherwise next line raises KeyError
df.loc[conditions, 'test'] = df.loc[conditions, ].groupby('group')['days_missing'].transform('sum')