Counting values from the past x days within a group

Asked: 2017-07-18 19:09:48

Tags: python pandas

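I'm having trouble writing this in a more pythonic and efficient way. I'm trying to group observations by customerid and, for each observation, count how many times that customer was declined in the previous 1, 7, and 30 days.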

import pandas as pd

t = pd.DataFrame({'customerid': [1,1,1,3,3],
                 'leadid': [10,11,12,13,14], 
                 'postdate': ["2017-01-25 10:55:25.727", "2017-02-02 10:55:25.727", "2017-02-27 10:55:25.727", "2017-01-25 10:55:25.727", "2017-01-25 11:55:25.727"], 
                 'post_status': ['Declined', 'Declined', 'Declined', 'Declined', 'Declined']})
t['postdate'] = pd.to_datetime(t['postdate'])

This is the output:

customerid  leadid  post_status postdate
1   10  Declined    2017-01-25 10:55:25.727
1   11  Declined    2017-02-02 10:55:25.727
1   12  Declined    2017-02-27 10:55:25.727
3   13  Declined    2017-01-25 10:55:25.727
3   14  Declined    2017-01-25 11:55:25.727

My current solution is very slow:

from datetime import timedelta

final = []
for customer in t['customerid'].unique():

    temp = t[(t['customerid']==customer) & (t['post_status']=='Declined')].copy()

    for i, row in temp.iterrows():
        date = row['postdate']
        final.append({
            'leadid': row['leadid'],
            'decline_1': temp[(temp['postdate'] <= date) & (temp['postdate']>=date-timedelta(days=1))].shape[0]-1,
            'decline_7': temp[(temp['postdate'] <= date) & (temp['postdate']>=date-timedelta(days=7))].shape[0]-1,
            'decline_30': temp[(temp['postdate'] <= date) & (temp['postdate']>=date-timedelta(days=30))].shape[0]-1
        })

The expected output looks like this:

decline_1   decline_30  decline_7   leadid
0   0   0   10
0   1   0   11
0   1   0   12
0   0   0   13
1   1   1   14

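I think I need some kind of double groupby, where I iterate over each row within a group, but I haven't been able to get anything working other than this double for loop, which takes a long time to complete.

Any help would be greatly appreciated.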

1 Answer:

Answer 0: (score: 0)

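You can try groupby with transform, and use the fact that the sum of a boolean array is the number of True values. That way you don't need to build an extra DataFrame every time you do something like temp[(temp['postdate'] <= date) & (temp['postdate']>=date-timedelta(days=7))].shape[0]-1: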

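def find_declinations(dates, period):
    # dates: the 'postdate' Series of one customer; period: a pd.Timedelta
    results = pd.Series(0, index=dates.index, dtype=int)
    for index, date in dates.items():
        # True for every row whose date lies in [date - period, date]
        in_window = dates.between(date - period, date)
        results.loc[index] = in_window.sum() - 1  # exclude the row itself
    return results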

and call it like this:

results = pd.DataFrame(index=t.index)
for days in [1, 7, 30]:
    period = pd.Timedelta(days=days)
    # one groupby per period; transform keeps the result aligned with t's index
    results['decline%i' % days] = t.groupby('customerid')['postdate'].transform(lambda x: find_declinations(x, period))
results.index = t['leadid']

Result

    decline1    decline7    decline30
leadid          
10  0   0   0
11  0   0   1
12  0   0   1
13  0   0   0
14  1   1   1
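The boolean-sum idea used above can be seen in isolation. The following is a minimal sketch, assuming the three postdate values of customer 1 from the question; the names dates, current and in_window are only illustrative.

```python
import pandas as pd

# The three declines of customer 1 from the question's sample data
dates = pd.Series(pd.to_datetime([
    "2017-01-25 10:55:25.727",
    "2017-02-02 10:55:25.727",
    "2017-02-27 10:55:25.727",
]))

current = dates.iloc[2]  # the row for leadid 12

# between() is inclusive on both ends and returns a boolean Series
in_window = dates.between(current - pd.Timedelta(days=30), current)

# Summing booleans counts the True values; subtract 1 for the row itself
count = int(in_window.sum()) - 1
print(in_window.tolist(), count)
```

No filtered copy of the data is created here; only a boolean mask is built and summed.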

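A slightly different approach

The approach above does one groupby per period. You can speed it up by grouping only once and then computing all periods for each group: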

def find_declinations_df(df, periods=(1, 7, 30, 60)):
    # df: the rows of one customer; periods: window lengths in days
    results = pd.DataFrame(0, index=df.index, columns=list(periods))
    for period in periods:
        window = pd.Timedelta(days=period)
        for index, date in df['postdate'].items():
            in_window = df['postdate'].between(date - window, date)
            results.loc[index, period] = in_window.sum() - 1  # exclude the row itself
    return results

results = pd.concat(find_declinations_df(group) for _, group in t.groupby('customerid'))
results['leadid'] = t['leadid']

Result

    1   7   30  60  leadid
0   0   0   0   0   10
1   0   0   1   1   11
2   0   0   1   2   12
3   0   0   0   0   13
4   1   1   1   1   14
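Both versions above still loop over every row in Python. As a possible alternative that is not part of the original answer, pandas' time-based rolling windows can do the counting per group. This is only a sketch under some assumptions: the rows are sorted by date within each customer, closed='both' is used to mimic the inclusive between() bounds, and rows sharing exactly the same timestamp may be counted differently from the loop versions. The names declined, by_customer and rolled are illustrative.

```python
import pandas as pd

t = pd.DataFrame({
    'customerid': [1, 1, 1, 3, 3],
    'leadid': [10, 11, 12, 13, 14],
    'postdate': pd.to_datetime([
        "2017-01-25 10:55:25.727", "2017-02-02 10:55:25.727",
        "2017-02-27 10:55:25.727", "2017-01-25 10:55:25.727",
        "2017-01-25 11:55:25.727",
    ]),
    'post_status': ['Declined'] * 5,
})

# Time-based rolling windows need dates sorted within each customer.
declined = (t[t['post_status'] == 'Declined']
            .sort_values(['customerid', 'postdate']))
by_customer = declined.set_index('postdate').groupby('customerid')['leadid']

rolled = pd.DataFrame({'leadid': declined['leadid'].to_numpy()})
for days in (1, 7, 30):
    # closed='both' makes the window [date - days, date], like between()
    counts = by_customer.rolling(f'{days}D', closed='both').count()
    # The result comes back in group order, which matches the sort above;
    # subtract 1 so that a row does not count itself.
    rolled[f'decline_{days}'] = counts.to_numpy().astype(int) - 1

print(rolled)
```

On the sample data this should reproduce the decline_1, decline_7 and decline_30 columns from the question's expected output. Whether it is actually faster on real data would need to be measured.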