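I'm having trouble writing this in a more Pythonic and efficient way. I'm trying to group observations by customerid and, for each observation, count how many times that customer was declined in the preceding 1, 7, and 30 days.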
import pandas as pd

t = pd.DataFrame({'customerid': [1, 1, 1, 3, 3],
                  'leadid': [10, 11, 12, 13, 14],
                  'postdate': ["2017-01-25 10:55:25.727", "2017-02-02 10:55:25.727", "2017-02-27 10:55:25.727", "2017-01-25 10:55:25.727", "2017-01-25 11:55:25.727"],
                  'post_status': ['Declined', 'Declined', 'Declined', 'Declined', 'Declined']})
t['postdate'] = pd.to_datetime(t['postdate'])
This is the output:
customerid leadid post_status postdate
1 10 Declined 2017-01-25 10:55:25.727
1 11 Declined 2017-02-02 10:55:25.727
1 12 Declined 2017-02-27 10:55:25.727
3 13 Declined 2017-01-25 10:55:25.727
3 14 Declined 2017-01-25 11:55:25.727
My current solution is very slow:
from datetime import timedelta

final = []
for customer in t['customerid'].unique():
    # All declined observations for this customer
    temp = t[(t['customerid'] == customer) & (t['post_status'] == 'Declined')].copy()
    for i, row in temp.iterrows():
        date = row['postdate']
        # Count rows inside each look-back window; subtract 1 to exclude the row itself
        final.append({
            'leadid': row['leadid'],
            'decline_1': temp[(temp['postdate'] <= date) & (temp['postdate'] >= date - timedelta(days=1))].shape[0] - 1,
            'decline_7': temp[(temp['postdate'] <= date) & (temp['postdate'] >= date - timedelta(days=7))].shape[0] - 1,
            'decline_30': temp[(temp['postdate'] <= date) & (temp['postdate'] >= date - timedelta(days=30))].shape[0] - 1
        })
The expected output looks like this:
decline_1 decline_30 decline_7 leadid
0 0 0 10
0 1 0 11
0 1 0 12
0 0 0 13
1 1 1 14
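I think I need some kind of double groupby in which I iterate over each row within a group, but I can't get anything to work other than this double for loop, which takes a very long time to finish.
Any help would be greatly appreciated.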
Answer 0 (score: 0)
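You can try using groupby and transform, and rely on the fact that the sum of a boolean array is the number of True values. That way you don't need to build an extra DataFrame every time you evaluate something like temp[(temp['postdate'] <= date) & (temp['postdate']>=date-timedelta(days=7))].shape[0]-1: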
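def find_declinations(df, period):
    # df is the Series of postdates for one customer; period is a Timedelta
    results = pd.Series(0, index=df.index, dtype=int)
    for index, date in df.items():
        # Boolean mask of the dates inside [date - period, date]
        time_range = df.between(date - period, date)
        # The sum counts the True values; subtract 1 to exclude the row itself
        results[index] = time_range.sum() - 1
    return results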
and call it like this:
results = pd.DataFrame(index=t.index)
for days in [1, 7, 30]:
    # transform returns a one-column DataFrame here, so select that column explicitly
    results['decline%i' % days] = t.groupby('customerid')[['postdate']].transform(
        lambda x: find_declinations(x, pd.to_timedelta(days, unit='D')))['postdate']
results.index = t['leadid']
Result:
decline1 decline7 decline30
leadid
10 0 0 0
11 0 0 1
12 0 0 1
13 0 0 0
14 1 1 1
This approach performs one groupby per period. You can speed it up by grouping only once and computing all the periods for each group:
def find_declinations_df(df, periods=(1, 7, 30, 60)):
    # One row per observation in the group, one column per period (in days)
    results = pd.DataFrame(0, index=df.index, columns=list(periods))
    for period in periods:
        window = pd.to_timedelta(period, unit='D')
        for index, date in df['postdate'].items():
            # The sum of the boolean mask counts the rows inside the window;
            # subtract 1 to exclude the current observation
            time_range = df['postdate'].between(date - window, date)
            results.loc[index, period] = time_range.sum() - 1
    return results

results = pd.concat(find_declinations_df(group) for _, group in t.groupby('customerid'))
results['leadid'] = t['leadid']
Result:
1 7 30 60 leadid
0 0 0 0 0 10
1 0 0 1 1 11
2 0 0 1 2 12
3 0 0 0 0 13
4 1 1 1 1 14
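Both versions above still loop over every row in Python. As a possible alternative that is not part of the answer above, here is a minimal sketch using pandas' time-based rolling windows on each group. It assumes the rows are sorted by postdate within each customer, and it uses closed='both' so the window matches the inclusive bounds of between. The names declined and counts are only illustrative.

```python
import pandas as pd

t = pd.DataFrame({'customerid': [1, 1, 1, 3, 3],
                  'leadid': [10, 11, 12, 13, 14],
                  'postdate': ["2017-01-25 10:55:25.727", "2017-02-02 10:55:25.727",
                               "2017-02-27 10:55:25.727", "2017-01-25 10:55:25.727",
                               "2017-01-25 11:55:25.727"],
                  'post_status': ['Declined'] * 5})
t['postdate'] = pd.to_datetime(t['postdate'])

# Keep only declined rows and sort them so that each customer's dates are increasing;
# a time-based rolling window needs a monotonic datetime index within each group.
declined = (t[t['post_status'] == 'Declined']
            .sort_values(['customerid', 'postdate'])
            .set_index('postdate'))

# groupby().rolling() returns rows grouped by customerid in their existing order,
# which is the order of `declined`, so the values can be aligned positionally.
counts = pd.DataFrame({'leadid': declined['leadid'].to_numpy()})
for days in (1, 7, 30):
    rolled = (declined.groupby('customerid')['leadid']
              .rolling(f'{days}D', closed='both')
              .count())
    # Each window contains the current row, so subtract 1 to count only earlier declines
    counts[f'decline_{days}'] = rolled.to_numpy().astype(int) - 1

print(counts)
```

On the sample data this should give the same counts as the expected output in the question. Whether it is actually faster on a large dataset is worth measuring rather than assuming.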