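I'm having trouble moving from simple line-by-line code to a function definition or for loop that would make the code below more efficient and readable.
A sample of my data (pulled from SQL) looks like this: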
+----+-----------+------------+---------+-----------+----------+
| id | member_id | max_date | Recency | Frequency | Monetary |
+----+-----------+------------+---------+-----------+----------+
| 1 | 22 | 2016-09-03 | 818 | 10 | 50 |
| 2 | 34 | 2017-06-27 | 521 | 50 | 100 |
| 3 | 123 | 2018-10-26 | 35 | 5 | 80 |
+----+-----------+------------+---------+-----------+----------+
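I'm creating three new tables because I need the cumulative sum and the cumulative sum % for the Recency, Frequency, and Monetary columns, and each column needs to be sorted in a different order: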
rfm_recency = rfm[['Max_Date', 'Id', 'Member_id', 'Recency']].copy()
rfm_recency = rfm_recency.sort_values(['Recency'], ascending=True)
rfm_recency['cum_sum'] = rfm_recency['Recency'].cumsum()
rfm_recency['cum_sum_perc'] = rfm_recency['cum_sum']/rfm_recency['Recency'].sum()
rfm_frequency = rfm[['Id', 'Frequency']].copy()
rfm_frequency = rfm_frequency.sort_values(['Frequency'], ascending=False)
rfm_frequency['cum_sum'] = rfm_frequency['Frequency'].cumsum()
rfm_frequency['cum_sum_perc'] = rfm_frequency['cum_sum']/rfm_frequency['Frequency'].sum()
rfm_monetary = rfm[['Id', 'Monetary']].copy()
rfm_monetary = rfm_monetary.sort_values(['Monetary'], ascending=False)
rfm_monetary['cum_sum'] = rfm_monetary['Monetary'].cumsum()
rfm_monetary['cum_sum_perc'] = rfm_monetary['cum_sum']/rfm_monetary['Monetary'].sum()
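The three blocks above repeat the same sort, cumulative sum, and share steps, so they could be folded into one helper. The following is only a minimal sketch: `add_cumulative_share` is a hypothetical name, and the sample frame assumes the capitalised column names used in my code rather than the lowercase ones in the SQL output.

```python
import pandas as pd

def add_cumulative_share(df, column, ascending):
    """Sort by `column`, then add the running total and its share of the column total."""
    out = df.sort_values(column, ascending=ascending).copy()
    out['cum_sum'] = out[column].cumsum()
    out['cum_sum_perc'] = out['cum_sum'] / out[column].sum()
    return out

# The three sample rows from the table above (column names are an assumption).
rfm = pd.DataFrame({
    'Id': [1, 2, 3],
    'Member_id': [22, 34, 123],
    'Max_Date': ['2016-09-03', '2017-06-27', '2018-10-26'],
    'Recency': [818, 521, 35],
    'Frequency': [10, 50, 5],
    'Monetary': [50, 100, 80],
})

rfm_recency = add_cumulative_share(
    rfm[['Max_Date', 'Id', 'Member_id', 'Recency']], 'Recency', ascending=True)
rfm_frequency = add_cumulative_share(rfm[['Id', 'Frequency']], 'Frequency', ascending=False)
rfm_monetary = add_cumulative_share(rfm[['Id', 'Monetary']], 'Monetary', ascending=False)
```

Each call returns a new frame, so the original `rfm` is left untouched, just as with the explicit `.copy()` calls.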
Then, based on the cum_sum_perc column, I apply a function to each table:
def score(x):
    if x <= 0.20:
        return 5
    elif x <= 0.40:
        return 4
    elif x <= 0.60:
        return 3
    elif x <= 0.80:
        return 2
    else:
        return 1
rfm_recency['r_quintile'] = rfm_recency['cum_sum_perc'].apply(score)
rfm_frequency['f_quintile'] = rfm_frequency['cum_sum_perc'].apply(score)
rfm_monetary['m_quintile'] = rfm_monetary['cum_sum_perc'].apply(score)
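The same thresholds can probably be expressed without a row-by-row `apply`, using `pd.cut` with right-inclusive bins. This is a sketch rather than a drop-in replacement: `score_series` is a hypothetical name, the input values are made up for illustration, and the final `astype(int)` assumes there are no missing values in the column.

```python
import numpy as np
import pandas as pd

def score_series(perc):
    """Map cumulative shares to scores 5..1 using the same cut-offs as score()."""
    # Bins are right-inclusive, so 0.20 falls in (-inf, 0.20] and scores 5,
    # matching the `x <= 0.20` test in score().
    bins = [-np.inf, 0.20, 0.40, 0.60, 0.80, np.inf]
    return pd.cut(perc, bins=bins, labels=[5, 4, 3, 2, 1]).astype(int)

# Illustrative values only, chosen to hit the bin edges.
perc = pd.Series([0.05, 0.20, 0.35, 0.60, 0.61, 1.00])
scores = score_series(perc)
print(scores.tolist())  # [5, 5, 4, 3, 2, 1]
```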
Then I re-sort them on Id and join them together:
rfm_recency = rfm_recency.sort_values('Id')
rfm_frequency = rfm_frequency.sort_values('Id')
rfm_monetary = rfm_monetary.sort_values('Id')
result = rfm_recency.copy()
result = result.join(rfm_frequency[['Frequency', 'f_quintile']])
result = result.join(rfm_monetary[['Monetary', 'm_quintile']])
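Since pandas aligns on the index when a Series is assigned to a column, the sort, score, and join steps might be collapsed into a single loop with no re-sorting at all. The sketch below is one possible arrangement under that assumption; `quintile_for` and the `specs` list are hypothetical names, the sample frame assumes the capitalised column names, and unlike the code above it keeps only the quintile columns, not `cum_sum` or `cum_sum_perc`.

```python
import pandas as pd

def score(x):
    if x <= 0.20:
        return 5
    elif x <= 0.40:
        return 4
    elif x <= 0.60:
        return 3
    elif x <= 0.80:
        return 2
    else:
        return 1

def quintile_for(df, column, ascending):
    """Score each row by its cumulative share of `column` in the given sort order."""
    ordered = df[column].sort_values(ascending=ascending)
    cum_perc = ordered.cumsum() / ordered.sum()
    # The result keeps the original row labels, so it can be assigned straight back.
    return cum_perc.apply(score)

# The three sample rows from the table above (column names are an assumption).
rfm = pd.DataFrame({
    'Id': [1, 2, 3],
    'Member_id': [22, 34, 123],
    'Max_Date': ['2016-09-03', '2017-06-27', '2018-10-26'],
    'Recency': [818, 521, 35],
    'Frequency': [10, 50, 5],
    'Monetary': [50, 100, 80],
})

# (column, sort ascending?, name of the new score column)
specs = [
    ('Recency', True, 'r_quintile'),
    ('Frequency', False, 'f_quintile'),
    ('Monetary', False, 'm_quintile'),
]

result = rfm.copy()
for column, ascending, name in specs:
    result[name] = quintile_for(rfm, column, ascending)  # aligned on the index

print(result[['Id', 'r_quintile', 'f_quintile', 'm_quintile']])
```

With only three rows the scores are coarse, but the row order of `result` stays the same as in `rfm`.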
As a Python newbie, I'll keep going with what I have so far, but I know this can be trimmed down dramatically.