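I am trying to create a cohort analysis that shows how unique purchases develop over time, with the special condition that the cohorts should only consist of users who used a discount coupon on their first order.
My dataset looks like this: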
╔════╦═════════════════╦══════════════╦═══════════╗
║ id ║ submitted_by_id ║ submitted_at ║ coupon_id ║
╠════╬═════════════════╬══════════════╬═══════════╣
║ 1 ║ 1 ║ 2015-01-01 ║ ║
║ 2 ║ 2 ║ 2015-01-02 ║ 1 ║
║ 3 ║ 1 ║ 2015-02-02 ║ 1 ║
║ 4 ║ 3 ║ 2015-02-02 ║ ║
║... ║ ... ║ ... ║ ... ║
╚════╩═════════════════╩══════════════╩═══════════╝
I can create a cohort analysis on the whole dataset like this:
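import numpy as np
import pandas as pd

def cohort_period(df):
    # Helper from the blog post credited below: numbers the periods 1, 2, ... within one cohort.
    df['CohortPeriod'] = np.arange(len(df)) + 1
    return df

data_set = list(data_set)
df = pd.DataFrame(data_set)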
df['OrderPeriod'] = df.submitted_at.apply(lambda x: x.strftime('%Y-%m'))
df.set_index('submitted_by_id', inplace=True)
df['CohortGroup'] = df.groupby(level=0)['submitted_at'].min().apply(lambda x: x.strftime('%Y, %m'))
df.reset_index(inplace=True)
grouped = df.groupby(['CohortGroup', 'OrderPeriod'])
cohorts = grouped.agg({
    'submitted_by_id': pd.Series.nunique,
    'id': pd.Series.nunique,
})
cohorts.rename(columns={'id': 'TotalOrdersInPeriod', 'submitted_by_id': 'TotalUsers'}, inplace=True)
cohorts = cohorts.groupby(level=0).apply(cohort_period)
cohorts.reset_index(inplace=True)
cohorts.set_index(['CohortGroup', 'CohortPeriod'], inplace=True)
cohort_group_size = cohorts['TotalUsers'].groupby(level=0).first()
cohorts['TotalOrders'] = cohorts.groupby(level=0).TotalOrdersInPeriod.cumsum()
total_buys = cohorts['TotalOrders'].unstack(0).divide(cohort_group_size, axis=1)
This gives me the following cohorts:
CohortGroup 2015, 01 2015, 02
CohortPeriod
1 1 1
2 1.5
What I want is to somehow restrict my cohorts to customers whose first order has a coupon_id.
My result table should then look like this:
CohortGroup 2015, 01 2015, 02
CohortPeriod
1 1 NaN
2 1
How can I do this?
Credit goes to http://www.gregreda.com/2015/08/23/cohort-analysis-with-python/
Answer 0 (score: 0)
Starting from:
   id submitted_by_id submitted_at coupon_id
0 1 1 2015-01-01 NaN
1 2 2 2015-01-02 1
2 3 1 2015-02-02 1
3 4 3 2015-02-02 NaN
You can get the cohort groups and cohort periods as follows:
df['order_period'] = pd.to_datetime(df.submitted_at).dt.to_period('M')
df = df.rename(columns={'submitted_by_id': 'customer_id'}).drop(['id', 'submitted_at'], axis=1)
df['cohort_group'] = df.sort_values('order_period').groupby('customer_id')['order_period'].transform(lambda x: x.head(1))
df['cohort_period'] = df.groupby(['cohort_group', 'customer_id'])['order_period'].rank()
customer_id coupon_id order_period cohort_group cohort_period
0 1 NaN 2015-01 2015-01 1
1 2 1 2015-01 2015-01 1
2 1 1 2015-02 2015-01 2
3 3 NaN 2015-02 2015-02 1
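A variation on the step above, as a minimal sketch rather than the answer's own method: instead of ranking each customer's order periods, it takes the cohort period to be the number of calendar months since the customer's first order, plus one. The two agree on the sample data but differ when a customer skips a month (the month-based version leaves a gap, the rank-based one does not). The names `orders`, `month_index` and `first_month` are made up for this illustration.

```python
import pandas as pd

orders = pd.DataFrame({
    'customer_id': [1, 2, 1, 3],
    'submitted_at': pd.to_datetime(
        ['2015-01-01', '2015-01-02', '2015-02-02', '2015-02-02']),
})

# Months counted from year 0, so that differences are gaps in calendar months.
month_index = orders['submitted_at'].dt.year * 12 + orders['submitted_at'].dt.month
first_month = month_index.groupby(orders['customer_id']).transform('min')

# Cohort group: the month of each customer's earliest order.
orders['cohort_group'] = (
    orders.groupby('customer_id')['submitted_at'].transform('min').dt.to_period('M')
)
# Cohort period: 1 for the first month, 2 for the month after, and so on.
orders['cohort_period'] = month_index - first_month + 1

print(orders)
```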
Now you can pick out the customers who used a coupon in their first cohort_period (there is only one in the sample data):
coupon_customers = df.groupby(['cohort_group', 'customer_id']).apply(lambda x: x.sort_values('cohort_period').iloc[0]).dropna(subset=['coupon_id']).customer_id.tolist()
[2]
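Another way to find the same customers, sketched here under the assumption that the frame still has the question's original columns `submitted_by_id`, `submitted_at` and `coupon_id` (the answer above renames and drops some of them): sort by time, keep each customer's earliest row, and check whether its `coupon_id` is set. The function name `first_order_coupon_customers` is hypothetical.

```python
import pandas as pd

def first_order_coupon_customers(orders):
    """Return the ids of customers whose earliest order has a coupon_id."""
    first_orders = (
        orders.sort_values('submitted_at', kind='mergesort')  # stable sort
              .drop_duplicates('submitted_by_id', keep='first')
    )
    used_coupon = first_orders['coupon_id'].notna()
    return sorted(first_orders.loc[used_coupon, 'submitted_by_id'].tolist())

orders = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'submitted_by_id': [1, 2, 1, 3],
    'submitted_at': pd.to_datetime(
        ['2015-01-01', '2015-01-02', '2015-02-02', '2015-02-02']),
    'coupon_id': [None, 1, 1, None],
})

print(first_order_coupon_customers(orders))  # [2]
```

Customer 1 also used a coupon, but only on the second order, so only customer 2 qualifies.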
To get customer_id indexed by cohort_group and cohort_period:
df = df.set_index(['cohort_group', 'cohort_period']).loc[:, 'customer_id'].to_frame()
customer_id
cohort_group cohort_period
2015-01 1 1
1 2
2 1
2015-02 1 3
You can then get the cohort count including the customers with coupons:
cohort_count = df.groupby(level=['cohort_group', 'cohort_period']).count().unstack('cohort_period')
cohort_period 1 2
cohort_group
2015-01 2 1
2015-02 1 NaN
Or, filtering out coupon_customers, the count without coupons:
cohort_count_no_coupons = df[~df.isin(coupon_customers)].groupby(level=['cohort_group', 'cohort_period']).count().unstack('cohort_period')
cohort_period 1 2
cohort_group
2015-01 1 1
2015-02 1 NaN
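The two counts above can also be sketched with `nunique` on an ordinary column instead of `count` on an indexed frame, which avoids counting a customer twice if they order more than once in a period. This assumes `cohort_group` and `cohort_period` are plain columns; the helper name `cohort_counts` and the hard-coded `coupon_customers = [2]` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'customer_id': [1, 2, 1, 3],
    'cohort_group': ['2015-01', '2015-01', '2015-01', '2015-02'],
    'cohort_period': [1, 1, 2, 1],
})
coupon_customers = [2]  # customers whose first order used a coupon

def cohort_counts(frame):
    """Unique customers per cohort group (rows) and cohort period (columns)."""
    return (
        frame.groupby(['cohort_group', 'cohort_period'])['customer_id']
             .nunique()
             .unstack('cohort_period')
    )

all_counts = cohort_counts(df)
no_coupon_counts = cohort_counts(df[~df['customer_id'].isin(coupon_customers)])

print(all_counts)
print(no_coupon_counts)
```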
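Answer 1 (score: 0)
Thanks to Stefan for pointing me in the right direction; this is what I ended up doing. I will mark Stefan's answer as accepted, because it is what led me to this solution.
I extended the test dataset a little, so it now looks like this: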
coupon_id final_amount id submitted_at submitted_by_id OrderPeriod
0 NaN 100 1 2015-01-01 14:30:00 1 2015-01
1 1 100 2 2015-01-02 14:31:00 2 2015-01
2 1 100 3 2015-02-02 14:31:00 1 2015-02
3 NaN 100 4 2015-02-02 14:31:00 3 2015-02
4 NaN 100 5 2015-02-02 14:31:00 2 2015-02
5 2 100 6 2015-01-02 14:31:00 4 2015-01
6 2 100 7 2015-02-03 14:31:00 5 2015-02
7 NaN 100 8 2015-01-03 14:31:00 2 2015-01
Here it is as a list of Python dictionaries:
import datetime
from decimal import Decimal

sample_data = [
{'id': 1,
'submitted_by_id': 1,
'submitted_at': datetime.datetime(2015, 1, 1, 14, 30),
'final_amount': Decimal('100'),
'coupon_id': None,
},
{'id': 2,
'submitted_by_id': 2,
'submitted_at': datetime.datetime(2015, 1, 2, 14, 31),
'final_amount': Decimal('100'),
'coupon_id': 1,
},
{'id': 3,
'submitted_by_id': 1,
'submitted_at': datetime.datetime(2015, 2, 2, 14, 31),
'final_amount': Decimal('100'),
'coupon_id': 1,
},
{'id': 4,
'submitted_by_id': 3,
'submitted_at': datetime.datetime(2015, 2, 2, 14, 31),
'final_amount': Decimal('100'),
'coupon_id': None,
},
{'id': 5,
'submitted_by_id': 2,
'submitted_at': datetime.datetime(2015, 2, 2, 14, 31),
'final_amount': Decimal('100'),
'coupon_id': None,
},
{'id': 6,
'submitted_by_id': 4,
'submitted_at': datetime.datetime(2015, 1, 2, 14, 31),
'final_amount': Decimal('100'),
'coupon_id': 2,
},
{'id': 7,
'submitted_by_id': 5,
'submitted_at': datetime.datetime(2015, 2, 3, 14, 31),
'final_amount': Decimal('100'),
'coupon_id': 2,
},
{'id': 8,
'submitted_by_id': 2,
'submitted_at': datetime.datetime(2015, 1, 3, 14, 31),
'final_amount': Decimal('100'),
'coupon_id': None,
},
]
Here is the solution:
df = pd.DataFrame(sample_data)
df['OrderPeriod'] = df.submitted_at.dt.to_period('M')
# group selects which customers to keep: 'used_coupon' or 'did_not_use_coupon';
# any other value keeps all customers
group = 'used_coupon'
if group in ['used_coupon', 'did_not_use_coupon']:
    df2 = df.copy()
    df2['CohortGroup'] = df2.sort_values('OrderPeriod').\
        groupby('submitted_by_id')['OrderPeriod'].transform(lambda x: x.head(1))
    df2['CohortPeriod'] = df2.groupby(
        ['OrderPeriod', 'submitted_by_id']
    )['OrderPeriod'].rank()
    # customers whose earliest order (by submitted_at) has a coupon_id
    coupon_customers = df2.groupby(['CohortGroup', 'submitted_by_id']).apply(
        lambda x: x.sort_values('submitted_at').iloc[0]
    ).dropna(subset=['coupon_id']).submitted_by_id.tolist()
    # coupon_customers = [2, 4, 5]
    if group == 'used_coupon':
        # delete rows in the original dataframe where the customer is not
        # in the coupon_customers list
        df = df[df['submitted_by_id'].isin(coupon_customers)]
    else:
        # group == 'did_not_use_coupon':
        # delete rows in the original dataframe where the customer is
        # in the coupon_customers list
        df = df[~df['submitted_by_id'].isin(coupon_customers)]
# From here it's just the same code as I originally used
df.set_index('submitted_by_id', inplace=True)
df['CohortGroup'] = df.groupby(level=0)['submitted_at'].min().apply(lambda x: x.to_period('M'))
df.reset_index(inplace=True)
print(df.head())
grouped = df.groupby(['CohortGroup', 'OrderPeriod'])
cohorts = grouped.agg({
    'submitted_by_id': pd.Series.nunique,
    'id': pd.Series.nunique,
})
cohorts.rename(columns={'id': 'TotalOrdersInPeriod', 'submitted_by_id': 'TotalUsers'}, inplace=True)
cohorts = cohorts.groupby(level=0).apply(cohort_period)
cohorts.reset_index(inplace=True)
cohorts.set_index(['CohortGroup', 'CohortPeriod'], inplace=True)
cohort_group_size = cohorts['TotalUsers'].groupby(level=0).first()
cohorts['TotalOrders'] = cohorts.groupby(level=0).TotalOrdersInPeriod.cumsum()
total_buys = cohorts['TotalOrders'].unstack(0).divide(cohort_group_size, axis=1)
Result for group = 'used_coupon':
CohortPeriod 1 2
CohortGroup
2015-01 1.50 2.00
2015-02 1.00
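For reference, the whole pipeline can be condensed into one function. This is only a sketch under a few assumptions: order periods are stored as 'YYYY-MM' strings rather than pandas Periods, the customer list `[2, 4, 5]` is passed in directly (it is the `coupon_customers` value noted in the code comment above), and the function name `cumulative_orders_per_user` is made up. It relies on named aggregation, so it needs a reasonably recent pandas.

```python
import pandas as pd

def cumulative_orders_per_user(orders, customers=None):
    """Cumulative orders per cohort member, by cohort group and cohort period.

    customers: optional list of customer ids to restrict the cohorts to.
    """
    df = orders.copy()
    if customers is not None:
        df = df[df['submitted_by_id'].isin(customers)].copy()
    df['OrderPeriod'] = df['submitted_at'].dt.strftime('%Y-%m')
    df['CohortGroup'] = (
        df.groupby('submitted_by_id')['submitted_at'].transform('min').dt.strftime('%Y-%m')
    )
    per_period = (
        df.groupby(['CohortGroup', 'OrderPeriod'])
          .agg(TotalUsers=('submitted_by_id', 'nunique'),
               TotalOrdersInPeriod=('id', 'nunique'))
          .reset_index()
    )
    # Rows are sorted by (CohortGroup, OrderPeriod), so cumcount numbers the periods.
    per_period['CohortPeriod'] = per_period.groupby('CohortGroup').cumcount() + 1
    per_period['TotalOrders'] = (
        per_period.groupby('CohortGroup')['TotalOrdersInPeriod'].cumsum()
    )
    # Cohort size = number of unique users in the cohort's first period.
    cohort_size = per_period.groupby('CohortGroup')['TotalUsers'].transform('first')
    per_period['OrdersPerUser'] = per_period['TotalOrders'] / cohort_size
    return per_period.pivot(index='CohortGroup', columns='CohortPeriod',
                            values='OrdersPerUser')

orders = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8],
    'submitted_by_id': [1, 2, 1, 3, 2, 4, 5, 2],
    'submitted_at': pd.to_datetime([
        '2015-01-01 14:30', '2015-01-02 14:31', '2015-02-02 14:31',
        '2015-02-02 14:31', '2015-02-02 14:31', '2015-01-02 14:31',
        '2015-02-03 14:31', '2015-01-03 14:31']),
    'coupon_id': [None, 1, 1, None, None, 2, 2, None],
})

table = cumulative_orders_per_user(orders, customers=[2, 4, 5])
print(table)
```

On the sample data this gives 1.5 and 2.0 for the 2015-01 cohort and 1.0 for the 2015-02 cohort, matching the table above.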