Conditions on cohort groups

Time: 2016-01-13 16:38:16

Tags: python pandas statistics

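I am trying to create a cohort analysis that shows how unique purchases develop over time, with one special condition: a cohort should consist only of users who used a discount coupon on their first order.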

My dataset looks like this:

╔════╦═════════════════╦══════════════╦═══════════╗
║ id ║ submitted_by_id ║ submitted_at ║ coupon_id ║
╠════╬═════════════════╬══════════════╬═══════════╣
║  1 ║               1 ║ 2015-01-01   ║           ║
║  2 ║               2 ║ 2015-01-02   ║         1 ║
║  3 ║               1 ║ 2015-02-02   ║         1 ║
║  4 ║               3 ║ 2015-02-02   ║           ║
║... ║             ... ║        ...   ║       ... ║
╚════╩═════════════════╩══════════════╩═══════════╝

I can create a cohort analysis over the whole dataset like this:

import numpy as np
import pandas as pd

data_set = list(data_set)
df = pd.DataFrame(data_set)
df['OrderPeriod'] = df.submitted_at.apply(lambda x: x.strftime('%Y-%m'))

df.set_index('submitted_by_id', inplace=True)
df['CohortGroup'] = df.groupby(level=0)['submitted_at'].min().apply(lambda x: x.strftime('%Y, %m'))
df.reset_index(inplace=True)

grouped = df.groupby(['CohortGroup', 'OrderPeriod'])

cohorts = grouped.agg({
    'submitted_by_id': pd.Series.nunique,
    'id': pd.Series.nunique,
})

cohorts.rename(columns={'id': 'TotalOrdersInPeriod', 'submitted_by_id': 'TotalUsers'}, inplace=True)

cohorts = cohorts.groupby(level=0).apply(cohort_period)
cohorts.reset_index(inplace=True)
cohorts.set_index(['CohortGroup', 'CohortPeriod'], inplace=True)

cohort_group_size = cohorts['TotalUsers'].groupby(level=0).first()
cohorts['TotalOrders'] = cohorts.groupby(level=0).TotalOrdersInPeriod.cumsum()

total_buys = cohorts['TotalOrders'].unstack(0).divide(cohort_group_size, axis=1)

This gives me the following cohorts:

CohortGroup     2015, 01    2015, 02
CohortPeriod                                                               
1               1           1
2               1.5

What I want is to somehow restrict my cohorts to those customers whose first order has a coupon_id.

My result table would then look like this:

CohortGroup     2015, 01    2015, 02
CohortPeriod                                                               
1               1           NaN
2               1

How can I do this?

Credit goes to http://www.gregreda.com/2015/08/23/cohort-analysis-with-python/
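The code above calls cohort_period without defining it. The following is a minimal sketch of such a helper, assuming it only numbers the rows of each cohort 1, 2, 3, ... in their existing order, as in the credited post. The toy cohorts frame and the use of group_keys=False are assumptions made for illustration, and this version works on a copy instead of modifying the group in place.

```python
import numpy as np
import pandas as pd


def cohort_period(df):
    """Number the rows of one cohort 1, 2, 3, ... in their existing order."""
    df = df.copy()  # assumption: work on a copy rather than mutate the group
    df['CohortPeriod'] = np.arange(len(df)) + 1
    return df


# Toy aggregate, indexed the same way as `cohorts` in the question (illustrative values).
cohorts = pd.DataFrame(
    {'TotalUsers': [2, 1, 1]},
    index=pd.MultiIndex.from_tuples(
        [('2015, 01', '2015-01'), ('2015, 01', '2015-02'), ('2015, 02', '2015-02')],
        names=['CohortGroup', 'OrderPeriod'],
    ),
)

# group_keys=False keeps the original index, so a later reset_index() does not
# run into a duplicated CohortGroup level on newer pandas versions.
numbered = cohorts.groupby(level=0, group_keys=False).apply(cohort_period)
print(numbered)
```

Each cohort restarts at period 1, which is what lets cohorts that began in different months be compared side by side.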

2 Answers:

Answer 0: (score: 0)

Starting from:

   id  submitted_by_id submitted_at  coupon_id
0   1                1   2015-01-01        NaN
1   2                2   2015-01-02          1
2   3                1   2015-02-02          1
3   4                3   2015-02-02        NaN

You can get the cohort groups and cohort periods as follows:

df['order_period'] = pd.to_datetime(df.submitted_at).dt.to_period('M')
df = df.rename(columns={'submitted_by_id': 'customer_id'}).drop(['id', 'submitted_at'], axis=1)
df['cohort_group'] = df.sort_values('order_period').groupby('customer_id')['order_period'].transform(lambda x: x.head(1))
df['cohort_period'] = df.groupby(['cohort_group', 'customer_id'])['order_period'].rank()

   customer_id  coupon_id order_period cohort_group  cohort_period
0            1        NaN      2015-01      2015-01              1
1            2          1      2015-01      2015-01              1
2            1          1      2015-02      2015-01              2
3            3        NaN      2015-02      2015-02              1

Now you can pick out the customers who used a coupon in their first cohort_period (there is only one in the sample data):

coupon_customers = df.groupby(['cohort_group', 'customer_id']).apply(lambda x: x.sort_values('cohort_period').iloc[0]).dropna(subset=['coupon_id']).customer_id.tolist()

[2]
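The same step can be sketched without groupby().apply(): sort the orders by date, keep each customer's earliest row, and check whether that row has a coupon. This is an alternative sketch, not the code used in this answer. It assumes the question's original column names (submitted_by_id, submitted_at, coupon_id), and the frame named orders is a hypothetical rebuild of the four sample rows.

```python
import numpy as np
import pandas as pd

# Hypothetical rebuild of the four sample rows from the question.
orders = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'submitted_by_id': [1, 2, 1, 3],
    'submitted_at': pd.to_datetime(
        ['2015-01-01', '2015-01-02', '2015-02-02', '2015-02-02']),
    'coupon_id': [np.nan, 1, 1, np.nan],
})

# Earliest order per customer: a stable sort by date, then keep the first row
# seen for each customer.
first_orders = (
    orders.sort_values('submitted_at', kind='mergesort')
          .drop_duplicates('submitted_by_id', keep='first')
)

# Customers whose first order carries a coupon.
coupon_customers = first_orders.loc[
    first_orders['coupon_id'].notna(), 'submitted_by_id'
].tolist()
print(coupon_customers)  # [2]
```

Customer 1 also used a coupon, but only on the second order, so that customer is not included.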

Next, build a frame that shows the customer_id for each cohort_group and cohort_period:

df = df.set_index(['cohort_group', 'cohort_period']).loc[:, 'customer_id'].to_frame()

                            customer_id
cohort_group cohort_period             
2015-01      1                        1
             1                        2
             2                        1
2015-02      1                        3

You can get the cohort count with the coupon customers included:

cohort_count = df.groupby(level=['cohort_group', 'cohort_period']).count().unstack('cohort_period')

cohort_period           1   2
cohort_group                 
2015-01                 2   1
2015-02                 1 NaN

or filter out coupon_customers to get the count without coupons:

cohort_count_no_coupons = df[~df.isin(coupon_customers)].groupby(level=['cohort_group', 'cohort_period']).count().unstack('cohort_period')

cohort_period           1   2
cohort_group                 
2015-01                 1   1
2015-02                 1 NaN
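The filter above works because indexing a DataFrame with a boolean DataFrame masks values instead of dropping rows: the coupon customer's customer_id becomes NaN, and count() then skips it. The sketch below illustrates this on a hypothetical rebuild of the indexed frame shown earlier (the name frame is an assumption), and contrasts it with a boolean Series, which does drop rows.

```python
import pandas as pd

# Hypothetical rebuild of the indexed frame from this answer.
frame = pd.DataFrame(
    {'customer_id': [1, 2, 1, 3]},
    index=pd.MultiIndex.from_tuples(
        [('2015-01', 1), ('2015-01', 1), ('2015-01', 2), ('2015-02', 1)],
        names=['cohort_group', 'cohort_period'],
    ),
)
coupon_customers = [2]

# Boolean DataFrame as indexer: non-matching cells become NaN, rows are kept.
masked = frame[~frame.isin(coupon_customers)]

# count() ignores NaN, so customer 2 no longer contributes to any cohort cell.
counts = masked.groupby(level=['cohort_group', 'cohort_period']).count()

# Boolean Series as indexer: rows are dropped. This variant keeps ONLY the
# customers whose first order used a coupon.
only_coupon = frame[frame['customer_id'].isin(coupon_customers)]

print(masked)
print(counts)
print(only_coupon)
```

The row-dropping form is the one to use when the cohorts should contain only coupon customers, as the question asks.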

Answer 1: (score: 0)

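Thanks to Stefan for pointing me in the right direction; this is what I ended up doing. I will mark Stefan's answer as accepted, since it is what led me to my solution.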

I extended the test dataset a bit, so it now looks like this:

coupon_id final_amount  id        submitted_at  submitted_by_id OrderPeriod
0        NaN          100   1 2015-01-01 14:30:00                1     2015-01
1          1          100   2 2015-01-02 14:31:00                2     2015-01
2          1          100   3 2015-02-02 14:31:00                1     2015-02
3        NaN          100   4 2015-02-02 14:31:00                3     2015-02
4        NaN          100   5 2015-02-02 14:31:00                2     2015-02
5          2          100   6 2015-01-02 14:31:00                4     2015-01
6          2          100   7 2015-02-03 14:31:00                5     2015-02
7        NaN          100   8 2015-01-03 14:31:00                2     2015-01

Here it is as a list of Python dictionaries:

import datetime
from decimal import Decimal

sample_data = [
        {'id': 1,
         'submitted_by_id': 1,
         'submitted_at': datetime.datetime(2015, 1, 1, 14, 30),
         'final_amount': Decimal('100'),
         'coupon_id': None,
         },
        {'id': 2,
         'submitted_by_id': 2,
         'submitted_at': datetime.datetime(2015, 1, 2, 14, 31),
         'final_amount': Decimal('100'),
         'coupon_id': 1,
         },
        {'id': 3,
         'submitted_by_id': 1,
         'submitted_at': datetime.datetime(2015, 2, 2, 14, 31),
         'final_amount': Decimal('100'),
         'coupon_id': 1,
         },
        {'id': 4,
         'submitted_by_id': 3,
         'submitted_at': datetime.datetime(2015, 2, 2, 14, 31),
         'final_amount': Decimal('100'),
         'coupon_id': None,
         },
        {'id': 5,
         'submitted_by_id': 2,
         'submitted_at': datetime.datetime(2015, 2, 2, 14, 31),
         'final_amount': Decimal('100'),
         'coupon_id': None,
         },
        {'id': 6,
         'submitted_by_id': 4,
         'submitted_at': datetime.datetime(2015, 1, 2, 14, 31),
         'final_amount': Decimal('100'),
         'coupon_id': 2,
         },
        {'id': 7,
         'submitted_by_id': 5,
         'submitted_at': datetime.datetime(2015, 2, 3, 14, 31),
         'final_amount': Decimal('100'),
         'coupon_id': 2,
         },
        {'id': 8,
         'submitted_by_id': 2,
         'submitted_at': datetime.datetime(2015, 1, 3, 14, 31),
         'final_amount': Decimal('100'),
         'coupon_id': None,
         },
    ]

Here is the solution:

df = pd.DataFrame(sample_data)
df['OrderPeriod'] = df.submitted_at.dt.to_period('M')

group = 'used_coupon'  # or 'did_not_use_coupon'; any other value keeps all customers

if group in ['used_coupon', 'did_not_use_coupon']:
    df2 = df.copy()

    df2['CohortGroup'] = df2.sort_values('OrderPeriod').\
        groupby('submitted_by_id')['OrderPeriod'].transform(lambda x: x.head(1))
    df2['CohortPeriod'] = df2.groupby(
        ['OrderPeriod', 'submitted_by_id']
    )['OrderPeriod'].rank()

    coupon_customers = df2.groupby(['CohortGroup', 'submitted_by_id']).apply(
            lambda x: x.sort_values('submitted_at').iloc[0]
    ).dropna(subset=['coupon_id']).submitted_by_id.tolist()

    # coupon_customers = [2, 4, 5]

    if group == 'used_coupon':
        # delete rows in the original dataframe where the customer is not
        # in the coupon_customers_list
        df = df[df['submitted_by_id'].isin(coupon_customers)]
    # group == 'did_not_use_coupon'
    else: 
        # delete rows in the original dataframe where the customer is
        # in the coupon_customers_list
        df = df[~df['submitted_by_id'].isin(coupon_customers)]

# From here it's just the same code as I originally used
df.set_index('submitted_by_id', inplace=True)
df['CohortGroup'] = df.groupby(level=0)['submitted_at'].min().apply(lambda x: x.to_period('M'))

df.reset_index(inplace=True)
print(df.head())

grouped = df.groupby(['CohortGroup', 'OrderPeriod'])

cohorts = grouped.agg({
    'submitted_by_id': pd.Series.nunique,
    'id': pd.Series.nunique,
})

cohorts.rename(columns={'id': 'TotalOrdersInPeriod', 'submitted_by_id': 'TotalUsers'}, inplace=True)

cohorts = cohorts.groupby(level=0).apply(cohort_period)

cohorts.reset_index(inplace=True)
cohorts.set_index(['CohortGroup', 'CohortPeriod'], inplace=True)

cohort_group_size = cohorts['TotalUsers'].groupby(level=0).first()

cohorts['TotalOrders'] = cohorts.groupby(level=0).TotalOrdersInPeriod.cumsum()

total_buys = cohorts['TotalOrders'].unstack(0).divide(cohort_group_size, axis=1)

Result for group = 'used_coupon':

CohortPeriod    1       2
CohortGroup     
2015-01         1.50    2.00
2015-02         1.00
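For reference, the whole pipeline can be condensed into one function. This is a sketch under a few assumptions rather than the exact code above: the function name total_buys_for is made up, periods are kept as 'YYYY-MM' strings instead of pandas Periods, the first order is found with drop_duplicates, and the cohort_period helper is replaced by cumcount. The orders frame rebuilds the extended sample data without final_amount, which the calculation does not use.

```python
import pandas as pd


def total_buys_for(orders, group='used_coupon'):
    """Cumulative orders per cohort member; columns are cohorts, rows are periods.

    group: 'used_coupon' keeps customers whose first order had a coupon,
           'did_not_use_coupon' keeps the rest, anything else keeps everyone.
    """
    orders = orders.sort_values('submitted_at', kind='mergesort').copy()

    # Customers whose earliest order carries a coupon.
    first = orders.drop_duplicates('submitted_by_id', keep='first')
    coupon_customers = first.loc[first['coupon_id'].notna(), 'submitted_by_id']
    in_coupon_group = orders['submitted_by_id'].isin(coupon_customers)
    if group == 'used_coupon':
        orders = orders[in_coupon_group].copy()
    elif group == 'did_not_use_coupon':
        orders = orders[~in_coupon_group].copy()

    # 'YYYY-MM' strings sort chronologically, so min() yields the first period.
    orders['OrderPeriod'] = orders['submitted_at'].dt.strftime('%Y-%m')
    orders['CohortGroup'] = (
        orders.groupby('submitted_by_id')['OrderPeriod'].transform('min'))

    cohorts = orders.groupby(['CohortGroup', 'OrderPeriod']).agg(
        TotalUsers=('submitted_by_id', 'nunique'),
        TotalOrdersInPeriod=('id', 'nunique'),
    )
    # Number the periods 1, 2, ... within each cohort.
    cohorts['CohortPeriod'] = cohorts.groupby(level='CohortGroup').cumcount() + 1
    cohorts = cohorts.reset_index().set_index(['CohortGroup', 'CohortPeriod'])

    cohort_group_size = cohorts['TotalUsers'].groupby(level='CohortGroup').first()
    cohorts['TotalOrders'] = (
        cohorts.groupby(level='CohortGroup')['TotalOrdersInPeriod'].cumsum())
    return cohorts['TotalOrders'].unstack('CohortGroup').divide(
        cohort_group_size, axis=1)


# Extended sample data from this answer (final_amount omitted).
orders = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8],
    'submitted_by_id': [1, 2, 1, 3, 2, 4, 5, 2],
    'submitted_at': pd.to_datetime([
        '2015-01-01 14:30', '2015-01-02 14:31', '2015-02-02 14:31',
        '2015-02-02 14:31', '2015-02-02 14:31', '2015-01-02 14:31',
        '2015-02-03 14:31', '2015-01-03 14:31']),
    'coupon_id': [None, 1, 1, None, None, 2, 2, None],
})

used = total_buys_for(orders, group='used_coupon')
not_used = total_buys_for(orders, group='did_not_use_coupon')
print(used)
print(not_used)
```

With group='used_coupon' the 2015-01 cohort comes out as 1.5 and 2.0 and the 2015-02 cohort as 1.0, matching the table above; here the cohorts are the columns, so the layout is transposed relative to that table.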