Question

I have been working on a cohort analysis for a SaaS company, following Greg Rada's example, and I'm having some trouble working out cohort retention.

Right now, I have my DataFrame set up as:

map

What I have done so far is:

import numpy as np
from pandas import DataFrame, Series
import sys
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl

pd.set_option('display.max_columns', 50)
mpl.rcParams['lines.linewidth'] = 2

%matplotlib inline

df = DataFrame({
    'Customer_ID': ['QWT19CLG2QQ', 'URL99FXP9VV', 'EJO15CUP4TO', 'ZDJ11ZPO5LX', 'QQW13PUF3HL', 'SIJ98IQH0GW', 'EBH36UPB2XR', 'BED40SMW5NQ', 'NYW11ZKC8WK', 'YLV60ERT0VT'],
    'Plan_Start_Date': ['2014-01-30', '2014-03-04', '2014-01-27', '2014-02-10', '2014-01-02', '2014-04-15', '2014-05-28', '2014-05-03', '2014-02-09', '2014-06-09'],
    'Plan_Cancel_Date': ['2014-09-19', '2014-10-29', '2015-01-19', '2015-01-21', '2014-08-19', '2014-08-26', '2014-10-01', '2015-01-03', '2015-01-23', '2015-09-02'],
    'Monthly_Pay': [14.99, 14.99, 14.99, 14.99, 29.99, 29.99, 29.99, 74.99, 74.99, 74.99],
    'Plan_ID': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
})

I have tried to create a retention column from Plan_Start_Date, similar to how Greg built his:

# Convert the dates from objects to datetime
df.Plan_Start_Date = pd.to_datetime(df.Plan_Start_Date)
df.Plan_Cancel_Date = pd.to_datetime(df.Plan_Cancel_Date)

# Create a cohort based on the start date's month and year
df['Cohort'] = df.Plan_Start_Date.map(lambda x: x.strftime('%Y-%m'))

# Calculate the total lifetime of each customer, in months
df['Lifetime'] = ((df.Plan_Cancel_Date.dt.year - df.Plan_Start_Date.dt.year) * 12
                  + (df.Plan_Cancel_Date.dt.month - df.Plan_Start_Date.dt.month))

# Calculate the total revenue of each customer
df['Lifetime_Revenue'] = df['Monthly_Pay'] * df['Lifetime']
dfsort = df.sort_values(['Cohort'])
dfsort.head(10)

But this just repeats the value of ['Cohort'] across my dataset.

In turn, when I try to create an index hierarchy to map retention via:

dfsort['Retention'] = (dfsort.groupby(level=0)['Plan_Start_Date'].min()
                             .apply(lambda x: x.strftime('%Y-%m')))

Instead of looking like:

grouped = dfsort.groupby(['Cohort', 'Retention'])
cohorts = grouped.agg({'Customer_ID': pd.Series.nunique})
cohorts.head()

It looks like:

                  Total_Users 
Cohort  Retention
-------------------------------
2014-01  2014-01        3
         2014-02        3
         2014-03        3
         ...
         2015-01        1
2014-02  2014-01        2
         2014-02        2

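I know that I'm grouping incorrectly and creating the retention column incorrectly, but I'm at a loss on how to fix it. Can anyone help a newbie out?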

Answer 1

You can use a MultiIndex and then group by the two columns.

dfsort = dfsort.set_index(['Cohort', 'Retention'])
dfsort.groupby(['Cohort', 'Retention']).count()

However, in your data you only have one 'Retention' date per cohort, which is why you are not seeing different retention dates.

Cohort  Retention
---------------------
2014-01    2014-01  
           2014-01  
           2014-01  

2014-02    2014-02
           2014-02
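
One way to see why this happens, as a small sketch: the three-row frame `toy` below is made up for illustration and is not the asker's data. Grouping with `level=0` groups on the row index, so each row is its own group and the "minimum" start date is just that row's own start date.

```python
import pandas as pd

# Hypothetical three-row frame, only for illustration
toy = pd.DataFrame({
    'Customer_ID': ['A', 'B', 'C'],
    'Plan_Start_Date': pd.to_datetime(['2014-01-30', '2014-01-02', '2014-02-10']),
})
toy['Cohort'] = toy.Plan_Start_Date.dt.strftime('%Y-%m')

# groupby(level=0) groups on the default integer index: one group per row
toy['Retention'] = (toy.groupby(level=0)['Plan_Start_Date'].min()
                       .dt.strftime('%Y-%m'))

# Every row ends up with Retention equal to its own Cohort
same = (toy['Retention'] == toy['Cohort']).all()
print(toy[['Cohort', 'Retention']])
print(same)
```

So the Retention column carries no information beyond Cohort, and grouping on both columns gives one group per cohort.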

Maybe you want to take another look at how you are computing the cohorts and the retention.
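
As a rough sketch of one possible approach (not necessarily what Greg's example does): expand each customer into one row per month in which they were active, then count unique customers per (Cohort, Retention) pair. The snippet uses only the first four customers from the question, and it assumes a customer counts as active from the start month through the cancel month, inclusive. The names `active`, `cohorts` and `rates` are just illustrative.

```python
import pandas as pd

# First four customers from the question's data
df = pd.DataFrame({
    'Customer_ID': ['QWT19CLG2QQ', 'URL99FXP9VV', 'EJO15CUP4TO', 'ZDJ11ZPO5LX'],
    'Plan_Start_Date': ['2014-01-30', '2014-03-04', '2014-01-27', '2014-02-10'],
    'Plan_Cancel_Date': ['2014-09-19', '2014-10-29', '2015-01-19', '2015-01-21'],
})

# Work with monthly periods instead of formatted strings
start = pd.to_datetime(df.Plan_Start_Date).dt.to_period('M')
cancel = pd.to_datetime(df.Plan_Cancel_Date).dt.to_period('M')

df['Cohort'] = start.astype(str)

# Assumption: active from the start month through the cancel month, inclusive
df['Retention'] = [
    pd.period_range(s, c, freq='M').astype(str).tolist()
    for s, c in zip(start, cancel)
]

# One row per customer per active month
active = df.explode('Retention')

# Unique customers per cohort and active month
cohorts = (active.groupby(['Cohort', 'Retention'])['Customer_ID']
                 .nunique()
                 .rename('Total_Users'))

# Share of each cohort still active, relative to its first month
rates = cohorts / cohorts.groupby(level='Cohort').transform('first')

print(cohorts.head())
```

With this layout each cohort has many Retention months, so grouping on both columns produces the hierarchical table the question was after.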
