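I've been doing a cohort analysis for a SaaS company, following Greg Rada's example, and I'm having some trouble working out cohort retention.
Here is how my dataframe is set up, and what I've done so far: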
import numpy as np
from pandas import DataFrame, Series
import sys
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl

# Use the fully qualified option name; the bare 'max_columns' is ambiguous in newer pandas
pd.set_option('display.max_columns', 50)
mpl.rcParams['lines.linewidth'] = 2
%matplotlib inline

df = DataFrame({
    'Customer_ID': ['QWT19CLG2QQ', 'URL99FXP9VV', 'EJO15CUP4TO', 'ZDJ11ZPO5LX', 'QQW13PUF3HL', 'SIJ98IQH0GW', 'EBH36UPB2XR', 'BED40SMW5NQ', 'NYW11ZKC8WK', 'YLV60ERT0VT'],
    'Plan_Start_Date': ['2014-01-30', '2014-03-04', '2014-01-27', '2014-02-10', '2014-01-02', '2014-04-15', '2014-05-28', '2014-05-03', '2014-02-09', '2014-06-09'],
    'Plan_Cancel_Date': ['2014-09-19', '2014-10-29', '2015-01-19', '2015-01-21', '2014-08-19', '2014-08-26', '2014-10-01', '2015-01-03', '2015-01-23', '2015-09-02'],
    'Monthly_Pay': [14.99, 14.99, 14.99, 14.99, 29.99, 29.99, 29.99, 74.99, 74.99, 74.99],
    'Plan_ID': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
})
I've tried creating the retention column from Plan_Start_Date, similar to the way Greg builds his:
# Convert the dates from objects to datetime
df.Plan_Start_Date = pd.to_datetime(df.Plan_Start_Date)
df.Plan_Cancel_Date = pd.to_datetime(df.Plan_Cancel_Date)
# Create a cohort based on the start date's month and year
df['Cohort'] = df.Plan_Start_Date.map(lambda x: x.strftime('%Y-%m'))
# Calculate the total lifetime of each customer, in months
df['Lifetime'] = ((df.Plan_Cancel_Date.dt.year - df.Plan_Start_Date.dt.year) * 12
                  + (df.Plan_Cancel_Date.dt.month - df.Plan_Start_Date.dt.month))
# Calculate the total revenue of each customer
df['Lifetime_Revenue'] = df['Monthly_Pay'] * df['Lifetime']
dfsort = df.sort_values(['Cohort'])
dfsort.head(10)
But this only repeats the value of ['Cohort'] in my dataset.
In turn, when I try to create the index hierarchy to map retention with:
dfsort['Retention'] = (dfsort.groupby(level=0)['Plan_Start_Date'].min()
                       .apply(lambda x: x.strftime('%Y-%m')))
and then group on both columns:
grouped = dfsort.groupby(['Cohort', 'Retention'])
cohorts = grouped.agg({'Customer_ID': pd.Series.nunique})
cohorts.head()
the result does not look like what I'm after, which is:
                   Total_Users
Cohort   Retention
-------------------------------
2014-01  2014-01   3
         2014-02   3
         2014-03   3
         ...
         2015-01   1
2014-02  2014-01   2
         2014-02   2
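I know my grouping is wrong, and so is the way I create the retention column, but I'm at a loss as to how to fix it. Can anyone help a newbie out?
Answer 0 (score: 0)
You can use a MultiIndex and then group on the two columns.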
dfsort = dfsort.set_index(['Cohort', 'Retention'])
dfsort.groupby(['Cohort', 'Retention']).count()
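As a small illustration of grouping on a two-level index, the sketch below uses a made-up three-row frame (the customer IDs and dates are hypothetical, not the asker's data) and counts unique customers per (Cohort, Retention) pair. It groups by `level=` rather than by column name, which is just one way to address index levels.

```python
import pandas as pd

# Hypothetical toy data: two customers in the 2014-01 cohort, one in 2014-02.
toy = pd.DataFrame({
    'Customer_ID': ['A', 'B', 'C'],
    'Cohort': ['2014-01', '2014-01', '2014-02'],
    'Retention': ['2014-01', '2014-02', '2014-02'],
})

# Move both columns into a two-level index, then group on those levels.
indexed = toy.set_index(['Cohort', 'Retention'])
counts = indexed.groupby(level=['Cohort', 'Retention'])['Customer_ID'].nunique()
print(counts)
```

Each distinct (Cohort, Retention) pair becomes one row of the result, so the output only shows several retention months per cohort if the Retention column actually contains several distinct values per cohort.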
However, your data has only one 'Retention' date per cohort, which is why you don't see different retention dates.
Cohort   Retention
---------------------
2014-01  2014-01
         2014-01
         2014-01
2014-02  2014-02
         2014-02
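One likely reason for this, sketched below on a cut-down copy of the question's data: `groupby(level=0)` groups on the row index, and with a unique index every group contains exactly one row, so the "minimum" start date is just that row's own start date. The variable names `same` and `per_cohort` are only for illustration.

```python
import pandas as pd

# First four rows of the question's data, start dates only.
df = pd.DataFrame({
    'Customer_ID': ['QWT19CLG2QQ', 'URL99FXP9VV', 'EJO15CUP4TO', 'ZDJ11ZPO5LX'],
    'Plan_Start_Date': pd.to_datetime(
        ['2014-01-30', '2014-03-04', '2014-01-27', '2014-02-10']),
})
df['Cohort'] = df['Plan_Start_Date'].dt.strftime('%Y-%m')

# Grouping on index level 0 yields one group per row when the index is unique,
# so min() returns each row's own Plan_Start_Date.
df['Retention'] = (df.groupby(level=0)['Plan_Start_Date'].min()
                     .dt.strftime('%Y-%m'))

same = (df['Retention'] == df['Cohort']).all()             # Retention mirrors Cohort
per_cohort = df.groupby('Cohort')['Retention'].nunique()   # one value per cohort
print(same)
print(per_cohort)
```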
You may want to revisit how you compute the cohort and the retention.
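One possible way to do that, offered as a rough sketch rather than a definitive fix: expand each customer into one row per month in which the plan was active, then count unique customers per (Cohort, Retention) pair. The three-row sample is made up, the names `long_df` and `cohorts` are hypothetical, and the sketch assumes the cancellation month still counts as an active month.

```python
import pandas as pd

# Made-up sample in the same shape as the question's data.
df = pd.DataFrame({
    'Customer_ID': ['A', 'B', 'C'],
    'Plan_Start_Date': pd.to_datetime(['2014-01-30', '2014-01-02', '2014-02-10']),
    'Plan_Cancel_Date': pd.to_datetime(['2014-03-19', '2014-02-19', '2014-04-21']),
})

# Cohort = month in which the plan started.
df['Cohort'] = df['Plan_Start_Date'].dt.strftime('%Y-%m')

# For each customer, list every month from start to cancellation
# (assumption: the cancellation month is included as active).
df['Retention'] = [
    list(pd.period_range(start, end, freq='M').strftime('%Y-%m'))
    for start, end in zip(df['Plan_Start_Date'], df['Plan_Cancel_Date'])
]

# One row per customer per active month.
long_df = df.explode('Retention')

# Unique active customers per cohort and month.
cohorts = (long_df.groupby(['Cohort', 'Retention'])['Customer_ID']
                  .nunique()
                  .rename('Total_Users'))
print(cohorts)
```

With the retention months spelled out per customer, each cohort gets one row for every month in which at least one of its customers was still active, which is the shape of the table the question is after.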