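I'm new to R programming and I want to work out whether a user was active in the month they registered. I have two tables, one called workouts and the other registrations. Users are grouped into cohorts by the cohortId column.
What I want is to compute the difference between the cohort dates in registrations and workouts, to see whether a user was active in the month they first registered.
Here is what I have done so far: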
week_difference <- function(end_date, start_date){
  as.integer(difftime(head(strptime(end_date, format = "%Y-%m-%d"), 1),
                      tail(strptime(start_date, format = "%Y-%m-%d"), 1), units = "weeks"), 0)
}
retention_week <- funnel_workout %>% group_by(userId) %>% select(userId, cohortId) %>%
  mutate(week_number = if(!is.na(cohortId)){week_difference(funnel_registration$cohortId, funnel_workout$cohortId)}else{print(NA)})
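The problem is that week_number is always 4, so it does not actually compute the difference between the dates.
Thanks in advance for any help!
Edit:
Here is the registrations df: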
userId cohortId funnelStep
8991eb20 2017-10-23 registration
34ed55c1 2017-08-24 registration
And the workouts df:
userId cohortId funnelStep week_number
8991eb20 2017-10-23 completeWorkout 4
34ed55c1 2017-10-18 completeWorkout 4
Answer 0 (score: 1)
As KppatelPatel suggested, I like lubridate for this. Libraries and data:
library(lubridate)
library(dplyr)
registry <- read.table(text = 'userId cohortId funnelStep
8991eb20 2017-10-23 registration
34ed55c1 2017-08-24 registration', header = TRUE)
workouts <- read.table(text = 'userId cohortId funnelStep week_number
8991eb20 2017-10-23 completeWorkout 4
34ed55c1 2017-10-18 completeWorkout 4', header = TRUE)
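Your own data probably already stores these dates as Date objects. The example does not: read.table reads the dates in as factors in R versions before 4.0 and as character strings from 4.0 onward. If yours are not Dates yet: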
registry$cohortId <- as.Date(registry$cohortId)
workouts$cohortId <- as.Date(workouts$cohortId)
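To illustrate why the conversion matters, here is a minimal base R sketch. The vectors registered and workout are stand-ins for the two cohortId columns, not objects from the question. Once both are Dates, subtraction and difftime work element by element, whereas head() and tail() in the question's week_difference reduce each column to a single value, which is why every row got the same number.

```r
# Stand-in vectors for the two cohortId columns (assumed to be text at first)
registered <- as.Date(c("2017-10-23", "2017-08-24"), format = "%Y-%m-%d")
workout    <- as.Date(c("2017-10-23", "2017-10-18"), format = "%Y-%m-%d")

# Both are now Date objects
class(registered)

# Subtracting Dates gives a difftime; here the second user's gap in days
gap <- workout[2] - registered[2]
as.numeric(gap, units = "days")

# Vectorised difference in whole weeks, one value per row
weeks_between <- floor(as.numeric(difftime(workout, registered, units = "weeks")))
weeks_between
```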
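Create a function that checks whether the time between the two dates is less than a month. The steps in the function are:
1. Join the two data frames by userId.
2. Flag whether the workout date is less than one month after the registration date.
3. Subset the data frame to the userId, cohortId.x, cohortId.y and active.1st.month columns and rename them to be more descriptive.
check_activity <- function(reg.df, work.df){
  # join on the function arguments, not on the global data frames
  reg.work <- inner_join(reg.df, work.df, by = "userId")
  # cohortId.x is the registration date, cohortId.y the workout date
  reg.work$active.1st.month <-
    (reg.work$cohortId.y - reg.work$cohortId.x) < as.duration(months(1))
  reg.work <- reg.work[,c("userId", "cohortId.x", "cohortId.y", "active.1st.month")]
  names(reg.work) <- c("user", "registered", "workout", "active.1st.month")
  return(reg.work)
}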
> check_activity(registry, workouts)
user registered workout active.1st.month
1 8991eb20 2017-10-23 2017-10-23 TRUE
2 34ed55c1 2017-08-24 2017-10-18 FALSE
Of course, you can change months(1) to any length of time you like (e.g. weeks(4)).
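A note on what these thresholds mean, as a rough sketch rather than part of the function above: as.duration(months(1)) is a fixed average month length, not a calendar month. If a calendar-based rule is closer to what you need, two possible alternatives are shown below. The vectors registered and workout are illustrative stand-ins for the joined columns.

```r
library(lubridate)

# Fixed-length thresholds, expressed in days
as.numeric(as.duration(months(1))) / 86400  # average month length
as.numeric(as.duration(weeks(4))) / 86400   # exactly four weeks

# Illustrative stand-ins for the registration and workout dates
registered <- as.Date(c("2017-10-23", "2017-08-24"))
workout    <- as.Date(c("2017-10-23", "2017-10-18"))

# Alternative 1: workout falls within one calendar month after registering
within_month <- workout < registered %m+% months(1)

# Alternative 2: workout falls in the same calendar month as the registration
same_month <- floor_date(workout, "month") == floor_date(registered, "month")

data.frame(registered, workout, within_month, same_month)
```

Which rule is right depends on how "active in the month of registration" is defined for your cohorts.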
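Edit:
Following your comment, I think the simplest approach is to return the month in which the user was first active (first completed a workout). With this new fake data: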
registry <- read.table(text = 'userId cohortId funnelStep
8991eb20 2017-10-23 registration
example1 2017-10-23 registration
example2 2017-10-23 registration
34ed55c1 2017-08-24 registration', header = TRUE)
workouts <- read.table(text = 'userId cohortId funnelStep week_number
8991eb20 2017-10-23 completeWorkout 4
example1 2017-10-28 completeWorkout 4
example2 2017-11-28 completeWorkout 4
34ed55c1 2017-12-18 completeWorkout 4', header = TRUE)
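After converting cohortId to Date again as above, change the function slightly so that it outputs the number of the month in which a given user was first active (completed a workout).
check_active_month <- function(reg.df, work.df){
  # join on the function arguments, not on the global data frames
  reg.work <- inner_join(reg.df, work.df, by = "userId")
  # use the joined columns so each workout is compared with the same user's registration
  reg.work$active.month <-
    1 + floor(as.duration(reg.work$cohortId.y - reg.work$cohortId.x) / as.duration(months(1)))
  reg.work <- reg.work[,c("userId", "cohortId.x", "cohortId.y", "active.month")]
  names(reg.work) <- c("user", "registered", "workout", "active.month")
  return(reg.work)
}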
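As a possible alternative to dividing by an average month length, lubridate can count whole calendar months between two dates by integer-dividing an interval by a period. The sketch below uses stand-in vectors that mirror the fake data and is not a drop-in replacement for the function.

```r
library(lubridate)

# Stand-ins mirroring the fake registration and workout dates
registered <- as.Date(c("2017-10-23", "2017-10-23", "2017-10-23", "2017-08-24"))
workout    <- as.Date(c("2017-10-23", "2017-10-28", "2017-11-28", "2017-12-18"))

# Whole calendar months elapsed between the dates, plus one so that
# the month of registration counts as month 1
active.month <- 1 + interval(registered, workout) %/% months(1)
active.month
```

For these dates both approaches agree. They can differ near month boundaries, because calendar months are not all the same length.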
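You can now count the users for each active.month to get the number of users who were first active in the first month after registering, in the second month, and so on:
active.months.df <- check_active_month(registry, workouts)
active.months.df %>%
  group_by(active.month) %>%
  summarise(n.users.active.month = length(active.month))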
# A tibble: 3 x 2
active.month n.users.active.month
<dbl> <int>
1 1 2
2 2 1
3 4 1
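The fake data has exactly one workout per user. If real users have several workouts, one way to handle that is to reduce the workouts to each user's earliest one before joining. The sketch below assumes that situation. The user ids u1 and u2, the column names registered, workout and first.workout, and the dates are all made up for illustration.

```r
library(dplyr)
library(lubridate)

# Hypothetical data: u1 has two workouts, u2 has one
registry <- data.frame(
  userId     = c("u1", "u2"),
  registered = as.Date(c("2017-10-23", "2017-08-24"))
)
workouts <- data.frame(
  userId  = c("u1", "u1", "u2"),
  workout = as.Date(c("2017-11-28", "2017-10-28", "2017-12-18"))
)

# Keep only the earliest workout per user
first.workouts <- workouts %>%
  group_by(userId) %>%
  summarise(first.workout = min(workout))

# Month of first activity per user, then the number of users per month
active.months <- registry %>%
  inner_join(first.workouts, by = "userId") %>%
  mutate(active.month = 1 + floor(as.duration(first.workout - registered) /
                                    as.duration(months(1)))) %>%
  count(active.month, name = "n.users.active.month")

active.months
```

Note that inner_join drops users who never completed a workout, so they do not appear in any month's count.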