r

Time: 2017-11-24 21:53:19

Tags: r shiny dplyr

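I'm new to R programming, and I want to work out whether a user was active in the month they registered. I have two tables: one called workouts and another called registrations. Users are grouped into cohorts by the cohortId column. What I want is to compute the difference between the cohort dates in registrations and workouts, to see whether a user was active in the month they first registered.

Here is what I have done so far: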

week_difference <- function(end_date, start_date){
    as.integer(difftime(head(strptime(end_date, format = "%Y-%m-%d"), 1),
               tail(strptime(start_date, format = "%Y-%m-%d"),1), units = "weeks"),0)
}


retention_week <- funnel_workout %>% group_by(userId) %>%  select(userId, cohortId) %>% 
  mutate(week_number = if(!is.na(cohortId)){week_difference(funnel_registration$cohortId, funnel_workout$cohortId)}else{print(NA)})

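The problem is that week_number is always 4 and does not actually reflect the difference between the dates.

Thanks in advance for any help!

Edit:

Here is the registrations df: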

userId   cohortId   funnelStep
8991eb20 2017-10-23 registration
34ed55c1 2017-08-24 registration

And the workouts df:

userId   cohortId      funnelStep week_number
8991eb20 2017-10-23 completeWorkout           4
34ed55c1 2017-10-18 completeWorkout           4

1 answer:

Answer 0 (score: 1)

As KppatelPatel suggested, I like lubridate. Libraries and data:

library(lubridate)
library(dplyr)

registry <- read.table(text = 'userId   cohortId   funnelStep
8991eb20 2017-10-23 registration
34ed55c1 2017-08-24 registration', header = TRUE)

workouts <- read.table(text = 'userId   cohortId      funnelStep week_number
8991eb20 2017-10-23 completeWorkout           4
34ed55c1 2017-10-18 completeWorkout           4', header = TRUE)

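Your data may already store the cohort dates as Date objects. The example dates here do not: read.table reads them in as factors by default (as character strings from R 4.0 onwards). If yours are not dates yet, convert them: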

registry$cohortId <- as.Date(registry$cohortId)
workouts$cohortId <- as.Date(workouts$cohortId)
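A quick way to confirm the conversion worked is to inspect the column class before and after. This is a minimal sketch using the same example registrations and base R only; the variable name `gap` is just for illustration.

```r
registry <- read.table(text = 'userId   cohortId   funnelStep
8991eb20 2017-10-23 registration
34ed55c1 2017-08-24 registration', header = TRUE)

# Before conversion the column is text ("character" in R >= 4.0, "factor" before that)
class(registry$cohortId)

registry$cohortId <- as.Date(registry$cohortId)

# After conversion the column is a Date, so subtraction gives a difftime in days
class(registry$cohortId)
gap <- registry$cohortId[1] - registry$cohortId[2]
gap
```

Once both tables hold Date columns, differences between them are plain day counts that can be compared against a threshold.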

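Create a function that checks whether the time between the two dates is less than one month. The function steps are:

  • Join the registry and workouts tables together on the userId column
  • Create a new column called 'active.1st.month' and assign it the result of a logical test that checks whether the difference between the registry cohortId and the workout cohortId is less than one month
  • Build a data frame containing only the userId, cohortId.x, cohortId.y and active.1st.month columns, and rename them to be more descriptive
  • Return the nicely named data frame
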
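check_activity <- function(reg.df, work.df){
  # Join the function arguments (not the global tables) on userId;
  # cohortId.x is the registration date, cohortId.y the workout date
  reg.work <- inner_join(reg.df, work.df, by = "userId")
  # TRUE when the workout happened less than one (average) month after registration
  reg.work$active.1st.month <- 
    as.duration(reg.work$cohortId.y - reg.work$cohortId.x) < as.duration(months(1))
  reg.work <- reg.work[,c("userId", "cohortId.x", "cohortId.y", "active.1st.month")]
  names(reg.work) <- c("user", "registered", "workout", "active.1st.month")
  return(reg.work)
}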

> check_activity(registry, workouts)
      user registered    workout active.1st.month
1 8991eb20 2017-10-23 2017-10-23             TRUE
2 34ed55c1 2017-08-24 2017-10-18            FALSE

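Of course, you can change months(1) to whatever length of time you like (e.g. weeks(4)).

Edit

Based on your comment, I think the simplest approach is to return the month in which the user was first active (first completed a workout). With this new fake data: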
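If what you actually want is the week_number from the question, the same join can feed a per-row week difference. The following is only a sketch under a few assumptions: the dates are already Date objects, the suffixes `.reg` and `.work` and the name `weeks.df` are chosen here for illustration, and weeks are counted as completed 7-day blocks.

```r
library(dplyr)

registry <- data.frame(
  userId   = c("8991eb20", "34ed55c1"),
  cohortId = as.Date(c("2017-10-23", "2017-08-24"))
)
workouts <- data.frame(
  userId   = c("8991eb20", "34ed55c1"),
  cohortId = as.Date(c("2017-10-23", "2017-10-18"))
)

# Join first so each row pairs one user's registration with that user's workout,
# then take the difference row by row instead of on whole columns
weeks.df <- inner_join(registry, workouts, by = "userId",
                       suffix = c(".reg", ".work")) %>%
  mutate(week_number = as.integer(floor(as.numeric(
    difftime(cohortId.work, cohortId.reg, units = "weeks")))))

weeks.df
```

Because the difference is taken on the joined rows, each user gets their own value rather than a single number recycled across the whole table.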

registry <- read.table(text = 'userId   cohortId   funnelStep
8991eb20 2017-10-23 registration
example1 2017-10-23 registration
example2 2017-10-23 registration
34ed55c1 2017-08-24 registration', header = TRUE)

workouts <- read.table(text = 'userId   cohortId      funnelStep week_number
8991eb20 2017-10-23 completeWorkout           4
example1 2017-10-28 completeWorkout           4
example2 2017-11-28 completeWorkout           4
34ed55c1 2017-12-18 completeWorkout           4', header = TRUE)

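Now change the function slightly so that it outputs the number of the month in which a given user was first active (completed a workout).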

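check_active_month <- function(reg.df, work.df){
  # Join the function arguments on userId; cohortId.x = registered, cohortId.y = workout
  reg.work <- inner_join(reg.df, work.df, by = "userId")
  # Use the joined columns so rows stay aligned per user; as.Date() guards
  # against dates that were read in as text or factors
  days.since.reg <- as.Date(reg.work$cohortId.y) - as.Date(reg.work$cohortId.x)
  # Month 1 = first (average-length) month after registration, month 2 = the next, ...
  reg.work$active.month <- 
    1 + floor(as.duration(days.since.reg) / as.duration(months(1)))
  reg.work <- reg.work[,c("userId", "cohortId.x", "cohortId.y", "active.month")]
  names(reg.work) <- c("user", "registered", "workout", "active.month")
  return(reg.work)
}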

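Now you can count the users that share a given active.month, which gives the number of users who were first active in the first month after registering, in the second month, and so on:

active.months.df <- check_active_month(registry, workouts)
active.months.df %>% 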
  group_by(active.month) %>%
  summarise(n.users.active.month = length(active.month))

# A tibble: 3 x 2
  active.month n.users.active.month
         <dbl>                <int>
1            1                    2
2            2                    1
3            4                    1
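As a possible shorthand for the group_by/summarise step, dplyr's count() does the grouping and counting in one call. This is a sketch only: the data frame below is typed in by hand as a stand-in for the function's output, and `month.counts` is a name invented for the example.

```r
library(dplyr)

# Hand-typed stand-in for the output of check_active_month() (illustrative values)
active.months.df <- data.frame(
  user         = c("8991eb20", "example1", "example2", "34ed55c1"),
  active.month = c(1, 1, 2, 4)
)

# count() groups by active.month and tallies the rows in each group
month.counts <- count(active.months.df, active.month,
                      name = "n.users.active.month")
month.counts
```

Months in which nobody was first active (month 3 here) simply do not appear in the result.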