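I have a two-year dataset of user text messages covering 2015 and 2016 (135,000). I am trying to identify the new users of this program for February 2016, based on subscriber_id and entity == "subscribe-online".
The wrinkle is that a new user is one whose subscriber_id has not occurred in the data during the previous 12 months. So, for example, if I have the following sample data: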
created              subscriber_id  cellnum     entity            message  msgtxt
2015-21-01 14:03:00  15855          7788826943  tip               100      end
2015-07-12 14:03:00  15839          7788815940  tip               24       tip 24
2015-08-12 14:03:00  15839          7788815940  stop              99       stop
2016-01-01 14:05:00  15800          2508816941  tip               25       tip 25
2016-02-01 16:05:00  15800          2508816941  tip               26       tip 26
2016-03-01 14:05:00  15800          2508816941  tip               27       tip 27
2016-01-02 14:03:00  15855          7788826943  subscribe-online  1        msg 1
2016-01-02 14:03:00  15839          7788815940  subscribe-online  1        msg 1
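Both 15855 and 15839 subscribed on February 1, 2016 (the dates above are written year-day-month). I want to be able to designate 15855 as a new user, because the last occurrence of subscriber_id 15855 was on January 21, 2015, which is more than 12 months earlier. I want to designate 15839 as a repeat user, because their last occurrence was on December 8, 2015, which is less than 12 months earlier.
The created (date) field is in POSIXct format. I have been trying to understand loops and whether one could be used here. Any help would be greatly appreciated. Thanks.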
Answer 0 (score: 0)
Here is a potential solution using dplyr:
library(dplyr)
# Sample data from the question; created is written as year-day-month
df <- data.frame(
  created = c("2015-21-01 14:03:00", "2015-07-12 14:03:00", "2015-08-12 14:03:00",
              "2016-01-01 14:05:00", "2016-02-01 16:05:00", "2016-03-01 14:05:00",
              "2016-01-02 14:03:00", "2016-01-02 14:03:00"),
  subscriber_id = c(15855, 15839, 15839, 15800, 15800, 15800, 15855, 15839),
  cellnum = c(7788826943, 7788815940, 7788815940, 2508816941, 2508816941, 2508816941, 7788826943, 7788815940),
  entity = c("tip", "tip", "stop", "tip", "tip", "tip", "subscribe-online", "subscribe-online"),
  message = c("100", "24", "99", "25", "26", "27", "1", "1"),
  msgtxt = c("end", "tip 24", "stop", "tip 25 ", "tip 26 ", "tip 27 ", "msg 1", "msg 1"),
  stringsAsFactors = FALSE
)
# Parse with day before month, so "2015-21-01" becomes January 21, 2015
df$created <- as.POSIXct(df$created, format = "%Y-%d-%m %H:%M:%S")
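df <- df %>%
  arrange(subscriber_id, created) %>%
  group_by(subscriber_id) %>%
  # gap_days: days since this subscriber's previous record (NA on their first record)
  mutate(gap_days = as.numeric(difftime(created, lag(created), units = "days")),
         # new_user: TRUE = new, FALSE = repeat, NA = not a subscribe-online row
         new_user = if_else(entity == "subscribe-online", is.na(gap_days) | gap_days > 365, NA)) %>%
  ungroup()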
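
The question asks specifically about February 2016, so it may help to see the same idea restricted to that month. The following is a minimal base R sketch (no dplyr, no loop), not a definitive implementation. It makes a few assumptions that are not in the original post: "12 months" is approximated as 365 days, a subscriber with no earlier record at all counts as new, timestamps are read as UTC, and the names gap_days, prev_secs and feb_subs are made up for illustration.

```r
# Sample data from the question; dates are written year-day-month.
df <- data.frame(
  created = c("2015-21-01 14:03:00", "2015-07-12 14:03:00", "2015-08-12 14:03:00",
              "2016-01-01 14:05:00", "2016-02-01 16:05:00", "2016-03-01 14:05:00",
              "2016-01-02 14:03:00", "2016-01-02 14:03:00"),
  subscriber_id = c(15855, 15839, 15839, 15800, 15800, 15800, 15855, 15839),
  entity = c("tip", "tip", "stop", "tip", "tip", "tip",
             "subscribe-online", "subscribe-online"),
  stringsAsFactors = FALSE
)
# Assumption: timestamps are treated as UTC to avoid daylight-saving offsets.
df$created <- as.POSIXct(df$created, format = "%Y-%d-%m %H:%M:%S", tz = "UTC")

# Order by subscriber and time so each row's predecessor is the previous event.
df <- df[order(df$subscriber_id, df$created), ]

# Days since the same subscriber's previous row (NA on their first row).
secs <- as.numeric(df$created)
prev_secs <- ave(secs, df$subscriber_id, FUN = function(x) c(NA, head(x, -1)))
df$gap_days <- (secs - prev_secs) / 86400

# Keep only February 2016 subscriptions and classify them.
feb_subs <- df[df$entity == "subscribe-online" &
                 format(df$created, "%Y-%m") == "2016-02", ]
# Assumption: new = no earlier record, or the last record is more than 365 days old.
feb_subs$new_user <- is.na(feb_subs$gap_days) | feb_subs$gap_days > 365

print(feb_subs[, c("subscriber_id", "created", "gap_days", "new_user")])
```

On the sample data this should mark 15855 as new (its previous record is 376 days earlier) and 15839 as a repeat user (55 days earlier). If an exact calendar-month cutoff matters, the 365-day threshold would need to be replaced with a date-based comparison.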