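I have a sample dataset of user IDs and the months in which they made a transaction. My goal is to compute, month by month, how many of the original users transacted again. In other words, how many of the users who were new in January also transacted in February, March and April? How many of the users who were new in February transacted in March and April, and so on?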
> data
date user_id
1 Jan 2017 1
2 Jan 2017 2
3 Jan 2017 3
4 Jan 2017 4
5 Jan 2017 5
6 Feb 2017 1
7 Feb 2017 3
8 Feb 2017 5
9 Feb 2017 7
10 Feb 2017 9
11 Mar 2017 2
12 Mar 2017 4
13 Mar 2017 6
14 Mar 2017 8
15 Mar 2017 10
16 Apr 2017 1
17 Apr 2017 3
18 Apr 2017 6
19 Apr 2017 9
20 Apr 2017 12
The expected output for this dataset looks like this:
> output
Jan Feb Mar Apr
Jan 5 3 2 2
Feb NA 2 0 1
Mar NA NA 3 1
Apr NA NA NA 1
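The only approach I have come up with so far is to split the dataset by month and then, for each month, find the unique IDs that did not appear in any earlier month. This is verbose and does not scale to a large dataset spanning many months.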
subsets <-split(data, data$date, drop=TRUE)
for (i in 1:length(subsets)) {
assign(paste0("M", i), as.data.frame(subsets[[i]]))
}
M1_ids <- unique(M1$user_id)
M2_ids <- unique(M2$user_id)
M3_ids <- unique(M3$user_id)
M4_ids <- unique(M4$user_id)
M2_ids <- unique(setdiff(M2_ids, unique(M1_ids)))
M3_ids <- unique(setdiff(M3_ids, unique(c(M2_ids, M1_ids))))
M4_ids <- unique(setdiff(M4_ids, unique(c(M3_ids, M2_ids, M1_ids))))
Is there a shorter way in R, using dplyr or even base R, to produce the output above? The real dataset covers many years and months.
The data has the following format:
> sapply(data, class)
date user_id
"yearmon" "integer"
Sample data:
> dput(data)
structure(list(date = structure(c(2017, 2017, 2017, 2017, 2017,
2017.08333333333, 2017.08333333333, 2017.08333333333, 2017.08333333333,
2017.08333333333, 2017.16666666667, 2017.16666666667, 2017.16666666667,
2017.16666666667, 2017.16666666667, 2017.25, 2017.25, 2017.25,
2017.25, 2017.25), class = "yearmon"), user_id = c(1L, 2L, 3L,
4L, 5L, 1L, 3L, 5L, 7L, 9L, 2L, 4L, 6L, 8L, 10L, 1L, 3L, 6L,
9L, 12L)), .Names = c("date", "user_id"), row.names = c(NA, -20L
), class = "data.frame")
Answer 0 (score: 2)
Here is an example:
library(data.table)
library(zoo)
data <- structure(list(date = structure(c(2017, 2017, 2017, 2017, 2017,
2017.08333333333, 2017.08333333333, 2017.08333333333, 2017.08333333333,
2017.08333333333, 2017.16666666667, 2017.16666666667, 2017.16666666667,
2017.16666666667, 2017.16666666667, 2017.25, 2017.25, 2017.25,
2017.25, 2017.25), class = "yearmon"), user_id = c(1L, 2L, 3L,
4L, 5L, 1L, 3L, 5L, 7L, 9L, 2L, 4L, 6L, 8L, 10L, 1L, 3L, 6L,
9L, 12L)), .Names = c("date", "user_id"), row.names = c(NA, -20L
), class = "data.frame")
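# duplicate the first row (presumably to show that repeated rows are handled by unique())
data <- data[c(1,1:nrow(data)),]
setDT(data)
# cohort = each user's first month; count rows per cohort and transaction month
(cohorts <- dcast(unique(data)[,cohort:=min(date),by=user_id],cohort~date,value.var="user_id",fun.aggregate=length))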
# cohort Jan 2017 Feb 2017 Mrz 2017 Apr 2017
# 1: Jan 2017 5 3 2 2
# 2: Feb 2017 0 2 0 1
# 3: Mrz 2017 0 0 3 1
# 4: Apr 2017 0 0 0 1
m <- as.matrix(cohorts[,-1])
rownames(m) <- cohorts[[1]]
m[lower.tri(m)] <- NA
names(dimnames(m)) <- c("cohort", "yearmon")
m
# yearmon
# cohort Jan 2017 Feb 2017 Mrz 2017 Apr 2017
# Jan 2017 5 3 2 2
# Feb 2017 NA 2 0 1
# Mrz 2017 NA NA 3 1
# Apr 2017 NA NA NA 1
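For comparison, since the question also asks about base R, the same idea (tag each user with their first month, then cross-tabulate cohort against transaction month) can be sketched without any packages. This is only a minimal sketch: it assumes the month is stored as a sortable "YYYY-MM" string rather than the `yearmon` class used in the question, and the variable names `months` and `m` are illustrative.

```r
# Sample data from the question, with the month as a "YYYY-MM" string
# (an assumption; the question stores it as zoo's yearmon class).
data <- data.frame(
  date = rep(c("2017-01", "2017-02", "2017-03", "2017-04"), each = 5),
  user_id = c(1L, 2L, 3L, 4L, 5L, 1L, 3L, 5L, 7L, 9L,
              2L, 4L, 6L, 8L, 10L, 1L, 3L, 6L, 9L, 12L),
  stringsAsFactors = FALSE
)

# Keep one row per (month, user) so repeated transactions count once
data <- unique(data)

# Cohort = first month in which each user appears
data$cohort <- ave(data$date, data$user_id, FUN = min)

# Use the same month levels for rows and columns so the table stays square
# even if some month has no new users
months <- sort(unique(data$date))
m <- unclass(table(
  cohort = factor(data$cohort, levels = months),
  month  = factor(data$date, levels = months)
))

# A cohort cannot transact before its first month, so blank out those cells
m[lower.tri(m)] <- NA
m
```

Using a shared set of factor levels for both dimensions is a design choice: it keeps the `lower.tri()` mask aligned with the calendar even when a month contributes no new cohort.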
Answer 1 (score: 1)
This is also possible with tidyverse functions:
library(tidyverse)
library(lubridate)
transactions <- tibble(
month=ymd(c("2017-01-01", "2017-01-01", "2017-02-01", "2017-02-01", "2017-03-01")),
user_id=c(1, 2, 1, 3, 3)
)
# Jan 1
# Jan 2
# Feb 1
# Feb 3
# Mar 3
# mark the cohort of the users
users <- transactions %>%
  group_by(user_id) %>%
  # date of the first transaction; one row per user even if the user
  # has several transactions in that first month
  summarise(cohort = min(month))
users
transactions %>%
group_by(month, user_id) %>%
distinct() %>%
left_join(users, by = 'user_id') %>%
xtabs(~ cohort + month, data = .)
# month
# cohort 2017-01-01 2017-02-01 2017-03-01
# 2017-01-01 2 1 0
# 2017-02-01 0 1 1
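Applied to the sample data from the question, the same steps can be condensed into a single pipeline that returns a long table (one row per cohort and month). This is a sketch rather than a definitive solution: it assumes months are stored as first-of-month `Date` values, and the names `cohort_counts` and `users` are illustrative. A long result may be easier to work with when the data covers many years, although combinations with zero users are simply absent from it.

```r
library(dplyr)

# Sample data from the question, with months as first-of-month Dates
# (an assumption; the question uses zoo's yearmon class).
transactions <- tibble(
  month = as.Date(rep(c("2017-01-01", "2017-02-01",
                        "2017-03-01", "2017-04-01"), each = 5)),
  user_id = c(1L, 2L, 3L, 4L, 5L, 1L, 3L, 5L, 7L, 9L,
              2L, 4L, 6L, 8L, 10L, 1L, 3L, 6L, 9L, 12L)
)

cohort_counts <- transactions %>%
  distinct(month, user_id) %>%       # count each user once per month
  group_by(user_id) %>%
  mutate(cohort = min(month)) %>%    # first month in which the user appears
  ungroup() %>%
  count(cohort, month, name = "users")

cohort_counts
```

If the wide layout from the question is needed, `xtabs(users ~ cohort + month, data = cohort_counts)` should give the cohort-by-month table, with zeros where a cohort had no transactions in a month.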