I have a dataset that looks like this:
| ID | Date | Stage |
|----|:----------:|-----------:|
| 1 | 2/1/2017 | Activity 1 |
| 1 | 4/1/2017 | Activity 2 |
| 1 | 5/15/2017 | Activity 1 |
| 1 | 5/20/2017 | Outcome 1 |
| 1 | 9/25/2017 | Activity 3 |
| 1 | 10/1/2017 | Outcome 0 |
| 2 | 4/1/2017 | Activity 1 |
| 2 | 10/5/2017 | Activity 4 |
| 2 | 10/10/2017 | Activity 4 |
| 2 | 10/20/2017 | Outcome 1 |
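I have already transformed it by subsetting, so that only the ID, Date, and Stage columns are kept for the rows where Stage is Outcome 0 or Outcome 1. I then turned the remaining Stage values into one column per unique activity. From here, for each ID and outcome, I want to count the activities in the Stage column that fall within 6 months (180 days) of that outcome. I would like it to look like this: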
| ID | Date | Stage | Activity 1 | Activity 2 | Activity 3 | Activity 4 |
|----|:----------:|----------:|------------|------------|------------|------------|
| 1 | 5/20/2017 | Outcome 1 | 2 | 1 | 1 | 0 |
| 1 | 10/1/2017 | Outcome 0 | 1 | 0 | 1 | 0 |
| 2 | 10/20/2017 | Outcome 1 | 0 | 0 | 0 | 2 |
Note that the second outcome for ID 1 counts Activity 1 only once, because the first instance is more than 180 days away.
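
The 180-day check behind that note can be verified directly. This is a minimal sketch that uses only dates from the table above; the variable names `outcomeDate`, `activityDates`, and `gapDays` are made up for illustration.

```r
# Second outcome for ID 1 and the two "Activity 1" rows for the same ID
outcomeDate <- as.Date("10/1/2017", format = "%m/%d/%Y")
activityDates <- as.Date(c("2/1/2017", "5/15/2017"), format = "%m/%d/%Y")

# Subtracting Date objects gives a difference in days
gapDays <- as.numeric(outcomeDate - activityDates)
gapDays              # 242 139

# Only the 5/15/2017 activity passes the 180-day test
sum(gapDays <= 180)  # 1
```
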
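I actually solved this with the for loop in the code below. The problem is that the dataset I am working with has millions of rows, and the Stage column actually has more than 50 distinct values, so the for loop takes an extremely long time, even to get through just a few columns. Is there a way to do this with aggregate and apply instead of a for loop? I cannot figure out how to adapt the function to each column name and ID.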
```r
# Current table
ID <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2)
Date <- as.Date(c("2/1/17", "4/1/17", "5/15/17", "5/20/17", "9/25/17",
                  "10/1/17", "4/1/17", "10/5/17", "10/10/17", "10/20/17"),
                format = "%m/%d/%y")
Stage <- c("Activity 1", "Activity 2", "Activity 1", "Outcome 1",
           "Activity 3", "Outcome 0", "Activity 1", "Activity 4", "Activity 4",
           "Outcome 1")
df <- data.frame(ID, Date, Stage)
df

# Desired table: keep only the outcome rows
outcomes <- subset(df, Stage == "Outcome 0" | Stage == "Outcome 1",
                   select = c(ID, Date, Stage))

# One column per unique activity, i.e. every Stage value that is not an outcome
columnNames <- as.character(unique(df$Stage))
columnNames <- columnNames[!columnNames %in% c("Outcome 0", "Outcome 1")]
columnNames

# Add the activity columns to the outcome rows, initialised to NA
dfModel <- cbind(outcomes, setNames(lapply(columnNames, function(x) NA),
                                    columnNames))
dfModel

# For each outcome row i and activity column j, count the rows of df that have
# the same ID and that activity, and where the outcome date minus the activity
# date is at most 180 days
for (j in 4:ncol(dfModel)) {
  for (i in seq_len(nrow(dfModel))) {
    dfModel[i, j] <- length(df$Stage[dfModel$Date[i] - df$Date <= 180 &
                                       df$ID == dfModel$ID[i] &
                                       df$Stage == colnames(dfModel)[j]])
  }
}
dfModel
```
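
For comparison, a loop-free version might look roughly like the sketch below. It rests on some assumptions: it uses base R's `merge()` and `table()` rather than aggregate and apply, and the names `activities`, `pairs`, `OutcomeRow`, `ActDate`, `Activity`, and `result` are made up here. It applies the same one-sided test as the loop (outcome date minus activity date <= 180) and has not been timed on the full dataset.

```r
# Rebuild the example data so the sketch runs on its own
df <- data.frame(
  ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
  Date = as.Date(c("2/1/17", "4/1/17", "5/15/17", "5/20/17", "9/25/17",
                   "10/1/17", "4/1/17", "10/5/17", "10/10/17", "10/20/17"),
                 format = "%m/%d/%y"),
  Stage = c("Activity 1", "Activity 2", "Activity 1", "Outcome 1",
            "Activity 3", "Outcome 0", "Activity 1", "Activity 4",
            "Activity 4", "Outcome 1"),
  stringsAsFactors = FALSE
)

# Split the rows into outcomes and activities
isOutcome <- df$Stage %in% c("Outcome 0", "Outcome 1")
outcomes <- df[isOutcome, ]
outcomes$OutcomeRow <- seq_len(nrow(outcomes))  # hypothetical row key
activities <- setNames(df[!isOutcome, ], c("ID", "ActDate", "Activity"))

# Pair every outcome with every activity that shares its ID
pairs <- merge(outcomes, activities, by = "ID")

# Same test as the loop: outcome date minus activity date <= 180 days
pairs <- pairs[as.numeric(pairs$Date - pairs$ActDate) <= 180, ]

# Fix the factor levels so outcomes or activities with no match still appear
pairs$OutcomeRow <- factor(pairs$OutcomeRow, levels = outcomes$OutcomeRow)
pairs$Activity <- factor(pairs$Activity, levels = unique(activities$Activity))

# Cross-tabulate the surviving pairs: one row per outcome, one column per activity
counts <- table(pairs$OutcomeRow, pairs$Activity)
result <- cbind(outcomes[, c("ID", "Date", "Stage")],
                as.data.frame.matrix(counts))
rownames(result) <- NULL
result
```

On the example data this should reproduce the desired table above. The intermediate `pairs` table holds one row per outcome and activity pair within an ID, so with millions of rows it may have to be built for one batch of IDs at a time.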