How to in R

Time: 2018-01-26 22:36:55

Tags: r for-loop reshape

I have a dataset that looks like this:

| ID |    Date    |      Stage |
|----|:----------:|-----------:|
| 1  |  2/1/2017  | Activity 1 |
| 1  |  4/1/2017  | Activity 2 |
| 1  |  5/15/2017 | Activity 1 |
| 1  | 5/20/2017  | Outcome 1  |
| 1  | 9/25/2017  | Activity 3 |
| 1  | 10/1/2017  | Outcome 0  |
| 2  | 4/1/2017   | Activity 1 |
| 2  | 10/5/2017  | Activity 4 |
| 2  | 10/10/2017 | Activity 4 |
| 2  | 10/20/2017 | Outcome 1  |

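I have transformed it by subsetting so that only the ID, Date, and Stage columns are kept for rows where Stage is Outcome 0 or Outcome 1. I then turned the remaining Stage values into one column per unique Activity. From here, I want to count, for each ID and each outcome, the activities in the Stage column that fall less than 6 months (180 days) before that Outcome. I would like the result to look like this: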

| ID |    Date    |     Stage | Activity 1 | Activity 2 | Activity 3 | Activity 4 |
|----|:----------:|----------:|------------|------------|------------|------------|
| 1  |  5/20/2017 | Outcome 1 | 2          | 1          | 1          | 0          |
| 1  |  10/1/2017 | Outcome 0 | 1          | 0          | 1          | 0          |
| 2  | 10/20/2017 | Outcome 1 | 0          | 0          | 0          | 2          |

Note that the second outcome for ID 1 counts Activity 1 only once, because the first instance is more than 180 days before that outcome.
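The 180-day cutoff in that note can be checked with plain `Date` subtraction. This is a minimal sketch using only dates from the table above; the variable names are illustrative.

```r
# Second outcome for ID 1 and the two Activity 1 dates for the same ID
outcome_date   <- as.Date("10/1/2017", format = "%m/%d/%Y")
activity_dates <- as.Date(c("2/1/2017", "5/15/2017"), format = "%m/%d/%Y")

# Subtracting Dates gives a difftime in days; convert to plain numbers
days_before <- as.numeric(outcome_date - activity_dates)
days_before          # 242 139
days_before <= 180   # FALSE TRUE: only the 5/15/2017 activity is counted
```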

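I have actually solved this with the for loop in the code below. The problem is that my real dataset has millions of rows and more than 50 distinct values in the Stage column, so the loop takes an extremely long time even to get through a few columns. Is there a way to do this with aggregate and apply instead of a for loop? I can't figure out how to adapt the function to each column name and ID.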

# Current table
ID <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2)
Date <- as.Date(c("2/1/17", "4/1/17", "5/15/17", "5/20/17", "9/25/17",
                  "10/1/17", "4/1/17", "10/5/17", "10/10/17", "10/20/17"),
                format = "%m/%d/%y")
Stage <- c("Activity 1", "Activity 2", "Activity 1", "Outcome 1",
           "Activity 3", "Outcome 0", "Activity 1", "Activity 4", "Activity 4",
           "Outcome 1")
# Keep Stage as character rather than factor
df <- data.frame(ID, Date, Stage, stringsAsFactors = FALSE)
df

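# Desired table: start from the outcome rows only
outcomes <- subset(df, Stage %in% c("Outcome 0", "Outcome 1"),
                   select = c(ID, Date, Stage))

# One column per distinct activity (every Stage value that is not an outcome)
columnNames <- as.character(unique(df$Stage))
columnNames <- columnNames[!columnNames %in% c("Outcome 0", "Outcome 1")]
columnNames

# Add the activity columns, initialised to NA
dfModel <- cbind(outcomes, setNames(lapply(columnNames, function(x) NA),
                                    columnNames))
dfModel

# For each outcome row i and activity column j, count the rows of df with the
# same ID and that activity, dated at most 180 days before the outcome.
# The difference has no lower bound, so activities dated after the outcome
# are counted too (this matches the desired table above).
for (j in 4:ncol(dfModel)) {
  for (i in seq_len(nrow(dfModel))) {
    dfModel[i, j] <- length(df$Stage[dfModel$Date[i] - df$Date <= 180 &
                                     df$ID == dfModel$ID[i] &
                                     df$Stage == colnames(dfModel)[j]])
  }
}
dfModel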
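One possible way to avoid the double loop, sketched here in base R only, is to pair every outcome with every activity of the same ID using `merge`, apply the same 180-day condition once to all pairs, and tabulate the survivors. The names `key`, `pairs`, `activity_levels` and `result` are illustrative and not part of the original code. This is a sketch under the assumption that the per-ID pairs fit in memory; on millions of rows the intermediate `pairs` table could be large, and a non-equi join in a package such as data.table may scale better.

```r
# Self-contained copy of the example data
df <- data.frame(
  ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
  Date = as.Date(c("2/1/17", "4/1/17", "5/15/17", "5/20/17", "9/25/17",
                   "10/1/17", "4/1/17", "10/5/17", "10/10/17", "10/20/17"),
                 format = "%m/%d/%y"),
  Stage = c("Activity 1", "Activity 2", "Activity 1", "Outcome 1",
            "Activity 3", "Outcome 0", "Activity 1", "Activity 4",
            "Activity 4", "Outcome 1"),
  stringsAsFactors = FALSE
)

# Split the rows into outcomes and activities
is_outcome <- df$Stage %in% c("Outcome 0", "Outcome 1")
outcomes   <- df[is_outcome, ]
activities <- df[!is_outcome, ]
activity_levels <- unique(activities$Stage)

# Tag each outcome row, then pair it with every activity of the same ID
outcomes$key <- seq_len(nrow(outcomes))
pairs <- merge(outcomes, activities, by = "ID", suffixes = c("", ".act"))

# Same condition as the loop: activity at most 180 days before the outcome
# (no lower bound, so later activities are kept as well)
pairs <- pairs[as.numeric(pairs$Date - pairs$Date.act) <= 180, ]

# Cross-tabulate outcome key by activity; factor levels keep zero counts
counts <- table(factor(pairs$key, levels = outcomes$key),
                factor(pairs$Stage.act, levels = activity_levels))

result <- cbind(outcomes[c("ID", "Date", "Stage")],
                as.data.frame.matrix(counts))
rownames(result) <- NULL
result
```

On the example data this gives the same counts as the desired table. The factor levels are set explicitly so that outcomes or activities with no matching pair still appear with a count of 0.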

0 answers:

No answers