Create a column for each observation of a variable

Time: 2019-05-15 03:17:42

Tags: r dplyr

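Suppose respondents (id) are asked to make a choice in each of five tasks (t = 1, 2, 3, 4, 5), which gives a panel data set with five observations per respondent. Once a choice is made, the outcome is shown to the respondent. Suppose the data look like this.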

+----+---+---------+
| id | t | outcome |
+----+---+---------+
|  1 | 1 |      10 |
|  1 | 2 |      20 |
|  1 | 3 |      30 |
|  1 | 4 |      40 |
|  1 | 5 |      40 |
|  2 | 1 |      20 |
|  2 | 2 |      30 |
|  2 | 3 |      40 |
|  2 | 4 |      40 |
|  2 | 5 |      20 |
|  . | . |       . |
|  . | . |       . |
|  . | . |       . |
+----+---+---------+

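Now I want to keep, for each task t, a history of the outcome variable from the earlier tasks (1 through t-1). This is the output I am aiming for.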


+----+---+---------+------------+------------+------------+------------+------------+
| id | t | outcome | outcome_t1 | outcome_t2 | outcome_t3 | outcome_t4 | outcome_t5 |
+----+---+---------+------------+------------+------------+------------+------------+
|  1 | 1 |      10 | NA         | NA         | NA         | NA         | NA         |
|  1 | 2 |      20 | 10         | NA         | NA         | NA         | NA         |
|  1 | 3 |      30 | 10         | 20         | NA         | NA         | NA         |
|  1 | 4 |      40 | 10         | 20         | 30         | NA         | NA         |
|  1 | 5 |      40 | 10         | 20         | 30         | 40         | NA         |
|  2 | 1 |      20 | NA         | NA         | NA         | NA         | NA         |
|  2 | 2 |      30 | 20         | NA         | NA         | NA         | NA         |
|  2 | 3 |      40 | 20         | 30         | NA         | NA         | NA         |
|  2 | 4 |      40 | 20         | 30         | 40         | NA         | NA         |
|  2 | 5 |      20 | 20         | 30         | 40         | 40         | NA         |
|  . | . |       . | .          | .          | .          | .          | .          |
|  . | . |       . | .          | .          | .          | .          | .          |
|  . | . |       . | .          | .          | .          | .          | .          |
+----+---+---------+------------+------------+------------+------------+------------+

I went through most of the related questions on this forum, but most of them deal with lagged columns, which do not apply to this case.

Perhaps mutate from dplyr could be a simple and efficient way to do this, but so far I have not been able to get it to work.

4 answers:

Answer 0: (score: 2)

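We can use a data.table approach. Convert the 'data.frame' to a 'data.table' (setDT(df1)), group by 'id', loop over 'outcome' and rep each element together with a leading NA, using 1:.N and .N:1 as the repeat counts so that NA acts as padding, then join with the original dataset on the 'id' and 't' columns.

library(data.table)
# One padded column per task: i NAs followed by the i-th outcome repeated
df2 <- setDT(df1)[, Map(function(x, y, z) rep(c(NA, x), c(y, z)),
                        outcome, 1:.N, .N:1), id][, t := rowid(id)]
# Join back on id and t, then reorder and rename the history columns
out <- df2[df1, on = .(id, t)]
setcolorder(out, c(1, 7, 8, 2:6))
setnames(out, 4:ncol(out), paste0("outcome_t", 1:5))
out
#    id t outcome outcome_t1 outcome_t2 outcome_t3 outcome_t4 outcome_t5
# 1:  1 1      10         NA         NA         NA         NA         NA
# 2:  1 2      20         10         NA         NA         NA         NA
# 3:  1 3      30         10         20         NA         NA         NA
# 4:  1 4      40         10         20         30         NA         NA
# 5:  1 5      40         10         20         30         40         NA
# 6:  2 1      20         NA         NA         NA         NA         NA
# 7:  2 2      30         20         NA         NA         NA         NA
# 8:  2 3      40         20         30         NA         NA         NA
# 9:  2 4      40         20         30         40         NA         NA
#10:  2 5      20         20         30         40         40         NA

Or an option with dcast

library(zoo)  # provides na.locf
# Note: as written, the fill appears to start at the current row, so row t also
# carries its own outcome in outcome_t<t>; lag the filled columns if only
# earlier tasks are wanted.
dcast(setDT(df1), id + t ~ paste0("outcome_t", t), 
       value.var = 'outcome')[, na.locf(.SD, na.rm = FALSE), id]

Or we can do this more compactly

library(zoo)
nm1 <- paste0("outcome_t", 1:5)
# df1 is assumed to be a plain data.frame here.
# (NA^!diag(x)) * x keeps x on the diagonal and NA elsewhere; na.locf fills down,
# and the leading NA row plus head(..., -1) shifts everything by one task.
df1[nm1] <- do.call(rbind, lapply(split(df1$outcome, df1$id), 
          function(x) head(rbind(NA, na.locf((NA^!diag(x)) * x)), -1)))

Or using colCumsums

library(matrixStats)
# Drop the last cumulative row (it contains the current task's own outcome).
# Tasks not yet taken come out as 0 rather than NA with this approach.
df1[nm1] <- do.call(rbind, lapply(split(df1$outcome, df1$id), 
          function(x) colCumsums(rbind(0, diag(x)))[-(length(x) + 1), ]))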
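The data block for this answer does not seem to have survived. As a minimal sketch, assuming df1 is meant to hold the ten rows shown in the question (two respondents, five tasks each), it could be rebuilt in base R like this. The values are copied from the question's table; nothing else is assumed.

```r
# Hypothetical reconstruction of df1 from the question's table
df1 <- data.frame(
  id      = rep(1:2, each = 5),       # respondent
  t       = rep(1:5, times = 2),      # task number within respondent
  outcome = c(10, 20, 30, 40, 40,     # respondent 1
              20, 30, 40, 40, 20)     # respondent 2
)
str(df1)
```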

Answer 1: (score: 2)

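A base R approach: we can split outcome by the id column and, for each respondent, build a matrix that adds one more value of the outcome variable per row and pads the rest with NA, then rbind the pieces for all respondents into one block of columns.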

n <- 5
df[paste0("outcome_t", seq_len(n))] <- do.call(rbind, 
    lapply(split(df$outcome, df$id), function(x) 
  t(sapply(seq_along(x), function(y) c(x[seq_len(y - 1)], rep(NA, n - (y - 1)))))))

df
#   id t outcome outcome_t1 outcome_t2 outcome_t3 outcome_t4 outcome_t5
#1   1 1      10         NA         NA         NA         NA         NA
#2   1 2      20         10         NA         NA         NA         NA
#3   1 3      30         10         20         NA         NA         NA
#4   1 4      40         10         20         30         NA         NA
#5   1 5      40         10         20         30         40         NA
#6   2 1      20         NA         NA         NA         NA         NA
#7   2 2      30         20         NA         NA         NA         NA
#8   2 3      40         20         30         NA         NA         NA
#9   2 4      40         20         30         40         NA         NA
#10  2 5      20         20         30         40         40         NA
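Another way to see what this computes, as a rough sketch rather than a replacement for the code above: row i of each respondent's history is the outcome vector with everything from position i onward blanked out, which is a strictly lower-triangular mask. The helper name history_matrix is made up for illustration, and the sketch assumes the rows are already sorted by id and then t.

```r
# Hypothetical helper: history of earlier outcomes for one respondent
history_matrix <- function(x) {
  n <- length(x)
  m <- matrix(x, nrow = n, ncol = n, byrow = TRUE)  # every row is the full outcome vector
  m[!lower.tri(m)] <- NA                            # keep only column j < row i
  colnames(m) <- paste0("outcome_t", seq_len(n))
  m
}

df <- data.frame(
  id      = rep(1:2, each = 5),
  t       = rep(1:5, times = 2),
  outcome = c(10, 20, 30, 40, 40, 20, 30, 40, 40, 20)
)

# Build one matrix per respondent, stack them, and bind to the original columns
hist_cols <- do.call(rbind, lapply(split(df$outcome, df$id), history_matrix))
result <- cbind(df, hist_cols)
result
```

Because the mask is built from length(x), this version does not need n to be set in advance, but it does assume every respondent has the same number of tasks so the stacked matrices line up.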

A tidyverse option using separate:

library(tidyverse)

df %>%
   group_by(id) %>%
   mutate(new = map_chr(seq_along(outcome), 
         ~paste0(outcome[seq_len(. - 1)], collapse = ","))) %>%
   separate(new, into = paste0("outcome_t", seq_len(n)), 
                 sep = ",", fill = "right") %>%
   mutate(outcome_t1 = replace(outcome_t1, outcome_t1 == "", NA))

Answer 2: (score: 2)

Another data.table approach, using transpose:

DT[, paste0("outcome_t", 1:5) := 
        transpose(lapply(t, function(x) replace(outcome, t>=x, NA))), 
    by=.(id)]

Output:

    id t outcome outcome_t1 outcome_t2 outcome_t3 outcome_t4 outcome_t5
 1:  1 1      10         NA         NA         NA         NA         NA
 2:  1 2      20         10         NA         NA         NA         NA
 3:  1 3      30         10         20         NA         NA         NA
 4:  1 4      40         10         20         30         NA         NA
 5:  1 5      40         10         20         30         40         NA
 6:  2 1      20         NA         NA         NA         NA         NA
 7:  2 2      30         20         NA         NA         NA         NA
 8:  2 3      40         20         30         NA         NA         NA
 9:  2 4      40         20         30         40         NA         NA
10:  2 5      20         20         30         40         40         NA

Data:

library(data.table)
DT <- fread("| id | t | outcome |
|  1 | 1 |      10 |
|  1 | 2 |      20 |
|  1 | 3 |      30 |
|  1 | 4 |      40 |
|  1 | 5 |      40 |
|  2 | 1 |      20 |
|  2 | 2 |      30 |
|  2 | 3 |      40 |
|  2 | 4 |      40 |
|  2 | 5 |      20 |")[, c(-1,-5)]
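To see why the transpose step is needed, here is a small base R sketch of what the lapply call produces for a single respondent (the values are respondent 1 from the question). Each list element is one row of the history, so the list has to be flipped into columns before it can be assigned with :=. In this sketch do.call(rbind, ...) stands in for that step, only to make the shape visible.

```r
outcome <- c(10, 20, 30, 40, 40)
t <- 1:5

# For task x, blank out the outcome of task x and every later task
rows <- lapply(t, function(x) replace(outcome, t >= x, NA))
rows[[3]]   # history available at task 3

# Stack the per-task rows into a matrix: row = task, column = outcome_t1..t5
history <- do.call(rbind, rows)
history
```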

Answer 3: (score: 2)

Here is a tidyverse approach.

library(tidyverse)

df %>% 
  mutate(rn = 1:n(),
         t = paste0("outcome_t", t)) %>%
  group_by(id) %>%
  spread(t, outcome) %>%
  mutate_at(vars(-rn, -id), lag) %>%
  fill(-rn, -id)

# A tibble: 10 x 7
# Groups:   id [2]
      id    rn outcome_t1 outcome_t2 outcome_t3 outcome_t4 outcome_t5
   <int> <int>      <int>      <int>      <int>      <int>      <int>
 1     1     1         NA         NA         NA         NA         NA
 2     1     2         10         NA         NA         NA         NA
 3     1     3         10         20         NA         NA         NA
 4     1     4         10         20         30         NA         NA
 5     1     5         10         20         30         40         NA
 6     2     6         NA         NA         NA         NA         NA
 7     2     7         20         NA         NA         NA         NA
 8     2     8         20         30         NA         NA         NA
 9     2     9         20         30         40         NA         NA
10     2    10         20         30         40         40         NA
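The key to this answer is the order of operations after spreading: lag first, then fill down. A minimal base R sketch of that logic on a single column follows, assuming the column is what outcome_t2 looks like for respondent 1 right after the spread step. The helper locf is a hypothetical stand-in for what fill does within a group.

```r
# Hypothetical helper: carry the last non-NA value forward
locf <- function(x) {
  idx <- cummax(seq_along(x) * (!is.na(x)))  # position of latest non-NA so far, 0 if none
  c(NA, x)[idx + 1]                          # idx 0 maps to the leading NA
}

spread_col <- c(NA, 20, NA, NA, NA)          # outcome_t2 after spreading: value only at t = 2
lagged <- c(NA, head(spread_col, -1))        # shift down one row so task 2 is not its own history
filled <- locf(lagged)                       # carry the value forward to later tasks
filled
```

Filling without the lag would put 20 in the row for t = 2 as well, which is the current task rather than its history.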