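Suppose respondents (id) are asked to make a choice in each of five tasks (t = 1,2,3,4,5), giving a panel dataset with five observations per respondent. Once a choice is made, the outcome is shown to the respondent. Suppose the data looks as follows.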
+----+---+---------+
| id | t | outcome |
+----+---+---------+
| 1 | 1 | 10 |
| 1 | 2 | 20 |
| 1 | 3 | 30 |
| 1 | 4 | 40 |
| 1 | 5 | 40 |
| 2 | 1 | 20 |
| 2 | 2 | 30 |
| 2 | 3 | 40 |
| 2 | 4 | 40 |
| 2 | 5 | 20 |
| . | . | . |
| . | . | . |
| . | . | . |
+----+---+---------+
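Now, for each task, I want to keep a history of the outcome variable from all previous tasks (up to t-1). I am aiming for the following output.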
+----+---+---------+------------+------------+------------+------------+------------+
| id | t | outcome | outcome_t1 | outcome_t2 | outcome_t3 | outcome_t4 | outcome_t5 |
+----+---+---------+------------+------------+------------+------------+------------+
| 1 | 1 | 10 | NA | NA | NA | NA | NA |
| 1 | 2 | 20 | 10 | NA | NA | NA | NA |
| 1 | 3 | 30 | 10 | 20 | NA | NA | NA |
| 1 | 4 | 40 | 10 | 20 | 30 | NA | NA |
| 1 | 5 | 40 | 10 | 20 | 30 | 40 | NA |
| 2 | 1 | 20 | NA | NA | NA | NA | NA |
| 2 | 2 | 30 | 20 | NA | NA | NA | NA |
| 2 | 3 | 40 | 20 | 30 | NA | NA | NA |
| 2 | 4 | 40 | 20 | 30 | 40 | NA | NA |
| 2 | 5 | 20 | 20 | 30 | 40 | 40 | NA |
| . | . | . | . | . | . | . | . |
| . | . | . | . | . | . | . | . |
| . | . | . | . | . | . | . | . |
+----+---+---------+------------+------------+------------+------------+------------+
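I went through most of the related questions on this forum, but most of them deal with lagged columns, which does not apply to this case.
Perhaps using mutate with dplyr would be a simple and efficient approach, but so far I have not been able to make it work.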
Answer 0 (score: 2)
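We can use data.table methods. Convert the 'data.frame' to a 'data.table' (setDT(df1)), group by 'id', loop through 'outcome', and rep NA and each element with the counts 1:.N and .N:1, so that NA acts as the padding. Then join with the original dataset on the 'id' and 't' columns.
library(data.table)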
df2 <- setDT(df1)[, Map(function(x, y, z) rep(c(NA, x),
c(y, z)), outcome, 1:.N, .N:1), id][, t := rowid(id)]
out <- df2[df1, on = .(id, t)]
setcolorder(out, c(1, 7, 8, 2:6))
setnames(out, 4:ncol(out), paste0("outcome_t", 1:5))
out
# id t outcome outcome_t1 outcome_t2 outcome_t3 outcome_t4 outcome_t5
# 1: 1 1 10 NA NA NA NA NA
# 2: 1 2 20 10 NA NA NA NA
# 3: 1 3 30 10 20 NA NA NA
# 4: 1 4 40 10 20 30 NA NA
# 5: 1 5 40 10 20 30 40 NA
# 6: 2 1 20 NA NA NA NA NA
# 7: 2 2 30 20 NA NA NA NA
# 8: 2 3 40 20 30 NA NA NA
# 9: 2 4 40 20 30 40 NA NA
#10: 2 5 20 20 30 40 40 NA
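To see what the Map/rep step produces, here is a minimal base R sketch for a single respondent. The vector x holds respondent 1's outcomes from the question; the names cols and history_a are only illustrative and are not part of the answer above.

```r
# Outcomes of respondent 1, taken from the example data in the question
x <- c(10, 20, 30, 40, 40)
n <- length(x)

# For the i-th outcome: i NAs followed by the value repeated n - i + 1 times,
# so every generated column has n + 1 elements.
cols <- Map(function(v, n_na, n_val) rep(c(NA, v), c(n_na, n_val)),
            x, seq_len(n), rev(seq_len(n)))

# Bind the columns and keep the first n rows; in the data.table version the
# extra last row is discarded by the join on 'id' and 't'.
history_a <- do.call(cbind, cols)[seq_len(n), ]
colnames(history_a) <- paste0("outcome_t", seq_len(n))
history_a
```

Row 3 of history_a holds 10 and 20 followed by NA, which is the history available before task 3.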
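Or, more compactly, an option with dcast (na.locf comes from the zoo package):
library(zoo)
# note: the fill starts at each task's own row, so as written row t also
# carries the outcome of task t (the columns are not shifted by one task)
dcast(setDT(df1), id + t ~ paste0("outcome_t", t),
value.var = 'outcome')[, na.locf(.SD, na.rm = FALSE), id]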
Or using na.locf from zoo:
library(zoo)
# assumes df1 is a plain data.frame (not converted with setDT)
nm1 <- paste0("outcome_t", 1:5)
df1[nm1] <- do.call(rbind, lapply(split(df1$outcome, df1$id),
function(x) head(rbind(NA, na.locf((NA^!diag(x)) * x)), -1)))
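Or using colCumsums from matrixStats:
library(matrixStats)
# rbind(0, diag(x)) has length(x) + 1 rows, so drop the extra last row;
# cells without a previous outcome come out as 0 here, not NA
df1[nm1] <- do.call(rbind, lapply(split(df1$outcome, df1$id),
function(x) colCumsums(rbind(0, diag(x)))[-(length(x) + 1), ]))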
Answer 1 (score: 2)
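A base R approach: we can split the outcome column by id and, for each respondent, build a data frame that adds one more value from the outcome variable per row and fills the remaining positions with NA. Finally, we rbind this list of data frames into a single data frame.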
n <- 5
df[paste0("outcome_t", seq_len(n))] <- do.call(rbind,
lapply(split(df$outcome, df$id), function(x)
t(sapply(seq_along(x), function(y) c(x[seq_len(y - 1)], rep(NA, n - (y - 1)))))))
df
# id t outcome outcome_t1 outcome_t2 outcome_t3 outcome_t4 outcome_t5
#1 1 1 10 NA NA NA NA NA
#2 1 2 20 10 NA NA NA NA
#3 1 3 30 10 20 NA NA NA
#4 1 4 40 10 20 30 NA NA
#5 1 5 40 10 20 30 40 NA
#6 2 1 20 NA NA NA NA NA
#7 2 2 30 20 NA NA NA NA
#8 2 3 40 20 30 NA NA NA
#9 2 4 40 20 30 40 NA NA
#10 2 5 20 20 30 40 40 NA
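As a rough illustration of the inner function, here is the same logic applied to respondent 2 alone. The helper name history_row is hypothetical and only used for this sketch; the values come from the question's example data.

```r
# Outcomes of respondent 2, taken from the example data in the question
x <- c(20, 30, 40, 40, 20)
n <- 5

# History available before task y: the first y - 1 outcomes, padded with NA
# so that every row has exactly n entries
history_row <- function(y) c(x[seq_len(y - 1)], rep(NA, n - (y - 1)))

history_row(1)  # nothing observed yet: five NAs
history_row(3)  # 20 30 NA NA NA

# One row per task; sapply() returns the rows as columns, hence t()
history_b <- t(sapply(seq_along(x), history_row))
history_b
```

In the full answer, split() applies this per respondent and do.call(rbind, ...) stacks the resulting matrices.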
A tidyverse option using separate (n is the value defined above, and the new columns come out as character):
library(tidyverse)
df %>%
group_by(id) %>%
mutate(new = map_chr(seq_along(outcome),
~paste0(outcome[seq_len(. - 1)], collapse = ","))) %>%
separate(new, into = paste0("outcome_t", seq_len(n)),
sep = ",", fill = "right") %>%
mutate(outcome_t1 = replace(outcome_t1, outcome_t1 == "", NA))
Answer 2 (score: 2)
Another data.table approach, using transpose:
DT[, paste0("outcome_t", 1:5) :=
transpose(lapply(t, function(x) replace(outcome, t>=x, NA))),
by=.(id)]
Output:
id t outcome outcome_t1 outcome_t2 outcome_t3 outcome_t4 outcome_t5
1: 1 1 10 NA NA NA NA NA
2: 1 2 20 10 NA NA NA NA
3: 1 3 30 10 20 NA NA NA
4: 1 4 40 10 20 30 NA NA
5: 1 5 40 10 20 30 40 NA
6: 2 1 20 NA NA NA NA NA
7: 2 2 30 20 NA NA NA NA
8: 2 3 40 20 30 NA NA NA
9: 2 4 40 20 30 40 NA NA
10: 2 5 20 20 30 40 40 NA
Data:
library(data.table)
DT <- fread("| id | t | outcome |
| 1 | 1 | 10 |
| 1 | 2 | 20 |
| 1 | 3 | 30 |
| 1 | 4 | 40 |
| 1 | 5 | 40 |
| 2 | 1 | 20 |
| 2 | 2 | 30 |
| 2 | 3 | 40 |
| 2 | 4 | 40 |
| 2 | 5 | 20 |")[, c(-1,-5)]
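The replace step can be sketched for one respondent in base R, without data.table. Here rbind plays the role that transpose plays in the answer; the names task, history_rows and history_c are only illustrative.

```r
# One respondent's tasks and outcomes, taken from the question's example
task <- 1:5
outcome <- c(10, 20, 30, 40, 40)

# For each task x, blank out the outcome of task x and of all later tasks
history_rows <- lapply(task, function(x) replace(outcome, task >= x, NA))

history_rows[[4]]  # 10 20 30 NA NA

# Stack the per-task vectors into one row per task
history_c <- do.call(rbind, history_rows)
colnames(history_c) <- paste0("outcome_t", task)
history_c
```

Because the comparison uses the values of t rather than row positions, this variant does not depend on the rows being sorted by task within a respondent.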
Answer 3 (score: 2)
Here is a tidyverse approach.
library(tidyverse)
df %>%
mutate(rn = 1:n(),
t = paste0("outcome_t", t)) %>%
group_by(id) %>%
spread(t, outcome) %>%
mutate_at(vars(-rn, -id), lag) %>%
fill(-rn, -id)
# A tibble: 10 x 7
# Groups: id [2]
id rn outcome_t1 outcome_t2 outcome_t3 outcome_t4 outcome_t5
<int> <int> <int> <int> <int> <int> <int>
1 1 1 NA NA NA NA NA
2 1 2 10 NA NA NA NA
3 1 3 10 20 NA NA NA
4 1 4 10 20 30 NA NA
5 1 5 10 20 30 40 NA
6 2 6 NA NA NA NA NA
7 2 7 20 NA NA NA NA
8 2 8 20 30 NA NA NA
9 2 9 20 30 40 NA NA
10 2 10 20 30 40 40 NA