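Brief dataset description: I have survey data generated from Qualtrics, which I have imported into R as a tibble. Each column corresponds to a survey question, and I have preserved the original column order (matching the order of the questions in the survey).
Problem in plain language: Because of normal participant attrition, not all participants completed every question in the survey. I want to know how far each participant got in the survey, and the last question each of them answered before stopping.
Problem statement in R: I want to generate (using tidyverse) two new columns: lastq, the name of the last non-NA column in each row, and lastqnum, the position of that column.
Example data frame df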
df <- tibble(
  year = c(2015, 2015, 2016, 2016),
  grade = c(1, NA, 1, NA),
  height = c("short", "tall", NA, NA),
  gender = c(NA, "m", NA, "f")
)
Original df
# A tibble: 4 x 4
   year grade height gender
  <dbl> <dbl> <chr>  <chr>
1  2015     1 short  <NA>
2  2015    NA tall   m
3  2016     1 <NA>   <NA>
4  2016    NA <NA>   f
Desired final df
# A tibble: 4 x 6
   year grade height gender lastq  lastqnum
  <dbl> <dbl> <chr>  <chr>  <chr>     <dbl>
1  2015     1 short  <NA>   height        3
2  2015    NA tall   m      gender        4
3  2016     1 <NA>   <NA>   grade         2
4  2016    NA <NA>   f      gender        4
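There are some other related questions, but I can't seem to find any answers that focus on extracting the column names (vs. the values themselves) for data with mixed variable classes, using a tidyverse solution.
What I've been trying (I know there's something I'm missing here...):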
ds %>% map(which(!is.na(.)))
ds %>%
map(tail(!is.na(.), 2))
ds %>%
rowwise() %>%
mutate(last = which(!is.na(ds)))
Thanks so much for your help!
Answer 0 (score: 1)
Write a function that solves the problem, following James' suggestion but a little more robust (handles the case when all answers are NA)
f0 = function(df) {
idx = ifelse(is.na(df), 0L, col(df))
apply(idx, 1, max)
}
The L makes the 0 an integer, rather than numeric. For a speed improvement (when there are many rows), use the matrixStats package
f1 = function(df) {
idx = ifelse(is.na(df), 0L, col(df))
matrixStats::rowMaxs(idx, na.rm=TRUE)
}
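As a quick check that does not need matrixStats, here is a minimal sketch that runs f0 on the question's example data plus one extra row with no answers. The extra row, the use of a base data.frame instead of a tibble, and the helper name f2 are my assumptions for illustration. f2 is a swapped-in base-R alternative built on max.col, not part of the suggestions above.

```r
# Example data from the question, plus a hypothetical fifth row with no answers.
df <- data.frame(
  year   = c(2015, 2015, 2016, 2016, NA),
  grade  = c(1, NA, 1, NA, NA),
  height = c("short", "tall", NA, NA, NA),
  gender = c(NA, "m", NA, "f", NA)
)

# f0 as defined above: column index where answered, 0 where NA, then row maximum.
f0 <- function(df) {
  idx <- ifelse(is.na(df), 0L, col(df))
  apply(idx, 1, max)
}

# Hypothetical base-R alternative: max.col() with ties.method = "last" picks the
# last TRUE in each row; multiplying by the logical zeroes out rows with no answers.
f2 <- function(df) {
  answered <- !is.na(df)
  max.col(answered, ties.method = "last") * (rowSums(answered) > 0)
}

res0 <- unname(f0(df))
res2 <- f2(df)
print(res0)
print(res2)
```

Both versions should agree row by row, including a 0 for the row with no answers. The multiplication in f2 is needed because max.col on a row of all FALSE would otherwise report the last column.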
Follow markus' suggestion to use f1 in a dplyr context (either form works)
mutate(df, lastqnum = f1(df), lastq = c(NA, names(df))[lastqnum + 1])
df %>% mutate(lastqnum = f1(.), lastq = c(NA, names(.))[lastqnum + 1])
or just do it
lastqnum = f1(df)
cbind(df, lastq=c(NA, names(df))[lastqnum + 1], lastqnum)
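The c(NA, names(df))[lastqnum + 1] lookup is what keeps rows with no answers from disappearing. A small sketch of why, using hypothetical vectors (question_names and the lastqnum values are made up for illustration):

```r
# Hypothetical column names and last-answered indices; 0 means "no answers".
question_names <- c("year", "grade", "height", "gender")
lastqnum <- c(3L, 4L, 2L, 4L, 0L)

# Prepending NA shifts every name one slot to the right, so index 0 + 1 lands
# on NA instead of being dropped.
lastq <- c(NA, question_names)[lastqnum + 1]
print(lastq)

# Indexing without the shift drops the zero, so the result is one element short.
print(length(question_names[lastqnum]))
```

Without the shift, the result would be shorter than the number of rows, and mutate or cbind would either fail or recycle values.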
Edited after acceptance: I guess the tidy approach would be first to tidy the data into long form
df1 = cbind(gather(df), id = as.vector(row(df)), event = as.vector(col(df)))
and then to group and summarize
group_by(df1, id) %>%
summarize(lastq = tail(event[!is.na(value)], 1), lastqname = key[lastq])
This doesn't handle the case when there are no answers.
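One possible way to cover the no-answers case in the long-form approach is sketched below. It assumes dplyr and tidyr are installed, uses pivot_longer in place of gather (my substitution), and adds a hypothetical fifth row with no answers to the question's example data. Prepending 0 before taking the maximum gives rows with no answers a lastqnum of 0, and the NA-prefixed lookup from earlier turns that into an NA name.

```r
library(dplyr)
library(tidyr)

# Example data from the question, plus a hypothetical fifth row with no answers.
df <- tibble(
  year   = c(2015, 2015, 2016, 2016, NA),
  grade  = c(1, NA, 1, NA, NA),
  height = c("short", "tall", NA, NA, NA),
  gender = c(NA, "m", NA, "f", NA)
)

last_answered <- df %>%
  mutate(id = row_number()) %>%
  # Mixed column types must be coerced to a common type before stacking.
  pivot_longer(-id, names_to = "key", values_to = "value",
               values_transform = list(value = as.character)) %>%
  group_by(id) %>%
  # Within each respondent, rows follow the original column order.
  mutate(event = row_number()) %>%
  summarize(
    # The leading 0 makes max() well defined when nothing was answered.
    lastqnum = max(c(0L, event[!is.na(value)])),
    lastq = c(NA_character_, key)[lastqnum + 1]
  )

print(last_answered)
```

The result has one row per respondent, so it can be joined back to the original data by row number if the new columns are wanted alongside the survey answers.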