The data looks like this:
subject x1 x2 x3 x4 x5 x6 x7
a 0.1 NA 0.2 0.1 0.1 NA 0.9
b NA NA -0.01 NA 0.3 0.8 0.01
c NA NA NA NA NA 0.9 0.4
d NA NA 0.01 NA NA NA 0.05
How can I append a new variable holding the maximum number of consecutive NAs to this data.frame?
subject x1 x2 x3 x4 x5 x6 x7 NA_consecutive
a 0.1 NA 0.2 0.1 0.1 NA 0.9 1
b NA NA -0.01 NA 0.3 0.8 0.01 2 (max NA, not 1!!)
c NA NA NA NA NA 0.9 0.4 5
d NA NA 0.01 NA NA NA 0.05 3 (max NA, not 2!!)
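I want to count the consecutive NAs per subject (i.e. per row).
In short, I tried using duplicate, but it flags every repeated value, including ordinary values, not just NA.
If I convert this dataset to long format with df %>% gather(variable, value, -subject), I get: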
subject variable value
1 a x1 0.1
2 a x2 NA
3 a x3 0.2
4 a x4 0.1
5 a x5 0.1
6 a x6 NA
7 a x7 0.9
8 b x1 NA
9 b x2 NA
10 b x3 -0.01
..
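Would this form be easier to work with?
I don't care what shape the table ends up in; I just need the new information (the maximum number of consecutive NAs).
If possible, please avoid for loops (they need not be ruled out entirely), because this data set is very large.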
Answer 0 (score: 2)
Here is a tidyverse option:
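library(dplyr)
library(tidyr)

df %>%
  gather(k, v, -subject) %>%   # wide -> long
  arrange(subject, k) %>%
  group_by(subject) %>%
  # start a new run id whenever the NA status flips
  mutate(grp = cumsum(c(0, abs(diff(!is.na(v))) == 1))) %>%
  add_count(subject, grp) %>%  # n = length of each run
  mutate(NA_consecutive = max(n[is.na(v)])) %>%
  select(-grp, -n) %>%
  spread(k, v)                 # long -> wide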
## A tibble: 4 x 9
## Groups: subject [4]
# subject NA_consecutive x1 x2 x3 x4 x5 x6 x7
# <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 a 1 0.100 NA 0.200 0.100 0.100 NA 0.900
#2 b 2 NA NA -0.0100 NA 0.300 0.800 0.0100
#3 c 5 NA NA NA NA NA 0.900 0.400
#4 d 3 NA NA 0.0100 NA NA NA 0.0500
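To make the grp step in the pipeline above easier to follow, here is a minimal base R sketch on a single vector (the values of subject b). The names v, flips, grp, n and max_na are illustrative only, and ave() stands in for add_count(); this is a sketch of the idea, not the answer's exact code path.

```r
# Values of subject b from the question
v <- c(NA, NA, -0.01, NA, 0.3, 0.8, 0.01)

# TRUE wherever the NA status differs from the previous element
flips <- abs(diff(!is.na(v))) == 1

# Run id: increases by one at every flip, so each run shares one id
grp <- cumsum(c(0, flips))

# Length of the run each element belongs to (stand-in for add_count())
n <- ave(seq_along(v), grp, FUN = length)

# Longest run among the NA elements
max_na <- max(n[is.na(v)])
max_na
```

For this vector the run ids come out as 0 0 1 2 3 3 3, and the longest NA run is 2, matching the expected result for subject b.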
Answer 1 (score: 1)
Below is a suggested solution using data.table. If the OP only wants a tidyverse solution, I will delete it:
# Count consecutive NAs: convert to long format, use rle() to get the
# lengths of the NA runs, then extract the longest run per subject
consecNA <- melt(dat, id.vars = "subject")[, {
    r <- rle(is.na(value))
    max(r$lengths[r$values])
}, by = .(subject)]
# Perform an update join (i.e. a lookup)
dat[consecNA, NA_consecutive := V1, on = .(subject)]
dat
Another possible approach is:
dat[, NA_cons := apply(.SD, 1, function(x) {
    r <- rle(is.na(x))
    max(r$lengths[r$values])
}), by = .(subject)]
Or, equivalently, in base R:
dat$NA_cons <- apply(dat[, paste0("x", 1:7)], 1, function(x) {
    r <- rle(is.na(x))
    max(r$lengths[r$values])
})
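One caveat with the rle() snippets above: for a row that contains no NA at all, r$lengths[r$values] is empty, so max() returns -Inf with a warning. A minimal sketch of a guarded version follows; max_consec_na is a hypothetical helper name, and toy is a made-up data frame (rows a and d are taken from the question, row e is invented to show the no-NA case).

```r
# Hypothetical helper (name is illustrative): longest run of NA in a vector
max_consec_na <- function(x) {
  r <- rle(is.na(x))
  # Prepending 0 makes max() return 0 instead of -Inf when x has no NA
  max(c(0L, r$lengths[r$values]))
}

# Rows a and d come from the question; row e is invented and has no NA
toy <- data.frame(
  subject = c("a", "d", "e"),
  x1 = c(0.1, NA, 0.5),
  x2 = c(NA, NA, 0.4),
  x3 = c(0.2, 0.01, 0.3),
  x4 = c(0.1, NA, 0.2),
  x5 = c(0.1, NA, 0.1),
  x6 = c(NA, NA, 0.6),
  x7 = c(0.9, 0.05, 0.7)
)

# Apply the helper row by row to every column except subject
toy$NA_cons <- apply(toy[-1], 1, max_consec_na)
toy
```

With this guard, rows a and d still give 1 and 3, while row e gives 0 rather than -Inf.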
Data:
library(data.table)
dat <- fread("subject x1 x2 x3 x4 x5 x6 x7
a 0.1 NA 0.2 0.1 0.1 NA 0.9
b NA NA -0.01 NA 0.3 0.8 0.01
c NA NA NA NA NA 0.9 0.4
d NA NA 0.01 NA NA NA 0.05")
cols <- paste0("x", 1:7)
dat[, (cols) := lapply(.SD, as.numeric), .SDcols=cols]
Answer 2 (score: 0)
df$NA_consecutive <- apply(df[-1], 1, function(x) max(rle(is.na(x))$lengths[rle(is.na(x))$values]))
df
# # A tibble: 4 x 9
# subject x1 x2 x3 x4 x5 x6 x7 NA_consecutive
# <chr> <dbl> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1 a 0.1 NA 0.2 0.1 0.1 NA 0.9 1
# 2 b NA NA -0.01 NA 0.3 0.8 0.01 2
# 3 c NA NA NA NA NA 0.9 0.4 5
# 4 d NA NA 0.01 NA NA NA 0.05 3
Data:
df <- data.frame(
  subject = c("a", "b", "c", "d"),
  x1 = c(.1, rep(NA, 3)),
  x2 = rep(NA, 4),
  x3 = c(.2, -.01, NA, .01),
  x4 = c(.1, rep(NA, 3)),
  x5 = c(.1, .3, NA, NA),
  x6 = c(NA, .8, .9, NA),
  x7 = c(.9, .01, .4, .05)
)
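Since the question says the data is very large, one possible alternative to the row-wise apply() is a running counter that loops over the columns only (7 iterations here) and stays vectorized across rows. This is a sketch of a different technique from the answers above, not a benchmarked recommendation; the names m, run and best are illustrative, and the data frame is rebuilt so the block runs on its own.

```r
# Same example data as in the question
df <- data.frame(
  subject = c("a", "b", "c", "d"),
  x1 = c(.1, rep(NA, 3)),
  x2 = rep(NA, 4),
  x3 = c(.2, -.01, NA, .01),
  x4 = c(.1, rep(NA, 3)),
  x5 = c(.1, .3, NA, NA),
  x6 = c(NA, .8, .9, NA),
  x7 = c(.9, .01, .4, .05)
)

# Logical matrix: TRUE where the value is NA
m <- is.na(as.matrix(df[-1]))

run  <- integer(nrow(m))  # current NA streak of each row
best <- integer(nrow(m))  # longest NA streak seen so far in each row

for (j in seq_len(ncol(m))) {
  # Extend the streak where column j is NA, reset it to 0 elsewhere
  run  <- (run + 1L) * m[, j]
  best <- pmax(best, run)
}

df$NA_consecutive <- best
df
```

The loop runs once per column rather than once per row, so its cost grows with the number of columns while each step handles all rows at once. Rows without any NA get 0.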