Count consecutive NAs

Time: 2018-08-14 23:59:59

Tags: r tidyverse missing-data

My data looks like this:

subject x1   x2   x3   x4   x5   x6   x7        
a       0.1  NA   0.2  0.1  0.1  NA   0.9        
b       NA   NA  -0.01 NA   0.3  0.8  0.01
c       NA   NA   NA   NA   NA   0.9  0.4
d       NA   NA  0.01  NA   NA   NA   0.05

How can I append a new variable to this data.frame that holds the maximum number of consecutive NAs in each row?

subject x1   x2   x3   x4   x5   x6   x7    NA_consecutive    
a       0.1  NA   0.2  0.1  0.1  NA   0.9        1
b       NA   NA  -0.01 NA   0.3  0.8  0.01       2 (max NA, not 1!!)
c       NA   NA   NA   NA   NA   0.9  0.4        5
d       NA   NA  0.01  NA   NA   NA   0.05       3 (max NA, not 2!!)

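I want to count the consecutive NAs per subject (that is, per row). In short, I tried using duplicated, but it flags every repeated value, including ordinary values, not just the NAs.

If I convert this dataset to long format with df %>% gather(variable, value, -subject), I get: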

   subject variable  value
 1 a       x1         0.1 
 2 a       x2         NA   
 3 a       x3         0.2 
 4 a       x4         0.1 
 5 a       x5         0.1 
 6 a       x6         NA   
 7 a       x7         0.9 
 8 b       x1         NA   
 9 b       x2         NA   
10 b       x3        -0.01
..

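Would this form be easier to work with?

I don't mind which table format is used, as long as I get the new information (the maximum number of consecutive NAs).

If possible, please avoid for loops (they are not strictly forbidden), because this dataset is very large.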

3 answers:

Answer 0: (score: 2)

Here is a tidyverse option:

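library(dplyr)
library(tidyr)

df %>%
    gather(k, v, -subject) %>%                 # wide to long
    arrange(subject, k) %>%
    group_by(subject) %>%
    # start a new run id whenever v switches between NA and non-NA
    mutate(grp = cumsum(c(0, abs(diff(!is.na(v))) == 1))) %>%
    add_count(subject, grp) %>%                # n = length of each run
    mutate(NA_consecutive = max(n[is.na(v)])) %>%
    select(-grp, -n) %>%
    spread(k, v)                               # back to wide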
## A tibble: 4 x 9
## Groups:   subject [4]
#  subject NA_consecutive     x1    x2       x3     x4     x5     x6     x7
#  <fct>            <int>  <dbl> <dbl>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#1 a                    1  0.100    NA   0.200   0.100  0.100 NA     0.900
#2 b                    2 NA        NA  -0.0100 NA      0.300  0.800 0.0100
#3 c                    5 NA        NA  NA      NA     NA      0.900 0.400
#4 d                    3 NA        NA   0.0100 NA     NA     NA     0.0500
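
To make the grp step above easier to follow, here is a minimal base R sketch that applies the same run-labelling idea to row b alone. The vector v and the other variable names are illustrative, and tapply is swapped in for add_count just to keep the example dependency-free.

```r
# Values of row b from the question (illustrative copy)
v <- c(NA, NA, -0.01, NA, 0.3, 0.8, 0.01)

# TRUE where a value is present, FALSE where it is NA
present <- !is.na(v)

# diff() on a logical vector yields -1, 0 or 1; a non-zero value marks a
# switch between NA and non-NA, so the cumulative sum labels each run
grp <- cumsum(c(0, abs(diff(present)) == 1))

# Number of NAs inside each run (0 for runs of present values)
na_per_run <- tapply(is.na(v), grp, sum)

# Longest run of NAs in this row
max_na <- max(na_per_run)
max_na
```

For row b the run labels are 0 0 1 2 3 3 3, and the longest NA run has length 2, matching the expected output in the question.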

Answer 1: (score: 1)

Here is a suggested solution using data.table. If the OP only wants a tidyverse solution, I will delete it:

#count the number of consecutive NAs by converting to long format,
#using rle to find runs of NAs, and then extracting the longest run
consecNA <- melt(dat, id.vars="subject")[, {
        r <- rle(is.na(value))
        max(r$lengths[r$values])
    }, by=.(subject)]

#perform an update join (i.e. a lookup)
dat[consecNA, NA_consecutive := V1, on=.(subject)]
dat

Another possible approach is:

dat[, NA_cons := apply(.SD, 1, function(x) {
        r <- rle(is.na(x))
        max(r$lengths[r$values])
    }), by=.(subject)]

Or, equivalently, in base R:

dat$NA_cons <- apply(dat[, paste0("x", 1:7)], 1, function(x) {
        r <- rle(is.na(x))
        max(r$lengths[r$values])
    })
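
One caveat with the rle snippets above: if a row contains no NA at all, r$lengths[r$values] is empty and max() returns -Inf with a warning. The sample data never hits that case, but a large dataset might. A minimal sketch of a guard, where max_na_run is a hypothetical helper name and not part of the answer:

```r
# Hypothetical helper: longest run of consecutive NAs in a vector,
# returning 0 when the vector contains no NA at all
max_na_run <- function(x) {
  r <- rle(is.na(x))
  na_lengths <- r$lengths[r$values]   # lengths of the NA runs only
  if (length(na_lengths) == 0L) 0L else max(na_lengths)
}

max_na_run(c(0.1, NA, 0.2, 0.1, 0.1, NA, 0.9))  # row a from the question: 1
max_na_run(c(0.1, 0.2, 0.3))                    # no NA at all: 0
```

The helper could then be passed to apply in place of the anonymous function, for example apply(dat[, paste0("x", 1:7)], 1, max_na_run).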

Data:

library(data.table)    
dat <- fread("subject x1   x2   x3   x4   x5   x6   x7        
a       0.1  NA   0.2  0.1  0.1  NA   0.9        
b       NA   NA  -0.01 NA   0.3  0.8  0.01
c       NA   NA   NA   NA   NA   0.9  0.4
d       NA   NA  0.01  NA   NA   NA   0.05")
cols <- paste0("x", 1:7)
dat[, (cols) := lapply(.SD, as.numeric), .SDcols=cols]

Answer 2: (score: 0)

df$NA_consecutive <- apply(df[-1], 1, function(x) max(rle(is.na(x))$lengths[rle(is.na(x))$values]))

df
# # A tibble: 4 x 9
#   subject    x1 x2        x3    x4    x5    x6    x7 NA_consecutive
#   <chr>   <dbl> <lgl>  <dbl> <dbl> <dbl> <dbl> <dbl>          <int>
# 1 a         0.1 NA      0.2    0.1   0.1  NA    0.9               1
# 2 b        NA   NA     -0.01  NA     0.3   0.8  0.01              2
# 3 c        NA   NA     NA     NA    NA     0.9  0.4               5
# 4 d        NA   NA      0.01  NA    NA    NA    0.05              3

Data:

df <- data.frame(
  subject = c("a", "b", "c", "d"),
  x1 = c(.1, rep(NA, 3)),
  x2 = rep(NA, 4),
  x3 = c(.2, -.01, NA, .01),
  x4 = c(.1, rep(NA, 3)),
  x5 = c(.1, .3, NA, NA),
  x6 = c(NA, .8, .9, NA), 
  x7 = c(.9, .01, .4, .05)
)
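
Since the question mentions that the data is very large, a possible alternative to a row-wise apply is to loop over the columns instead and keep a running NA streak for all rows at once. This is only a sketch under the assumption that there are far fewer columns than rows. The names na_mat, run and best are illustrative, and no benchmark is claimed here.

```r
# Example data copied from the question
df <- data.frame(
  subject = c("a", "b", "c", "d"),
  x1 = c(0.1, NA, NA, NA),
  x2 = c(NA, NA, NA, NA),
  x3 = c(0.2, -0.01, NA, 0.01),
  x4 = c(0.1, NA, NA, NA),
  x5 = c(0.1, 0.3, NA, NA),
  x6 = c(NA, 0.8, 0.9, NA),
  x7 = c(0.9, 0.01, 0.4, 0.05)
)

na_mat <- is.na(as.matrix(df[-1]))  # logical matrix, one row per subject
run    <- integer(nrow(na_mat))     # current NA streak for each row
best   <- integer(nrow(na_mat))     # longest NA streak seen so far

for (j in seq_len(ncol(na_mat))) {
  # extend the streak where column j is NA, reset it to 0 elsewhere
  run  <- (run + 1L) * na_mat[, j]
  best <- pmax(best, run)
}

df$NA_consecutive <- best
df
```

The loop runs once per column (7 times here), and each iteration is vectorized over all rows. Rows without any NA get 0 rather than -Inf.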