data.table: fill missing values from other rows within each group

Time: 2018-02-20 22:55:57

Tags: r data.table row na

# have
> aDT <- data.table(colA = c(1,1,1,1,2,2,2,2,3,3,3,3), colB = c(4,NA,NA,1,4,3,NA,NA,4,NA,2,NA))
> aDT
    colA colB
 1:    1    4
 2:    1   NA
 3:    1   NA
 4:    1    1
 5:    2    4
 6:    2    3
 7:    2   NA
 8:    2   NA
 9:    3    4
10:    3   NA
11:    3    2
12:    3   NA
# want
> bDT <- data.table(colA = c(1,1,1,1,2,2,2,2,3,3,3,3), colB = c(4,1,1,1,4,3,3,3,4,2,2,2))
> bDT
    colA colB
 1:    1    4
 2:    1    1
 3:    1    1
 4:    1    1
 5:    2    4
 6:    2    3
 7:    2    3
 8:    2    3
 9:    3    4
10:    3    2
11:    3    2
12:    3    2

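I want to fill the missing values according to the following algorithm. Within each group ('colA'):

  1. Use the value from the row below; if that is also NA, keep moving down until the last row of the group.
  2. If every row below is NA, look at the rows above instead (moving up one row at a time).
  3. If every value in the group is NA, leave it as NA.

The dataset is very large, so efficiency matters. I'm not sure whether a package already provides this kind of operation. How can I do this?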
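The three steps above can be read as the following base R sketch. The helper name fill_below_then_above is hypothetical (it does not come from any package), and the sketch assumes the rows of each group are already in the intended order.

```r
# Hypothetical helper: fill each NA from the nearest non-NA value below it,
# falling back to the nearest non-NA value above it.
fill_below_then_above <- function(x) {
  ok <- !is.na(x)
  if (!any(ok)) return(x)              # step 3: an all-NA group stays NA
  pos <- which(ok)
  # Index of the nearest non-NA at or below each position (NA if none)
  nxt <- rev(c(NA, rev(pos))[cumsum(rev(ok)) + 1L])
  # Index of the nearest non-NA at or above each position (NA if none)
  prv <- c(NA, pos)[cumsum(ok) + 1L]
  # Steps 1 and 2: prefer the value below, otherwise take the one above
  x[ifelse(is.na(nxt), prv, nxt)]
}

aDF <- data.frame(colA = c(1,1,1,1,2,2,2,2,3,3,3,3),
                  colB = c(4,NA,NA,1,4,3,NA,NA,4,NA,2,NA))

# ave() applies the helper separately to the colB values of each colA group
aDF$colB <- ave(aDF$colB, aDF$colA, FUN = fill_below_then_above)
```

On the sample data this reproduces the colB column of bDT. It is meant to pin down the intended behaviour rather than to be the fastest option.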

3 answers:

Answer 0 (score: 3)

Using data.table and zoo:

library(data.table)
library(zoo)

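# Pass 1: within each group, fill each NA from the next non-NA value below it
# (next observation carried backward)
dt[, colB := na.locf0(colB, fromLast = TRUE), by = colA]

# Pass 2: fill the remaining trailing NAs from the last non-NA value above
# (last observation carried forward). na.locf0 always returns a vector of the
# same length as its input; plain na.locf() drops leading NAs by default, so
# an all-NA group would come back shorter than the group.
dt[, colB := na.locf0(colB), by = colA][]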

Or in a single chain:

dt[, colB := na.locf0(colB, fromLast = TRUE), by = colA][
   , colB := na.locf0(colB), by = colA][]

Both return:

    colA colB
 1:    1    4
 2:    1    1
 3:    1    1
 4:    1    1
 5:    2    4
 6:    2    3
 7:    2    3
 8:    2    3
 9:    3    4
10:    3    2
11:    3    2
12:    3    2

Data:

text <- "colA colB
    1    4
    1   NA
    1   NA
    1    1
    2    4
    2    3
    2   NA
    2   NA
    3    4
    3   NA
    3    2
    3   NA"

dt <- fread(input = text, stringsAsFactors = FALSE)
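As a possible zoo-free variant of the same two-pass idea: newer data.table releases (1.12.4 and later, as far as I know, which postdates this answer) include nafill(), with type = "nocb" for next observation carried backward and type = "locf" for last observation carried forward. The sketch below assumes such a version is installed; dt2 is just an illustrative name so the dt above is left untouched.

```r
library(data.table)

dt2 <- data.table(colA = c(1,1,1,1,2,2,2,2,3,3,3,3),
                  colB = c(4,NA,NA,1,4,3,NA,NA,4,NA,2,NA))

# Inner call: fill each NA from the next non-NA value below it (nocb).
# Outer call: fill whatever is still NA from the value above it (locf).
# A group that is entirely NA is returned unchanged.
dt2[, colB := nafill(nafill(colB, type = "nocb"), type = "locf"), by = colA]
```

nafill() was introduced for numeric (integer and double) columns, so other column types may still need the zoo approach.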

Answer 1 (score: 2)

Here is one way using tidyverse and zoo::na.locf:

library(tidyverse)
library(zoo)
df %>%
    group_by(colA) %>%
    arrange(colA) %>%
    mutate(colB = na.locf(colB, na.rm = FALSE, fromLast = TRUE)) %>%
    mutate(colB = na.locf(colB, na.rm = FALSE))
## A tibble: 12 x 2
## Groups:   colA [3]
#    colA  colB
#   <dbl> <dbl>
# 1  1.00  4.00
# 2  1.00  1.00
# 3  1.00  1.00
# 4  1.00  1.00
# 5  2.00  4.00
# 6  2.00  3.00
# 7  2.00  3.00
# 8  2.00  3.00
# 9  3.00  4.00
#10  3.00  2.00
#11  3.00  2.00
#12  3.00  2.00

The data.table way:

library(data.table)
dt[, .(na.locf(na.locf(colB, na.rm = FALSE, fromLast = TRUE), na.rm = FALSE)), by = .(colA)]
#    colA V1
# 1:    1  4
# 2:    1  1
# 3:    1  1
# 4:    1  1
# 5:    2  4
# 6:    2  3
# 7:    2  3
# 8:    2  3
# 9:    3  4
#10:    3  2
#11:    3  2
#12:    3  2

The key in both cases is to apply na.locf twice: first to replace NAs from the bottom, then to replace any remaining NAs from the top.
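To make the two passes concrete, here is a small sketch on a single vector, using the colB values of group colA == 3 from the sample data. The variable names pass1 and pass2 are only for illustration.

```r
library(zoo)

x <- c(4, NA, 2, NA)   # colB for the group colA == 3

# Pass 1: fill from the bottom. The trailing NA has no value below it,
# so it survives this pass.
pass1 <- na.locf(x, na.rm = FALSE, fromLast = TRUE)
pass1
# [1]  4  2  2 NA

# Pass 2: fill from the top, which picks up the leftover trailing NA.
pass2 <- na.locf(pass1, na.rm = FALSE)
pass2
# [1] 4 2 2 2
```

Setting na.rm = FALSE keeps the result the same length as the input, which is what makes it safe to use inside mutate() or by-group expressions.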

Sample data

# As data.frame
df <- data.frame(colA = c(1,1,1,1,2,2,2,2,3,3,3,3), colB = c(4,NA,NA,1,4,3,NA,NA,4,NA,2,NA));
# As data.table
dt <- data.table(colA = c(1,1,1,1,2,2,2,2,3,3,3,3), colB = c(4,NA,NA,1,4,3,NA,NA,4,NA,2,NA));

Answer 2 (score: 1)

library(tidyverse)

aDT %>% group_by(colA) %>% fill(colB, .direction = "up") %>% fill(colB)
# A tibble: 12 x 2
# Groups:   colA [3]
    colA  colB
   <dbl> <dbl>
 1     1     4
 2     1     1
 3     1     1
 4     1     1
 5     2     4
 6     2     3
 7     2     3
 8     2     3
 9     3     4
10     3     2
11     3     2
12     3     2