Adding missing rows within factor combinations

Time: 2017-06-09 13:19:21

Tags: r data.table dplyr

My data frame is most closely approximated by:

library(data.table)
z <- rep("z",5)
y <- c(rep("st",2),rep("co",2),"fu")
var1 <- c(rep("a",2),rep("b",2),"c")
var2 <- c("y","y","y","z","x")
transp <- c("bus","plane","train","bus","bus")
sample1 <- sample(1:10, 5)
sample2 <- sample(1:10, 5)
df <- cbind(z,y,var1,var2,transp,sample1,sample2)
df<-as.data.table(df)
> df
   z  y var1 var2 transp sample1 sample2
1: z st    a    y    bus       4       3
2: z st    a    y  plane      10       7
3: z co    b    y  train       8       9
4: z co    b    z    bus       1       5
5: z fu    c    x    bus       6       4

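Every unique combination of var1 and var2 that I need already exists in the table. I want to expand the table so that each var1/var2 combination includes all of the transp options in this list: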

transtype <- c("bus","train")

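Note that "plane" is an option in df but not in transtype. I want to keep the rows that contain transp = "plane", but not expand the table by adding "plane" rows. Columns z and y need to be filled with the appropriate values, and sample1 and sample2 should be NA. The result should be: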

> result
   z  y var1 var2 transp sample1 sample2
1: z st    a    y    bus       4       3
2: z st    a    y  plane      10       7
3: z st    a    y  train      NA      NA
4: z co    b    y  train       8       9
5: z co    b    y    bus      NA      NA
6: z co    b    z    bus       1       5
7: z co    b    z  train      NA      NA
8: z fu    c    x    bus       6       4
9: z fu    c    x  train      NA      NA

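The data.table options I came up with, based on "Fastest way to add rows for missing values in a data.frame?" and "Data.table: Add rows for missing combinations of 2 factors without losing associated descriptive factors", end up expanding to all unique combinations of var1 and var2, not just the combinations that already exist in the table. I also don't know how to keep the values of z and y. Something like this: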

setkey(df, var1, var2, transp)
x<-df[CJ(var1, var2, transp, unique=T)]

Maybe I should use dplyr? Or maybe I am missing something simple? I looked through the data.table documentation and could not come up with a solution.

2 answers:

Answer 0 (score: 4)

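Here is a solution using dplyr and tidyr, in particular tidyr::complete and tidyr::nesting. The latter is useful for completing with only the combinations that actually appear in the dataset, whereas complete on its own gives you all combinations.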

library(dplyr)
library(tidyr)
df %>% 
  filter(transp %in% transtype)  %>%
  complete(nesting(z, y, var1, var2), transp) %>%
  union(df)
# A tibble: 9 × 7
      z     y  var1  var2 transp sample1 sample2
  <chr> <chr> <chr> <chr>  <chr>   <chr>   <chr>
1     z    st     a     y  plane      10      10
2     z    st     a     y  train    <NA>    <NA>
3     z    st     a     y    bus       1       9
4     z    fu     c     x  train    <NA>    <NA>
5     z    fu     c     x    bus       5       3
6     z    co     b     z  train    <NA>    <NA>
7     z    co     b     z    bus       6       6
8     z    co     b     y  train       3       2
9     z    co     b     y    bus    <NA>    <NA>
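
The pipeline above takes the set of transp values from the filtered data, so it only yields both "bus" and "train" when each of them occurs at least once in df. A minimal sketch of a variant that passes transtype to complete directly is shown below. It assumes the question's data with fixed sample values, built here with data.frame rather than cbind so the sample columns stay numeric. Because complete performs a full join back onto the original data, the "plane" row should be retained without the filter and union steps.

```r
library(dplyr)
library(tidyr)

# Same rows as in the question, with fixed values (assumption: numeric samples)
df <- data.frame(
  z       = "z",
  y       = c("st", "st", "co", "co", "fu"),
  var1    = c("a", "a", "b", "b", "c"),
  var2    = c("y", "y", "y", "z", "x"),
  transp  = c("bus", "plane", "train", "bus", "bus"),
  sample1 = c(4, 10, 8, 1, 6),
  sample2 = c(3, 7, 9, 5, 4),
  stringsAsFactors = FALSE
)
transtype <- c("bus", "train")

# nesting() keeps only the z/y/var1/var2 combinations present in df;
# transp = transtype supplies the values to expand over explicitly
result <- df %>%
  complete(nesting(z, y, var1, var2), transp = transtype)

result
```

Row order in the result may differ from the desired output in the question; add arrange() if the order matters.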

Answer 1 (score: 3)

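To get only the unique combinations that already exist in df, it is best to create a new reference data.table using by and then merge it with the original.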

Using:

df2 <- df[, .(transp = transtype), by = .(var1,var2)]
merge(df, df2, by = c('var1','var2','transp'), all = TRUE)

gives:

   var1 var2 transp  z  y sample1 sample2
1:    a    y    bus  z st       4       3
2:    a    y  plane  z st      10       7
3:    a    y  train NA NA      NA      NA
4:    b    y    bus NA NA      NA      NA
5:    b    y  train  z co       8       9
6:    b    z    bus  z co       1       5
7:    b    z  train NA NA      NA      NA
8:    c    x    bus  z fu       6       4
9:    c    x  train NA NA      NA      NA

If the z and y columns should not contain NA values, you can do the following:

df2 <- df[, .(transp = transtype), by = .(var1,var2,z,y)]
merge(df, df2, by = c('var1','var2','transp','z','y'), all = TRUE)

gives:

   var1 var2 transp z  y sample1 sample2
1:    a    y    bus z st       4       3
2:    a    y  plane z st      10       7
3:    a    y  train z st      NA      NA
4:    b    y    bus z co      NA      NA
5:    b    y  train z co       8       9
6:    b    z    bus z co       1       5
7:    b    z  train z co      NA      NA
8:    c    x    bus z fu       6       4
9:    c    x  train z fu      NA      NA

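Note: if the z and y columns have more than one unique value per var1/var2 combination, it is better to use na.locf from the zoo package to fill the NA values in the z and y columns.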
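
The na.locf approach mentioned in the note could look roughly like the sketch below. It starts from the first merge (by var1, var2, transp), which leaves NA in z and y, and fills them within each var1/var2 group. The helper fill_both is a hypothetical name introduced here, not part of the original answer. It fills forward and then backward, since merge sorts the rows and an added row can land before the first observed row of its group. When a group holds several distinct z or y values, the value that gets filled in depends on row order, so check the result against your own data.

```r
library(data.table)
library(zoo)

# Data from the question, with fixed sample values
df <- data.table(
  z       = "z",
  y       = c("st", "st", "co", "co", "fu"),
  var1    = c("a", "a", "b", "b", "c"),
  var2    = c("y", "y", "y", "z", "x"),
  transp  = c("bus", "plane", "train", "bus", "bus"),
  sample1 = c(4, 10, 8, 1, 6),
  sample2 = c(3, 7, 9, 5, 4)
)
transtype <- c("bus", "train")

# Reference table: every transtype value for each existing var1/var2 pair
df2 <- df[, .(transp = transtype), by = .(var1, var2)]
res <- merge(df, df2, by = c("var1", "var2", "transp"), all = TRUE)

# Hypothetical helper: carry the last observation forward, then fill any
# leading NAs backward from the next observation
fill_both <- function(x) {
  x <- na.locf(x, na.rm = FALSE)
  na.locf(x, na.rm = FALSE, fromLast = TRUE)
}

# Fill z and y by reference within each var1/var2 group
res[, c("z", "y") := lapply(.SD, fill_both),
    by = .(var1, var2), .SDcols = c("z", "y")]

res
```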

Used data:

df <- fread("z  y var1 var2 transp sample1 sample2
 z st    a    y    bus       4       3
 z st    a    y  plane      10       7
 z co    b    y  train       8       9
 z co    b    z    bus       1       5
 z fu    c    x    bus       6       4")