The closest approximation to my data frame is:
library(data.table)
z <- rep("z",5)
y <- c(rep("st",2),rep("co",2),"fu")
var1 <- c(rep("a",2),rep("b",2),"c")
var2 <- c("y","y","y","z","x")
transp <- c("bus","plane","train","bus","bus")
sample1 <- sample(1:10, 5)
sample2 <- sample(1:10, 5)
df <- cbind(z,y,var1,var2,transp,sample1,sample2)
df<-as.data.table(df)
> df
   z  y var1 var2 transp sample1 sample2
1: z st    a    y    bus       4       3
2: z st    a    y  plane      10       7
3: z co    b    y  train       8       9
4: z co    b    z    bus       1       5
5: z fu    c    x    bus       6       4
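A side note on the construction above: cbind() on a mix of character and numeric vectors first builds a character matrix, so sample1 and sample2 end up as character columns rather than integers. A minimal sketch that keeps the column types, assuming the same values as above (set.seed and the names df_typed and df_cbind are additions for illustration, not part of the original post):

```r
library(data.table)

set.seed(1)  # assumption: fixed seed so the random columns are reproducible

# Build the table column by column so each column keeps its own type
df_typed <- data.table(
  z       = rep("z", 5),
  y       = c(rep("st", 2), rep("co", 2), "fu"),
  var1    = c(rep("a", 2), rep("b", 2), "c"),
  var2    = c("y", "y", "y", "z", "x"),
  transp  = c("bus", "plane", "train", "bus", "bus"),
  sample1 = sample(1:10, 5),
  sample2 = sample(1:10, 5)
)

# For comparison: the cbind() route coerces everything to character
df_cbind <- as.data.table(cbind(z = "z", n = 1:2))

sapply(df_typed, class)  # sample1 and sample2 are integer
sapply(df_cbind, class)  # both columns are character
```

With typed columns, the missing sample values in the expanded table are integer NA rather than character NA.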
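All the unique combinations of var1 and var2 that I care about already exist in the table. I want to expand the table so that every var1/var2 combination includes all the transp options in a list: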
transtype <- c("bus","train")
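Note that "plane" is an option in df but not in transtype. I want to keep the rows with transp = "plane", but not expand by adding "plane" rows. Columns z and y need to be filled with the appropriate values, and sample1 and sample2 should be NA. The result should be: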
> result
   z  y var1 var2 transp sample1 sample2
1: z st    a    y    bus       4       3
2: z st    a    y  plane      10       7
3: z st    a    y  train      NA      NA
4: z co    b    y  train       8       9
5: z co    b    y    bus      NA      NA
6: z co    b    z    bus       1       5
7: z co    b    z  train      NA      NA
8: z fu    c    x    bus       6       4
9: z fu    c    x  train      NA      NA
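The data.table options I came up with, based on "Fastest way to add rows for missing values in a data.frame?" and "Data.table: Add rows for missing combinations of 2 factors without losing associated descriptive factors", end up expanding over all unique combinations of var1 and var2, not just the combinations that already exist in the table. I also don't know how to preserve the values of z and y. Something like this: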
setkey(df, var1, var2, transp)
x<-df[CJ(var1, var2, transp, unique=T)]
Maybe I should be using dplyr? Or maybe I'm missing something simple? I went through the data.table documentation and couldn't come up with a solution.
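To make the over-expansion concrete: CJ() crosses the unique values of each column independently, so with three distinct values each in var1, var2, and transp it produces 3 × 3 × 3 = 27 key combinations, most of which never occur in the data. A small sketch, assuming the example table is rebuilt with the sample values printed above (the name df_demo is illustrative):

```r
library(data.table)

df_demo <- data.table(
  z       = "z",
  y       = c("st", "st", "co", "co", "fu"),
  var1    = c("a", "a", "b", "b", "c"),
  var2    = c("y", "y", "y", "z", "x"),
  transp  = c("bus", "plane", "train", "bus", "bus"),
  sample1 = c(4L, 10L, 8L, 1L, 6L),
  sample2 = c(3L, 7L, 9L, 5L, 4L)
)
setkey(df_demo, var1, var2, transp)

# Cross join of the unique values of every key column, then join back to the table
x <- df_demo[CJ(var1, var2, transp, unique = TRUE)]

nrow(x)                           # 27 rows instead of the desired 9
nrow(unique(x[, .(var1, var2)]))  # 9 var1/var2 pairs, although only 4 exist in df_demo
```

The unmatched rows also carry NA in z and y, which is the second half of the problem described above.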
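Answer 0: (score: 4)
Here is a solution using dplyr and tidyr, in particular tidyr::complete and tidyr::nesting. The latter is useful for completing over only the combinations that actually appear in the data set, whereas complete on its own gives you all combinations.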
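library(dplyr)
library(tidyr)
df %>%
  filter(transp %in% transtype) %>%                # keep only the transport types to expand over
  complete(nesting(z, y, var1, var2), transp) %>%  # add missing transp rows per existing z/y/var1/var2 combination
  union(df)                                        # add back the rows dropped by the filter (e.g. "plane")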
# A tibble: 9 × 7
  z     y     var1  var2  transp sample1 sample2
  <chr> <chr> <chr> <chr> <chr>  <chr>   <chr>
1 z     st    a     y     plane  10      10
2 z     st    a     y     train  <NA>    <NA>
3 z     st    a     y     bus    1       9
4 z     fu    c     x     train  <NA>    <NA>
5 z     fu    c     x     bus    5       3
6 z     co    b     z     train  <NA>    <NA>
7 z     co    b     z     bus    6       6
8 z     co    b     y     train  3       2
9 z     co    b     y     bus    <NA>    <NA>
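One thing to be aware of with this approach: complete() given a bare transp only fills in values that actually occur in the filtered data, so a transport type listed in transtype but absent from df would not be added. A hedged variant, assuming the example data with the sample values from the question (df_tbl and result are illustrative names), passes the allowed values explicitly:

```r
library(dplyr)
library(tidyr)

df_tbl <- tibble(
  z       = "z",
  y       = c("st", "st", "co", "co", "fu"),
  var1    = c("a", "a", "b", "b", "c"),
  var2    = c("y", "y", "y", "z", "x"),
  transp  = c("bus", "plane", "train", "bus", "bus"),
  sample1 = c(4L, 10L, 8L, 1L, 6L),
  sample2 = c(3L, 7L, 9L, 5L, 4L)
)
transtype <- c("bus", "train")

result <- df_tbl %>%
  filter(transp %in% transtype) %>%
  # nesting() restricts to z/y/var1/var2 combinations present in the data;
  # the named argument supplies the full set of transp values to complete over
  complete(nesting(z, y, var1, var2), transp = transtype) %>%
  union(df_tbl) %>%               # restore rows removed by the filter
  arrange(var1, var2, transp)     # optional: stable row order

result
```

The arrange() call is only there to make the row order predictable; it is not needed for the expansion itself.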
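Answer 1: (score: 3)
To get only the unique combinations that already exist in df, it is best to create a new reference data.table using by, and then merge it with the original.
Using: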
df2 <- df[, .(transp = transtype), by = .(var1,var2)]
merge(df, df2, by = c('var1','var2','transp'), all = TRUE)
gives:
   var1 var2 transp  z  y sample1 sample2
1:    a    y    bus  z st       4       3
2:    a    y  plane  z st      10       7
3:    a    y  train NA NA      NA      NA
4:    b    y    bus NA NA      NA      NA
5:    b    y  train  z co       8       9
6:    b    z    bus  z co       1       5
7:    b    z  train NA NA      NA      NA
8:    c    x    bus  z fu       6       4
9:    c    x  train NA NA      NA      NA
If you don't want NA values in the z and y columns, you can include them in the grouping and in the merge keys:
df2 <- df[, .(transp = transtype), by = .(var1,var2,z,y)]
merge(df, df2, by = c('var1','var2','transp','z','y'), all = TRUE)
gives:
   var1 var2 transp z  y sample1 sample2
1:    a    y    bus z st       4       3
2:    a    y  plane z st      10       7
3:    a    y  train z st      NA      NA
4:    b    y    bus z co      NA      NA
5:    b    y  train z co       8       9
6:    b    z    bus z co       1       5
7:    b    z  train z co      NA      NA
8:    c    x    bus z fu       6       4
9:    c    x  train z fu      NA      NA
Note: if the z and y columns have multiple unique values for each var1/var2 combination, it is better to use na.locf from the zoo package to fill the NA values in the z and y columns.
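The na.locf idea in the note can be sketched as follows. This is one possible reading rather than the answerer's own code: it merges on var1, var2, and transp only, then fills z and y within each var1/var2 group, carrying values forward and then backward so that a leading NA is also filled. It assumes the zoo package is installed; the names ref, filled, and fill_both are illustrative.

```r
library(data.table)
library(zoo)

df <- fread("z y var1 var2 transp sample1 sample2
z st a y bus 4 3
z st a y plane 10 7
z co b y train 8 9
z co b z bus 1 5
z fu c x bus 6 4")
transtype <- c("bus", "train")

# Reference table: every transtype value for each existing var1/var2 pair
ref <- df[, .(transp = transtype), by = .(var1, var2)]
filled <- merge(df, ref, by = c("var1", "var2", "transp"), all = TRUE)

# Fill forward, then backward; na.rm = FALSE keeps the vector length unchanged
fill_both <- function(v) {
  na.locf(na.locf(v, na.rm = FALSE), fromLast = TRUE, na.rm = FALSE)
}
filled[, c("z", "y") := lapply(.SD, fill_both),
       by = .(var1, var2), .SDcols = c("z", "y")]

filled
```

If a var1/var2 group really does hold several distinct z or y values, the filled value depends on row order within the group, so the result is worth checking by hand.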
Data used:
df <- fread("z y var1 var2 transp sample1 sample2
z st a y bus 4 3
z st a y plane 10 7
z co b y train 8 9
z co b z bus 1 5
z fu c x bus 6 4")