Add missing rows to a data.table based on multiple keyed columns

Time: 2014-12-09 05:26:13

Tags: r merge data.table cross-join

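I have a data.table object with several columns that together identify unique cases. In the small example below, the variables "name", "job", and "sex" specify the unique IDs. I would like to add the missing rows so that each case has one row for every possible value of another variable, "from" (similar to expand.grid).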

library(data.table)
set.seed(1)
mydata <- data.table(name = c("john","john","john","john","mary","chris","chris","chris"),
                 job = c("teacher","teacher","teacher","teacher","police","lawyer","lawyer","doctor"),
                 sex = c("male","male","male","male","female","female","male","male"),
                 from = c("NYT","USAT","BG","TIME","USAT","BG","NYT","NYT"),
                 score = rnorm(8))

setkeyv(mydata, cols=c("name","job","sex"))

mydata[CJ(unique(name, job, sex), unique(from))]

Here is the current data.table object:

> mydata
    name     job    sex from      score
1:  john teacher   male  NYT -0.6264538
2:  john teacher   male USAT  0.1836433
3:  john teacher   male   BG -0.8356286
4:  john teacher   male TIME  1.5952808
5:  mary  police female USAT  0.3295078
6: chris  lawyer female   BG -0.8204684
7: chris  lawyer   male  NYT  0.4874291
8: chris  doctor   male  NYT  0.7383247

Here is the result I would like:

> mydata
     name     job    sex from      score
 1:  john teacher   male  NYT -0.6264538
 2:  john teacher   male USAT  0.1836433
 3:  john teacher   male   BG -0.8356286
 4:  john teacher   male TIME  1.5952808
 5:  mary  police female  NYT         NA
 6:  mary  police female USAT  0.3295078
 7:  mary  police female   BG         NA
 8:  mary  police female TIME         NA
 9: chris  lawyer female  NYT         NA
10: chris  lawyer female USAT         NA
11: chris  lawyer female   BG -0.8204684
12: chris  lawyer female TIME         NA
13: chris  lawyer   male  NYT  0.4874291
14: chris  lawyer   male USAT         NA
15: chris  lawyer   male   BG         NA
16: chris  lawyer   male TIME         NA
17: chris  doctor   male  NYT  0.7383247
18: chris  doctor   male USAT         NA
19: chris  doctor   male   BG         NA
20: chris  doctor   male TIME         NA

Here is what I have tried:

setkeyv(mydata, cols=c("name","job","sex"))
mydata[CJ(unique(name, job, sex), unique(from))]

However, I get the following error, and adding fromLast = TRUE (or FALSE) does not give me the right solution:

Error in unique.default(name, job, sex) : 
  'fromLast' must be TRUE or FALSE

Here are the related answers I have come across (none of them seems to handle multiple keyed columns):

add missing rows to a data table

Efficiently inserting default missing rows in a data.table

Fastest way to add rows for missing values in a data.frame?

3 Answers:

Answer 0: (score: 4)

There are a few possibilities here: https://github.com/Rdatatable/data.table/pull/814

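# Cross join that also accepts tables as arguments
CJ.dt = function(...) {
  # Cross join the indices: row numbers for tables, positions for vectors
  rows = do.call(CJ, lapply(list(...), function(x) if (is.data.frame(x)) seq_len(nrow(x)) else seq_along(x)))
  # Subset each argument by its index column and bind the pieces column-wise
  do.call(data.table, Map(function(x, y) x[y], list(...), rows))
}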

setkey(mydata, name, job, sex, from)

mydata[CJ.dt(unique(data.table(name, job, sex)), unique(from))]
#     name     job    sex from      score
# 1: chris  doctor   male  NYT  0.7383247
# 2: chris  doctor   male   BG         NA
# 3: chris  doctor   male TIME         NA
# 4: chris  doctor   male USAT         NA
# 5: chris  lawyer female  NYT         NA
# 6: chris  lawyer female   BG -0.8204684
# 7: chris  lawyer female TIME         NA
# 8: chris  lawyer female USAT         NA
# 9: chris  lawyer   male  NYT  0.4874291
#10: chris  lawyer   male   BG         NA
#11: chris  lawyer   male TIME         NA
#12: chris  lawyer   male USAT         NA
#13:  john teacher   male  NYT -0.6264538
#14:  john teacher   male   BG -0.8356286
#15:  john teacher   male TIME  1.5952808
#16:  john teacher   male USAT  0.1836433
#17:  mary  police female  NYT         NA
#18:  mary  police female   BG         NA
#19:  mary  police female TIME         NA
#20:  mary  police female USAT  0.3295078
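
For context, the call in the question fails because unique() deduplicates a single vector: in unique(name, job, sex), the extra arguments are matched positionally to unique.default's incomparables and fromLast parameters, which is where the 'fromLast' error comes from. The combinations therefore have to be deduplicated as a table. As a rough alternative that avoids the helper function, the sketch below builds the grid with by= and joins with on=. It assumes a data.table version that supports the on= argument; the names ids, froms, grid, and result are illustrative and not part of the original answer.

```r
library(data.table)

set.seed(1)
mydata <- data.table(
  name  = c("john","john","john","john","mary","chris","chris","chris"),
  job   = c("teacher","teacher","teacher","teacher","police","lawyer","lawyer","doctor"),
  sex   = c("male","male","male","male","female","female","male","male"),
  from  = c("NYT","USAT","BG","TIME","USAT","BG","NYT","NYT"),
  score = rnorm(8)
)

# Distinct (name, job, sex) combinations, deduplicated as a table
ids <- unique(mydata[, .(name, job, sex)])

# Distinct values of the variable to expand over
froms <- unique(mydata$from)

# For every combination, emit one row per value of `from`
grid <- ids[, .(from = froms), by = .(name, job, sex)]

# Join the data onto the grid; combinations absent from mydata get score = NA
result <- mydata[grid, on = c("name", "job", "sex", "from")]
```

Because the join is driven by grid, the rows come out in the order the combinations first appear in mydata, with the values of from in their order of first appearance.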

Answer 1: (score: 4)

The dev version of tidyr now has an elegant way to do this, because the expand() function now supports nesting and crossing:

library(dplyr)
library(tidyr)  # expand() is provided by tidyr

mydata <- data_frame(
  name = c("john","john","john","john","mary","chris","chris","chris"),
  job = c("teacher","teacher","teacher","teacher","police","lawyer","lawyer","doctor"),
  sex = c("male","male","male","male","female","female","male","male"),
  from = c("NYT","USAT","BG","TIME","USAT","BG","NYT","NYT"),
  score = rnorm(8)
)

mydata %>% 
  expand(c(name, job, sex), from) %>% 
  left_join(mydata)

#> Joining by: c("name", "job", "sex", "from")
#> Source: local data frame [20 x 5]
#> 
#>     name     job    sex from      score
#> 1  chris  doctor   male   BG         NA
#> 2  chris  doctor   male  NYT  0.5448206
#> 3  chris  doctor   male TIME         NA
#> 4  chris  doctor   male USAT         NA
#> 5  chris  lawyer female   BG  1.2015173
#> 6  chris  lawyer female  NYT         NA
#> 7  chris  lawyer female TIME         NA
#> 8  chris  lawyer female USAT         NA
#> 9  chris  lawyer   male   BG         NA
#> 10 chris  lawyer   male  NYT -1.0930237
#> 11 chris  lawyer   male TIME         NA
#> 12 chris  lawyer   male USAT         NA
#> 13  john teacher   male   BG  1.1345461
#> 14  john teacher   male  NYT  1.3032946
#> 15  john teacher   male TIME  2.4901830
#> 16  john teacher   male USAT -1.6449096
#> 17  mary  police female   BG         NA
#> 18  mary  police female  NYT         NA
#> 19  mary  police female TIME         NA
#> 20  mary  police female USAT -0.2443080
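
The c(name, job, sex) syntax above reflects the development version of tidyr at the time. In released versions of tidyr, nesting is expressed with nesting(), and complete() combines the expand and join steps. The following is a minimal sketch under that assumption, using a plain data.frame and an illustrative result name (completed); it is not taken from the original answer.

```r
library(tidyr)

set.seed(1)
mydata <- data.frame(
  name  = c("john","john","john","john","mary","chris","chris","chris"),
  job   = c("teacher","teacher","teacher","teacher","police","lawyer","lawyer","doctor"),
  sex   = c("male","male","male","male","female","female","male","male"),
  from  = c("NYT","USAT","BG","TIME","USAT","BG","NYT","NYT"),
  score = rnorm(8),
  stringsAsFactors = FALSE
)

# nesting() keeps only the (name, job, sex) combinations present in the data;
# each of them is then crossed with every distinct value of `from`.
# Rows that did not exist before are filled with NA in `score`.
completed <- complete(mydata, nesting(name, job, sex), from)
```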

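Answer 2: (score: 0)

One possibility is to paste name, job, and sex together, get the unique values, and then do a CJ with the unique values of from. After that, split the pasted column back into three columns using cSplit from library(splitstackshape), rename those columns with setnames, and join with mydata after setting the key.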
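
The code for this answer is not included here. The steps it describes can be sketched roughly as follows. Note two assumptions: the sketch swaps in data.table's tstrsplit() for cSplit() so that it depends on data.table only, and it assumes the separator "_" never occurs inside name, job, or sex. The names id, grid, and result are illustrative.

```r
library(data.table)

set.seed(1)
mydata <- data.table(
  name  = c("john","john","john","john","mary","chris","chris","chris"),
  job   = c("teacher","teacher","teacher","teacher","police","lawyer","lawyer","doctor"),
  sex   = c("male","male","male","male","female","female","male","male"),
  from  = c("NYT","USAT","BG","TIME","USAT","BG","NYT","NYT"),
  score = rnorm(8)
)

# Paste the three ID columns into a single string and deduplicate
id <- unique(mydata[, paste(name, job, sex, sep = "_")])

# Cross join the pasted IDs with the distinct values of `from`
grid <- CJ(id = id, from = unique(mydata$from))

# Split the pasted ID back into its three components (tstrsplit, not cSplit)
grid[, c("name", "job", "sex") := tstrsplit(id, "_", fixed = TRUE)]

# Key the data on all four columns, then join the grid against that key;
# the columns of i are matched to the key columns by position
setkey(mydata, name, job, sex, from)
result <- mydata[grid[, .(name, job, sex, from)]]
```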