Staggered join / regroup / join with data.table's X[Y] syntax in R

Asked: 2015-04-23 13:45:37

Tags: r data.table

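I have two data.tables, samples and resources.

resources is related to samples through a primary and a secondary ID. I want to first combine the information from resources with the samples table via the primary ID, and only where that yields NA do I want to fall back to the secondary ID from the same table (all in one data.table command chain).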

# resources:
   primary secondary info
1:      17        42  "I"
2:      18        NA  "J"
3:      19        43  "K"

# samples:
   name primary secondary
1:  "a"      17        55
2:  "b"       0        42
3:  "c"      18        42

The desired result is:

# joined tables:
   name info  # primary secondary
1:  "a"  "I"
2:  "b"  "I"
3:  "c"  "J"

The first join via primary is easy; it yields:

# Update:
samples <- data.table(name = letters[1:3], 
                      primary = c(17, 0, 18), 
                      secondary = c(55, 42, 42))
resources <- data.table(primary = 17:19, 
                        secondary = c(42, NA, 43), 
                        info = LETTERS[9:11])
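# first join:
setkey(samples, primary)
setkey(resources, primary)
# X[Y] returns one row per row of Y, so put samples in the i position
# to get one row per sample (info is NA where primary has no match)
resources[samples]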

   name info  # primary secondary
1:  "a"  "I"
2:  "b"   NA
3:  "c"  "J"

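But what then? I would need to re-key samples with setkey(samples, secondary), right? And then restrict the subset to only those rows that produced NA. But none of this seems possible in a single command chain (and imagine there were more than two criteria...). How can I achieve this more concisely?

...updated with code for the data.tables.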

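3 answers:

Answer 0 (score: 5)

While you could do this in one line, I think that would obscure the meaning of what you are doing, make things very hard to read, understand, debug, or remember a month from now, and is simply a bad idea.

Smaller, more digestible chunks are the way to go, imo: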

setkey(samples, primary)
setkey(resources, primary)
samples[resources, info := i.info]

setkey(samples, secondary)
setkey(resources, secondary)
samples[resources, info := ifelse(is.na(info), i.info, info)]

samples
#   name primary secondary info
#1:    b       0        42    I
#2:    c      18        42    J
#3:    a      17        55    I

# keep going with tertiary and so on if you like

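As @nachti pointed out in the comments, you may need to add allow.cartesian=TRUE for versions before 1.9.5, depending on your data.

Answer 1 (score: 2)

This would be a chain with two calls to resources, one of which is re-keyed on the fly.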
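
The "keep going with tertiary" idea can be sketched as a loop over the ID columns. This is only a minimal sketch and not part of the original answer: the helper name fill_staggered and its arguments are hypothetical, and it assumes a data.table version that supports the on= argument, so no setkey calls are needed.

```r
library(data.table)

samples <- data.table(name = letters[1:3],
                      primary = c(17, 0, 18),
                      secondary = c(55, 42, 42))
resources <- data.table(primary = 17:19,
                        secondary = c(42, NA, 43),
                        info = LETTERS[9:11])

# Hypothetical helper: fill `info` in `x` from `lookup`, trying each
# ID column in turn and only overwriting rows whose info is still NA.
fill_staggered <- function(x, lookup, id_cols) {
  x <- copy(x)                     # leave the caller's table untouched
  x[, info := NA_character_]       # start with nothing resolved
  for (id_col in id_cols) {
    # update join on the current ID column; resolved rows keep their value
    x[lookup, on = id_col, info := ifelse(is.na(info), i.info, info)]
  }
  x[]
}

result <- fill_staggered(samples, resources, c("primary", "secondary"))
result
```

With the sample data, info ends up as I, I, J for a, b, c. Adding a third criterion would only mean appending another column name to id_cols.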

library(data.table)
samples <- data.table(name = letters[1:3], 
                      primary = c(17, 0, 18), 
                      secondary = c(55, 42, 42))
resources <- data.table(primary = 17:19, 
                        secondary = c(42, NA, 43), 
                        info = LETTERS[9:11])
setkey(samples, primary)
setkey(resources, primary)
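# 1) fill info via primary, 2) re-key the result by secondary (keyby),
# 3) fill the remaining NAs from resources re-keyed by secondary, 4) drop the key column
samples[resources, info := i.info
        ][, .(name, info), keyby = secondary
          ][resources[, info, keyby = secondary], info := ifelse(is.na(info), i.info, info)
            ][, secondary := NULL]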

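Since you asked about more complex cases: it is worth noting that data.table queries are easy to manage if you prepare the sub-query arguments in advance as modules. They can then be applied conditionally later on. See the example below.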

# quoted sub-query and update expression, prepared ahead of time
lkp2 <- quote(resources[, info, keyby = secondary])
lkp2_formula <- quote(info := ifelse(is.na(info), i.info, info))
setkey(samples, primary)
samples[resources, info := i.info
        ][, .(name, info), keyby = secondary
          ][eval(lkp2), eval(lkp2_formula)
            ][, secondary := NULL]
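
To illustrate the "conditionally managed" part, here is a small sketch under stated assumptions: the switch use_secondary and the names first_fill and later_fill are made up for this example, and the joins use on= instead of keys. It relies on the same eval() of a quoted expression in j that the code above uses.

```r
library(data.table)

samples <- data.table(name = letters[1:3],
                      primary = c(17, 0, 18),
                      secondary = c(55, 42, 42))
resources <- data.table(primary = 17:19,
                        secondary = c(42, NA, 43),
                        info = LETTERS[9:11])

# Quoted building blocks, prepared up front (illustrative names)
first_fill <- quote(info := i.info)
later_fill <- quote(info := ifelse(is.na(info), i.info, info))

use_secondary <- TRUE   # hypothetical switch, e.g. taken from a config value

out <- copy(samples)    # work on a copy so samples stays unchanged
out[resources, on = "primary", eval(first_fill)]
if (use_secondary) {
  # only runs the fallback lookup when it was requested
  out[resources, on = "secondary", eval(later_fill)]
}
out
```

Keeping the expressions as quoted objects means the fallback step can be skipped, reordered, or repeated for further ID columns without rewriting the chain.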

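If you rely heavily on data.table chaining, you may find the dtq package useful.

Answer 2 (score: 1)

I think doing this in a single command chain is too tricky, but here is a solution for you: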

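### First step
## assumes both tables are still keyed on primary, as in the question
## i.info makes explicit that the value comes from the joined table
samples[resources[samples, nomatch = 0], info := i.info]
samples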

   name primary secondary info
1:    b       0        42   NA
2:    a      17        55    I
3:    c      18        42    J

### Second step
setkey(samples, secondary)
setkey(resources, secondary)
## create new column info1
samples[resources[samples[is.na(info)],
                  list(info1 = unique(info)), by = .EACHI],
        info1 := info1]
## merge it to samples, where info is NA
samples[is.na(info), info := info1]
## remove info1 (and maybe other unused columns)
samples[, info1 := NULL]
## sort samples by name
setkey(samples, name)
samples

   name primary secondary info
1:    a      17        55    I
2:    b       0        42    I
3:    c      18        42    J

HTH