R: compare two data.tables row by row

Time: 2019-07-11 16:27:28

Tags: r data.table

I have two data.tables that I want to compare,

but the comparison produces warnings and I don't know why.

library(data.table)

DT1 <- data.table(ID=c("F","A","E","B","C","D","C"),
                  num=c(59,3,108,11,22,54,241),
                  value=c(90,47,189,38,42,86,280),
                  Mark=c("Mary","Tom","Abner","Norman","Joanne",
                  "Bonnie","Trista"))

DT2 <- data.table(Mark=c("Mary","Abner","Bonnie","Trista","Norman"),
                  numA=c(48,20,88,237,20),
                  numB=c(60,326,54,268,89),
                  valueA=c(78,34,78,270,59),
                  valueB=c(90,190,90,385,75))

DToutput <- DT1[(num > DT2$numA & num < DT2$numB &
                value > DT2$valueA & value < DT2$valueB)]

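My goal

I want to look up num and value in DT1 by Mark, and check them against the ranges in DT2 given by numA/numB and valueA/valueB.

For example

For row F in DT1, num = 59 and value = 90, and Mark = "Mary". So when matching by Mark = "Mary", the following must also hold: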

num(59) > DT2$numA(48) & num(59) < DT2$numB(60) & value(90) > DT2$valueA(78) & value(90) < DT2$valueB(90)

You can see that 90 < 90 is false, so my result should not contain row F.

I get these warnings:

Warning messages:
 1: In num > DT2$numA : longer object length is not a multiple of shorter object length
 2: In num < DT2$numB : longer object length is not a multiple of shorter object length
 3: In value > DT2$valueA : longer object length is not a multiple of shorter object length
 4: In value < DT2$valueB : longer object length is not a multiple of shorter object length

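How can I modify this to do what I want?

Thanks

Added: DT2 can contain the same Mark more than once, but with different value ranges. Does this affect the comparison?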

2 Answers:

Answer 0: (score: 3)

Another option using a non-equi inner join:

DT2[DT1, on=.(Mark=Mark, numA<num, numB>num, valueA<value, valueB>value), nomatch=0L, 
    .(ID, num, value, Mark)]

Or:

DT1[DT2, on=.(Mark, num>numA, num<numB, value>valueA, value<valueB), nomatch=0L, 
    .(ID, num=x.num, value=x.value, Mark)]

Output:

   ID num value   Mark
1:  E 108   189  Abner
2:  C 241   280 Trista
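On the added question about repeated Mark values in DT2: the join evaluates every DT2 row on its own, so several ranges per Mark should be fine, and a DT1 row is returned once for each range it falls into. A minimal sketch, assuming two extra Trista rows that are hypothetical and not part of the question's data (DT2b and res are names introduced here):

```r
library(data.table)

DT1 <- data.table(ID = c("F","A","E","B","C","D","C"),
                  num = c(59,3,108,11,22,54,241),
                  value = c(90,47,189,38,42,86,280),
                  Mark = c("Mary","Tom","Abner","Norman","Joanne","Bonnie","Trista"))
DT2 <- data.table(Mark = c("Mary","Abner","Bonnie","Trista","Norman"),
                  numA = c(48,20,88,237,20),
                  numB = c(60,326,54,268,89),
                  valueA = c(78,34,78,270,59),
                  valueB = c(90,190,90,385,75))

# Hypothetical extra ranges for "Trista": the first matches nothing,
# the second also contains row C (num = 241, value = 280).
DT2b <- rbind(DT2,
              data.table(Mark = c("Trista","Trista"),
                         numA = c(0, 200), numB = c(10, 300),
                         valueA = c(0, 250), valueB = c(10, 300)))

res <- DT1[DT2b, on = .(Mark, num > numA, num < numB, value > valueA, value < valueB),
           nomatch = 0L, .(ID, num = x.num, value = x.value, Mark)]

res          # row C appears once per matching Trista range
unique(res)  # collapse to one row per matching DT1 row
```

If only the matching DT1 rows matter and not which range matched, wrapping the result in unique() is one way to drop the repeats.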

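Answer 1: (score: 0)

Is this roughly what you are looking for? I joined the data.tables on Mark and then filtered with between according to your conditions. If this is not what you are looking for, could you post a data.table of the expected output?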

library(data.table)

DT1[DT2, on = "Mark"][between(num, numA, numB, incbounds = FALSE) &
                      between(value, valueA, valueB, incbounds = FALSE)]

   ID num value   Mark numA numB valueA valueB
1:  E 108   189  Abner   20  326     34    190
2:  C 241   280 Trista  237  268    270    385
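Joining first is also what makes the warnings from the question go away. In the original filter, DT1$num (7 values) is compared with DT2$numA (5 values), so R recycles the shorter vector and warns because 7 is not a multiple of 5; the rows being compared are not matched by Mark at all. A minimal sketch of that behaviour, reusing the question's data (recycled and joined are names introduced here):

```r
library(data.table)

DT1 <- data.table(ID = c("F","A","E","B","C","D","C"),
                  num = c(59,3,108,11,22,54,241),
                  value = c(90,47,189,38,42,86,280),
                  Mark = c("Mary","Tom","Abner","Norman","Joanne","Bonnie","Trista"))
DT2 <- data.table(Mark = c("Mary","Abner","Bonnie","Trista","Norman"),
                  numA = c(48,20,88,237,20),
                  numB = c(60,326,54,268,89),
                  valueA = c(78,34,78,270,59),
                  valueB = c(90,190,90,385,75))

# The original comparison: numA is silently repeated to length 7,
# so num[6] is compared with numA[1], num[7] with numA[2].
recycled <- suppressWarnings(DT1$num > DT2$numA)
identical(recycled, DT1$num > rep(DT2$numA, length.out = nrow(DT1)))

# After the join, num and numA are columns of the same table,
# aligned row by row on Mark and of equal length.
joined <- DT1[DT2, on = "Mark"]
length(joined$num) == length(joined$numA)
```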

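Edit: A benchmark comparing this approach with @Chinsoon12's non-equi inner join shows that the non-equi inner join is considerably faster, roughly three times faster by median on the enlarged data below. It is not a perfect benchmark (it just repeats the data.tables), but I still think it clearly shows that the non-equi inner join is much more efficient.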

Unit: milliseconds
           expr      min       lq      mean    median       uq      max neval
        between 233.6378 265.4323 303.14039 301.82455 334.3225 373.2760    10
 non_equi_inner  71.6925  74.1547  96.96584  91.14375  97.6664 179.9907    10

Benchmark code:

DT1 <- data.table(sapply(DT1, rep, 1e3))[, c("num", "value") := lapply(.SD, as.integer), .SDcols = c("num", "value")]
DT2 <- data.table(sapply(DT2, rep, 1e3))[, c("numA", "numB", "valueA", "valueB") := lapply(.SD, as.integer), .SDcols = c("numA", "numB", "valueA", "valueB")]

microbenchmark::microbenchmark(
  between = {
    DT1[DT2, on = "Mark", allow.cartesian = TRUE][
      between(num, numA, numB, incbounds = FALSE) &
        between(value, valueA, valueB, incbounds = FALSE)]
  },
  non_equi_inner = {
    DT1[DT2, on = .(Mark, num > numA, num < numB, value > valueA, value < valueB), nomatch = 0L,
        .(ID, num = x.num, value = x.value, Mark), allow.cartesian = TRUE]
  },
  times = 10
)
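Before reading too much into the timings, it may be worth confirming that the two expressions select the same rows. A minimal sketch on the question's original (unrepeated) data; res_between and res_join are names introduced here, and the extra column selection on the between result is only there to make the two outputs comparable:

```r
library(data.table)

DT1 <- data.table(ID = c("F","A","E","B","C","D","C"),
                  num = c(59,3,108,11,22,54,241),
                  value = c(90,47,189,38,42,86,280),
                  Mark = c("Mary","Tom","Abner","Norman","Joanne","Bonnie","Trista"))
DT2 <- data.table(Mark = c("Mary","Abner","Bonnie","Trista","Norman"),
                  numA = c(48,20,88,237,20),
                  numB = c(60,326,54,268,89),
                  valueA = c(78,34,78,270,59),
                  valueB = c(90,190,90,385,75))

# Approach 1: equi-join on Mark, then filter with open intervals.
res_between <- DT1[DT2, on = "Mark", nomatch = 0L][
  between(num, numA, numB, incbounds = FALSE) &
    between(value, valueA, valueB, incbounds = FALSE),
  .(ID, num, value, Mark)]

# Approach 2: non-equi inner join with the same strict inequalities.
res_join <- DT1[DT2, on = .(Mark, num > numA, num < numB, value > valueA, value < valueB),
                nomatch = 0L, .(ID, num = x.num, value = x.value, Mark)]

# Sort both the same way so row order cannot cause a spurious mismatch.
setorder(res_between, ID, num)
setorder(res_join, ID, num)
all.equal(res_between, res_join, check.attributes = FALSE)
```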