Filter records in one data frame based on another by matching multiple conditions

Time: 2019-12-16 18:30:53

Tags: r dplyr

I have the following two data frames, dat1 and dat2:

library(tidyverse)
dat1 <- tribble(
  ~"subj", ~"drive", ~"measure",
  "A", 1, 1,
  "A", 1, 2,
  "A", 1, 3,
  "A", 1, 4,
  "A", 1, 5,
  "A", 2, 1,
  "A", 2, 2,
  "A", 2, 3,
  "A", 2, 4,
  "A", 2, 5,
  "B", 1, 1,
  "B", 1, 2,
  "B", 1, 3,
  "B", 1, 4,
  "B", 1, 5,
  "B", 2, 1,
  "B", 2, 2,
  "B", 2, 3,
  "B", 2, 4,
  "B", 2, 5,
)

dat2 <- tribble(
  ~"subj", ~"drive", ~"measure",
  "A", 1, 3,
  "B", 2, 4
)

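I am trying to filter the records in dat1 based on the following conditions:

  • the subj and drive columns of dat1 should match the subj and drive columns of dat2, and
  • the measure values in dat1 should fall within a range around the measure values in dat2.

In this example, the range is one unit on either side. So my resulting data frame would look like this:

result <- tribble(
  ~"subj", ~"drive", ~"measure",
  "A", 1, 2,
  "A", 1, 3,
  "A", 1, 4,
  "B", 2, 3,
  "B", 2, 4,
  "B", 2, 5
)

I am aware of dplyr::semi_join(), but it does not let me filter based on a range. Any ideas on how to solve this? A dplyr-based solution would be great!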

3 answers:

Answer 0 (score: 4)

Edited to use sqldf's native string substitution, as mentioned in GG's comment, instead of sprintf.

library(sqldf)

check_range <- 1

fn$sqldf('
select  one.*
from    dat1 one
        join dat2 two
          on  one.subj = two.subj
              and one.drive = two.drive
              and one.measure - two.measure between -`check_range` and `check_range`
')
#   subj drive measure
# 1    A     1       2
# 2    A     1       3
# 3    A     1       4
# 4    B     2       3
# 5    B     2       4
# 6    B     2       5

Answer 1 (score: 3)

One option is to do an inner_join first and then filter with between:

library(dplyr)
inner_join(dat1, dat2, by = c('subj', 'drive')) %>% 
    group_by(subj, drive) %>% 
    filter(between(measure.x, first(measure.y)-1, first(measure.y) + 1)) %>% 
    select(measure = measure.x)
# A tibble: 6 x 3
# Groups:   subj, drive [2]
#  subj  drive measure
#  <chr> <dbl>   <dbl>
#1 A         1       2
#2 A         1       3
#3 A         1       4
#4 B         2       3
#5 B         2       4
#6 B         2       5
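
The version above takes first(measure.y), so it relies on dat2 holding a single reference row per subj/drive pair. As a rough sketch of a variant, the range can be kept in a variable and every joined pair compared directly, which should also cope with several reference rows per key. The names rng and measure.ref are illustrative and not part of the original answer; the data are rebuilt here only so that the block runs on its own:

```r
library(dplyr)

# Same data as in the question, rebuilt compactly so the block is self-contained
dat1 <- tibble(
  subj    = rep(c("A", "B"), each = 10),
  drive   = rep(rep(c(1, 2), each = 5), times = 2),
  measure = rep(as.numeric(1:5), times = 4)
)
dat2 <- tibble(subj = c("A", "B"), drive = c(1, 2), measure = c(3, 4))

rng <- 1  # assumed half-width of the allowed range (hypothetical name)

result <- dat1 %>%
  # dat1's measure keeps its name; the reference value from dat2 becomes measure.ref
  inner_join(dat2, by = c("subj", "drive"), suffix = c("", ".ref")) %>%
  # keep rows whose measure is within rng of the reference measure
  filter(abs(measure - measure.ref) <= rng) %>%
  select(-measure.ref)
```

If a dat1 row falls within range of two different reference rows it will appear twice in result; adding distinct() at the end would collapse such repeats, at the cost of also collapsing rows that were already duplicated in dat1.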

Or with data.table:

library(data.table)
setDT(dat1)[setDT(dat2), .SD[between(measure, i.measure -1,
          i.measure + 1)], on = .(subj, drive), by = .EACHI]
#    subj drive measure
#1:    A     1       2
#2:    A     1       3
#3:    A     1       4
#4:    B     2       3
#5:    B     2       4
#6:    B     2       5

Answer 2 (score: 1)

For the sake of completeness, here is also a solution using a non-equi join:

library(data.table)
range <- 1
idx <- setDT(dat1)[
  setDT(dat2)[, .(subj, drive, lower = measure - range, upper = measure + range)], 
  on = .(subj, drive, measure >= lower, measure <= upper), which = TRUE]
dat1[idx]
#    subj drive measure
# 1:    A     1       2
# 2:    A     1       3
# 3:    A     1       4
# 4:    B     2       3
# 5:    B     2       4
# 6:    B     2       5
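
Since the question asked about dplyr::semi_join(), the same non-equi idea can probably be expressed in dplyr as well. This is only a sketch and assumes dplyr 1.1.0 or later, where join_by() and its between() helper are available; the names rng and bounds are illustrative, and the data are rebuilt so the block runs on its own:

```r
library(dplyr)  # join_by() requires dplyr 1.1.0 or later (assumption)

# Same data as in the question, rebuilt compactly so the block is self-contained
dat1 <- tibble(
  subj    = rep(c("A", "B"), each = 10),
  drive   = rep(rep(c(1, 2), each = 5), times = 2),
  measure = rep(as.numeric(1:5), times = 4)
)
dat2 <- tibble(subj = c("A", "B"), drive = c(1, 2), measure = c(3, 4))

rng <- 1  # assumed half-width of the allowed range (hypothetical name)

# Turn each reference measure into explicit lower/upper bounds
bounds <- dat2 %>%
  mutate(lower = measure - rng, upper = measure + rng) %>%
  select(-measure)

# Keep the dat1 rows with matching subj/drive whose measure lies in [lower, upper]
result <- semi_join(
  dat1, bounds,
  by = join_by(subj, drive, between(measure, lower, upper))
)
```

Because semi_join() only filters dat1, result keeps dat1's columns and row order, and a row matching several reference rows is still returned once.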