R: Filter two data.frames by two columns and by group to find duplicate values

Asked: 2014-09-30 22:43:36

Tags: r filter compare subset

I have a data.frame, dat (created as `data` in the code below), that stores my regular data. Groups are defined by ID.

data <- structure(list(NAME = structure(c(1L, 1L, 2L), .Label = c("NAME1", "NAME2"), class = "factor"), ID = c(23L, 23L, 57L), REF_YEAR = c(1920L, 1938L, 1869L), SURV_YEAR = c(1938L, 1962L, 1872L), VALUE = c(20L, 40L, 34L)), .Names = c("NAME", "ID", "REF_YEAR", "SURV_YEAR","VALUE"), class = "data.frame", row.names = c(NA, -3L))

  NAME  ID REF_YEAR SURV_YEAR VALUE
1 NAME1 23     1920      1938    20
2 NAME1 23     1938      1962    40
3 NAME2 57     1869      1872    34

I have a second data.frame, dat_q, that I want to compare with dat:

dat_q <- structure(list(NAME = structure(1:2, .Label = c("NAME1", "NAME2"), class = "factor"), ID = c(23L, 57L), REF_YEAR = c(1934L, 1866L), SURV_YEAR = c(1938L, 1868L), VALUE = structure(1:2, .Label = c("A", "B"), class = "factor")), .Names = c("NAME", "ID", "REF_YEAR", "SURV_YEAR", "VALUE"), class = "data.frame", row.names = c(NA, -2L))

  NAME  ID REF_YEAR SURV_YEAR VALUE
1 NAME1 23     1934      1938     A
2 NAME2 57     1866      1868     B

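My question: how can I remove all rows of dat_q whose value in REF_YEAR or SURV_YEAR also occurs in the same column of dat (1938 in the example data)? This should be applied by group (defined by ID), not across the whole data.frame.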

In the end, with my sample data, this would be the result of filtering dat_q:

  NAME  ID REF_YEAR SURV_YEAR VALUE
2 NAME2 57     1866      1868     B

Edit

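Here is some additional example data for which the code provided by @thelatemail does not work, and I can't figure out why. The row in dat_q should be filtered out, because it contains exactly the same values as dat.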

data <- structure(list(NAME = structure(c(1L, 1L, 1L), .Label = "NAME1", class = "factor"), ID = c(226L, 226L, 226L), SURV_YEAR = c(2009L, 2010L, 2012L), REF_YEAR = c(2008L, 2009L, 2011L), VALUE = c(-7L, -37L,  -51L)), .Names = c("NAME", "ID", "SURV_YEAR", "REF_YEAR", "VALUE"), class = "data.frame", row.names = c(NA, -3L))

   NAME  ID SURV_YEAR REF_YEAR VALUE
1 NAME1 226      2009     2008    -7
2 NAME1 226      2010     2009   -37
3 NAME1 226      2012     2011   -51

dat_q <- structure(list(NAME = structure(1L, .Label = "NAME1", class = "factor"), ID = 226L, REF_YEAR = 2010L, SURV_YEAR = 2011L, VALUE = structure(1L, .Label = "-X", class = "factor")), .Names = c("NAME", "ID", "REF_YEAR", "SURV_YEAR", "VALUE"), class = "data.frame", row.names = c(NA, -1L))

  NAME   ID REF_YEAR SURV_YEAR VALUE
1 NAME1 226     2010      2011    -X

1 answer:

Answer 0 (score: 5)

I like `by` in base R for working out the logic of this kind of problem. This works, but may be a bit slow:

do.call(rbind,by(
  dat_q,
  dat_q$ID,
  function(x) {
    # x is one ID group of dat_q; use its first ID so the comparison is not recycled
    subdata <- data[data$ID == x$ID[1], ]
    x[!(x$REF_YEAR %in% subdata$REF_YEAR | x$SURV_YEAR %in% subdata$SURV_YEAR),]
  }
))

#    NAME ID REF_YEAR SURV_YEAR VALUE
#57 NAME2 57     1866      1868     B
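
The same per-group logic can also be sketched without `by`, by pasting ID and year into composite keys and comparing those with `%in%`. This is only a minimal base R sketch, not part of the original answer. The data frames are rebuilt from the question's first sample with `data.frame()` (plain character columns instead of factors, an assumption made for brevity), and `ref_hit`, `surv_hit` and `result` are illustrative names.

```r
# Sample data rebuilt from the question (character columns instead of factors)
data <- data.frame(
  NAME = c("NAME1", "NAME1", "NAME2"),
  ID = c(23L, 23L, 57L),
  REF_YEAR = c(1920L, 1938L, 1869L),
  SURV_YEAR = c(1938L, 1962L, 1872L),
  VALUE = c(20L, 40L, 34L)
)
dat_q <- data.frame(
  NAME = c("NAME1", "NAME2"),
  ID = c(23L, 57L),
  REF_YEAR = c(1934L, 1866L),
  SURV_YEAR = c(1938L, 1868L),
  VALUE = c("A", "B")
)

# A row of dat_q is a hit if its (ID, year) pair also occurs in data,
# checked separately for each of the two year columns
ref_hit  <- paste(dat_q$ID, dat_q$REF_YEAR)  %in% paste(data$ID, data$REF_YEAR)
surv_hit <- paste(dat_q$ID, dat_q$SURV_YEAR) %in% paste(data$ID, data$SURV_YEAR)

# Keep only the rows with no hit in either column
result <- dat_q[!(ref_hit | surv_hit), ]
result
```

Because the ID is part of the key, a year that only occurs under a different ID in `data` does not count as a match, which is what grouping by ID is meant to achieve.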

A data.table solution following the same logic may be faster:

library(data.table)
setDT(dat_q)
setDT(data)
dat_q[
     ,
     .SD[!(REF_YEAR   %in% data$REF_YEAR[data[,ID==.BY]] | 
           SURV_YEAR  %in% data$SURV_YEAR[data[,ID==.BY]])],
     by=ID
]

#   ID  NAME REF_YEAR SURV_YEAR VALUE
#1: 57 NAME2     1866      1868     B

With data.table, I think you could also do it this way. After converting both to data.tables:

# using 1.9.3+, just remove `by=.EACHI` if you're using <= 1.9.2
setkey(data, ID)
setkey(dat_q, ID)

idx = data[dat_q, any(c(i.REF_YEAR, i.SURV_YEAR) %in% c(REF_YEAR, SURV_YEAR)), by=.EACHI]$V1
dat_q[!idx]
#     NAME ID REF_YEAR SURV_YEAR VALUE
# 1: NAME2 57     1866      1868     B

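We perform a join on the key column and, for each row of dat_q, evaluate the expression in j on the matching rows of data. This gives us the logical values needed to index/subset dat_q afterwards.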
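
Note that the expression in this join pools REF_YEAR and SURV_YEAR together, whereas the first two snippets compare each column only with the column of the same name. That difference seems to explain the second example in the question: 2010 and 2011 do occur in data, but in the opposite columns, so a column-wise comparison keeps the row while a pooled one removes it. A base R sketch under that reading (data rebuilt with `data.frame()`; `columnwise`, `pooled` and `years_in_group` are illustrative names, not from the original):

```r
# Second sample from the question, rebuilt with data.frame()
data <- data.frame(
  NAME = "NAME1",
  ID = 226L,
  SURV_YEAR = c(2009L, 2010L, 2012L),
  REF_YEAR = c(2008L, 2009L, 2011L),
  VALUE = c(-7L, -37L, -51L)
)
dat_q <- data.frame(NAME = "NAME1", ID = 226L, REF_YEAR = 2010L,
                    SURV_YEAR = 2011L, VALUE = "-X")

# Column-wise: each year is looked up only in the column of the same name
columnwise <- paste(dat_q$ID, dat_q$REF_YEAR) %in% paste(data$ID, data$REF_YEAR) |
  paste(dat_q$ID, dat_q$SURV_YEAR) %in% paste(data$ID, data$SURV_YEAR)

# Pooled: each year is looked up in both year columns of the same ID
years_in_group <- c(paste(data$ID, data$REF_YEAR), paste(data$ID, data$SURV_YEAR))
pooled <- paste(dat_q$ID, dat_q$REF_YEAR) %in% years_in_group |
  paste(dat_q$ID, dat_q$SURV_YEAR) %in% years_in_group

dat_q[!columnwise, ]  # the row is kept
dat_q[!pooled, ]      # the row is removed
```

Which of the two behaviours is wanted depends on whether a year should count as a duplicate regardless of the column it appears in; the question's wording asks for the column-wise version, while its second example expects the pooled one.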