R: comparing values in one data frame against values in another data frame

Time: 2018-11-13 16:12:31

Tags: r dataframe comparison

I'm fairly new to R, so I'm still learning a lot. I've been searching, but I couldn't find an answer that fits my problem. I have these two data sets:

d1
    Criteria Order Low High
1        a     1   0   10
2        a     1  11   20
3        a     1  21   30
4        b     1   0   13
5        b     1  14   32
6        a     2   5   22
7        a     2   0    4
8        b     2   0   18

and then d2

 Criteria Order Final
1        a     1    13
2        b     2    12
3        a     1     8
4        a     2     2 

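I was wondering whether there is a way to add an extra column to d1 that holds d2$Final whenever d2$Final falls between d1$Low and d1$High and both Criteria and Order match. What I expect to get is something like this: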

 Criteria Order Low High Final
1        a     1   0   10     8
2        a     1  11   20    13
3        a     1  21   30    NA
4        b     1   0   13    NA
5        b     1  14   32    NA
6        a     2   5   22    NA
7        a     2   0    4     2
8        b     2   0   18    12  

A 1/0 (true/false) indicator in the Final column instead of the number would also be fine.

Thanks in advance.

2 answers:

Answer 0: (score: 2)

This uses SQL to create a complex join. The [...] around Order is needed to distinguish the column from the SQL keyword of the same name.

library(sqldf)

sqldf("select d1.*, d2.Final
  from d1 
  left join d2 on d1.Criteria = d2.Criteria and
                  d1.[Order] = d2.[Order] and
                  d2.Final between d1.Low and d1.High")

giving the same output shown in the question:

  Criteria Order Low High Final
1        a     1   0   10     8
2        a     1  11   20    13
3        a     1  21   30    NA
4        b     1   0   13    NA
5        b     1  14   32    NA
6        a     2   5   22    NA
7        a     2   0    4     2
8        b     2   0   18    12

Note

The data in reproducible form:

Lines1 <- "
    Criteria Order Low High
1        a     1   0   10
2        a     1  11   20
3        a     1  21   30
4        b     1   0   13
5        b     1  14   32
6        a     2   5   22
7        a     2   0    4
8        b     2   0   18"

Lines2 <- "
  Criteria Order Final
1        a     1    13
2        b     2    12
3        a     1     8
4        a     2     2"

d1 <- read.table(text = Lines1)
d2 <- read.table(text = Lines2)
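The read.table() calls above pass no header argument. As a small illustration (repeating Lines2 so the snippet is self-contained): when header is not supplied, read.table() treats the first line as a header if it has one fewer field than the remaining lines, and then uses the first column as row names.

```r
Lines2 <- "
  Criteria Order Final
1        a     1    13
2        b     2    12
3        a     1     8
4        a     2     2"

# No header = TRUE needed: the first non-blank line has 3 fields and the
# data lines have 4, so it is taken as the header and column 1 becomes
# the row names.
d2 <- read.table(text = Lines2)

names(d2)
str(d2)
```

If the input had no leading row-number column, header = TRUE would have to be passed explicitly.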

Answer 1: (score: 1)

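If your data is "large", this solution will not work for you: the cartesian join will explode beyond what a "standard" computer can tolerate in terms of memory.

However, if your data is small enough (which is very relative), you can do a cartesian-style join (also described as a full or full outer join) that pairs every d1 row with every d2 row sharing the same Criteria and Order, and then filter the result. (This solution implements one section of https://www.mango-solutions.com/blog/in-between-a-rock-and-a-conditional-join. Other sections there discuss SQL and fuzzyjoin, both of which are worthy candidates.)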
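One way to judge whether the data is "small enough" is to count the rows the join would produce before running it. The following is a minimal base R sketch, not part of the original answer; the names key1, key2 and matched_rows are made up for illustration, and it assumes the join keys are Criteria and Order.

```r
d1 <- data.frame(
  Criteria = c("a", "a", "a", "b", "b", "a", "a", "b"),
  Order    = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
  Low      = c(0L, 11L, 21L, 0L, 14L, 5L, 0L, 0L),
  High     = c(10L, 20L, 30L, 13L, 32L, 22L, 4L, 18L)
)
d2 <- data.frame(
  Criteria = c("a", "b", "a", "a"),
  Order    = c(1L, 2L, 1L, 2L),
  Final    = c(13L, 12L, 8L, 2L)
)

# Count rows per (Criteria, Order) key on each side
key1 <- paste(d1$Criteria, d1$Order)
key2 <- paste(d2$Criteria, d2$Order)
n1 <- table(key1)
n2 <- table(key2)

# Each shared key contributes (rows in d1) * (rows in d2) joined rows
shared <- intersect(names(n1), names(n2))
matched_rows <- sum(as.numeric(n1[shared]) * as.numeric(n2[shared]))
matched_rows
```

For the sample data this gives 9, the same number of rows as merge(d1, d2, by = c("Criteria", "Order")). With many rows per key on both sides the product grows quickly, which is the memory concern described above.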

Three dialects, depending on your preference.

Base R

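# Pair every d2 row with the d1 rows that share Criteria and Order
a <- merge(d2, d1, all.x = TRUE)
# Blank out Final wherever it falls outside [Low, High]
a <- transform(a, Final = ifelse(Low <= Final & Final <= High, Final, NA))
# Drop exact duplicate rows
a[!duplicated(a),]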
#   Criteria Order Final Low High
# 1        a     1    NA   0   10
# 2        a     1    13  11   20
# 3        a     1    NA  21   30
# 4        a     1     8   0   10
# 5        a     1    NA  11   20
# 7        a     2    NA   5   22
# 8        a     2     2   0    4
# 9        b     2    12   0   18

It has an extra row, which I am still trying to handle elegantly...
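Compared with the expected output in the question, the result above keeps NA rows for the a/1 ranges 0-10 and 11-20 even though each has a match elsewhere, and it has no rows for b/1 because all.x = TRUE only keeps key combinations present in d2. A possible base R variant is sketched below. It is not from the original answer, and it assumes that each (Criteria, Order, Low, High) combination in d1 is unique and that keeping a single in-range Final per range is acceptable. The InRange column is a hypothetical name for the 1/0 indicator the question mentions.

```r
d1 <- data.frame(
  Criteria = c("a", "a", "a", "b", "b", "a", "a", "b"),
  Order    = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
  Low      = c(0L, 11L, 21L, 0L, 14L, 5L, 0L, 0L),
  High     = c(10L, 20L, 30L, 13L, 32L, 22L, 4L, 18L)
)
d2 <- data.frame(
  Criteria = c("a", "b", "a", "a"),
  Order    = c(1L, 2L, 1L, 2L),
  Final    = c(13L, 12L, 8L, 2L)
)

# Keep every d1 row, paired with each d2 row sharing Criteria and Order
m <- merge(d1, d2, by = c("Criteria", "Order"), all.x = TRUE)

# Blank out Final values outside [Low, High]
in_range <- !is.na(m$Final) & m$Low <= m$Final & m$Final <= m$High
m$Final[!in_range] <- NA

# Sort so that, within each d1 range, a non-NA Final comes first,
# then keep one row per range
m <- m[order(m$Criteria, m$Order, m$Low, m$High, is.na(m$Final)), ]
res <- m[!duplicated(m[c("Criteria", "Order", "Low", "High")]), ]
rownames(res) <- NULL

# Optional 1/0 indicator
res$InRange <- as.integer(!is.na(res$Final))
res
```

The rows come back sorted by Criteria, Order, Low and High rather than in the original d1 order; adding a row id to d1 before the merge and sorting on it afterwards would restore that order.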

dplyr

library(dplyr)
full_join(d1, d2) %>%
  # Blank out Final wherever it falls outside [Low, High]
  mutate(Final = if_else(between(Final, Low, High), Final, NA_integer_)) %>%
  group_by(Criteria, Order, Low, High) %>%
  # First in-range Final per d1 range, or NA if there is none
  summarise(Final = Final[!is.na(Final)][1]) %>%
  ungroup()
# Joining, by = c("Criteria", "Order")
# # A tibble: 8 x 5
#   Criteria Order   Low  High Final
#   <chr>    <int> <int> <int> <int>
# 1 a            1     0    10     8
# 2 a            1    11    20    13
# 3 a            1    21    30    NA
# 4 a            2     0     4     2
# 5 a            2     5    22    NA
# 6 b            1     0    13    NA
# 7 b            1    14    32    NA
# 8 b            2     0    18    12

data.table

library(data.table)
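# Non-equi join: for each d1 row, look up d2 rows with matching Criteria
# and Order whose Final lies in [Low, High]; x.Final is Final as stored in d2
as.data.table(d2)[d1, on = .(Final >= Low, Final <= High, Criteria, Order),
                  .(Criteria, Order, Low, High, x.Final)]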
#    Criteria Order Low High x.Final
# 1:        a     1   0   10       8
# 2:        a     1  11   20      13
# 3:        a     1  21   30      NA
# 4:        b     1   0   13      NA
# 5:        b     1  14   32      NA
# 6:        a     2   5   22      NA
# 7:        a     2   0    4       2
# 8:        b     2   0   18      12

(There is also a solution using data.table::foverlaps, which may be faster or more memory-efficient. Please read the link, it is very helpful.)
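As a rough idea of what the foverlaps approach mentioned above could look like, here is a sketch that is not taken from the original answer or the linked post. It assumes each Final value can be treated as a zero-width interval; Final2 is a helper column introduced only for that purpose.

```r
library(data.table)

d1 <- data.table(
  Criteria = c("a", "a", "a", "b", "b", "a", "a", "b"),
  Order    = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
  Low      = c(0L, 11L, 21L, 0L, 14L, 5L, 0L, 0L),
  High     = c(10L, 20L, 30L, 13L, 32L, 22L, 4L, 18L)
)
d2 <- data.table(
  Criteria = c("a", "b", "a", "a"),
  Order    = c(1L, 2L, 1L, 2L),
  Final    = c(13L, 12L, 8L, 2L)
)

# foverlaps() needs an interval on both sides, so represent each Final
# as the zero-width interval [Final, Final2]
d2[, Final2 := Final]

# y must be keyed, with the interval start/end as the last two key columns
setkey(d2, Criteria, Order, Final, Final2)

# For every d1 range, find d2 points that overlap it within the same
# Criteria and Order; nomatch = NA keeps d1 rows without any match
res <- foverlaps(d1, d2,
                 by.x = c("Criteria", "Order", "Low", "High"),
                 type = "any", nomatch = NA)
res[, Final2 := NULL]
res
```

Interval ends are inclusive here, so a Final equal to Low or High counts as a match.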


Data:

d1 <- structure(list(Criteria = c("a", "a", "a", "b", "b", "a", "a", 
"b"), Order = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L), Low = c(0L, 
11L, 21L, 0L, 14L, 5L, 0L, 0L), High = c(10L, 20L, 30L, 13L, 
32L, 22L, 4L, 18L)), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8"))
d2 <- structure(list(Criteria = c("a", "b", "a", "a"), Order = c(1L, 
2L, 1L, 2L), Final = c(13L, 12L, 8L, 2L)), class = "data.frame", row.names = c("1", 
"2", "3", "4"))