I'm fairly new to R, so I'm still learning a lot. I've been searching but couldn't find an answer that fits my problem. I have these two data sets:
d1
Criteria Order Low High
1 a 1 0 10
2 a 1 11 20
3 a 1 21 30
4 b 1 0 13
5 b 1 14 32
6 a 2 5 22
7 a 2 0 4
8 b 2 0 18
and d2
Criteria Order Final
1 a 1 13
2 b 2 12
3 a 1 8
4 a 2 2
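I'd like to know whether there is a way to add an extra column to d1 when d2$Final falls between d1$Low and d1$High and both Criteria and Order match. What I expect to get is something like this: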
Criteria Order Low High Final
1 a 1 0 10 8
2 a 1 11 20 13
3 a 1 21 30 NA
4 b 1 0 13 NA
5 b 1 14 32 NA
6 a 2 5 22 NA
7 a 2 0 4 2
8 b 2 0 18 12
Even a 1/0 output for true/false in place of the number in the Final column would be fine.
Thanks in advance.
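Answer 0 (score: 2)
This uses SQL to perform a complex join. The [...] around Order is needed to distinguish the column name from the SQL keyword of the same name.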
library(sqldf)
sqldf("select d1.*, d2.Final
from d1
left join d2 on d1.Criteria = d2.Criteria and
d1.[Order] = d2.[Order] and
d2.Final between d1.Low and d1.High")
giving the same output as shown in the question:
Criteria Order Low High Final
1 a 1 0 10 8
2 a 1 11 20 13
3 a 1 21 30 NA
4 b 1 0 13 NA
5 b 1 14 32 NA
6 a 2 5 22 NA
7 a 2 0 4 2
8 b 2 0 18 12
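The question also says a 1/0 indicator would be acceptable. One possible variant of the same query is sketched below. It assumes the default SQLite backend of sqldf, where `is not null` evaluates to 1 or 0; the column name InRange is made up for illustration.

```r
library(sqldf)

d1 <- data.frame(
  Criteria = c("a", "a", "a", "b", "b", "a", "a", "b"),
  Order    = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
  Low      = c(0L, 11L, 21L, 0L, 14L, 5L, 0L, 0L),
  High     = c(10L, 20L, 30L, 13L, 32L, 22L, 4L, 18L),
  stringsAsFactors = FALSE
)
d2 <- data.frame(
  Criteria = c("a", "b", "a", "a"),
  Order    = c(1L, 2L, 1L, 2L),
  Final    = c(13L, 12L, 8L, 2L),
  stringsAsFactors = FALSE
)

# Same left join as above; InRange (hypothetical name) is 1 when some
# d2$Final fell inside [Low, High] for that row and 0 otherwise.
flagged <- sqldf("select d1.*, (d2.Final is not null) as InRange
                  from d1
                  left join d2 on d1.Criteria = d2.Criteria and
                                  d1.[Order] = d2.[Order] and
                                  d2.Final between d1.Low and d1.High")
flagged
```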
The data in reproducible form:
Lines1 <- "
Criteria Order Low High
1 a 1 0 10
2 a 1 11 20
3 a 1 21 30
4 b 1 0 13
5 b 1 14 32
6 a 2 5 22
7 a 2 0 4
8 b 2 0 18"
Lines2 <- "
Criteria Order Final
1 a 1 13
2 b 2 12
3 a 1 8
4 a 2 2"
d1 <- read.table(text = Lines1)
d2 <- read.table(text = Lines2)
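Answer 1 (score: 1)
If your data is "large", this solution won't work for you: the cartesian join will explode beyond what a "standard" computer can tolerate in terms of memory.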
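However, if your data is small enough (a very relative notion), you can do a cartesian-style join (a full outer join on the shared keys, which pairs every row of d1 with every row of d2 that has the same Criteria and Order) and then filter the result. (This solution implements one section of https://www.mango-solutions.com/blog/in-between-a-rock-and-a-conditional-join. Other sections there discuss SQL and fuzzyjoin, both of which are worthy candidates.)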
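To judge whether the data is "small enough", the number of matched rows the join will produce can be estimated before running it. The following is only a rough sketch in base R: it counts rows per Criteria/Order key in each table and sums the per-key products. The helper names (key1, key2, shared, matched_rows) are made up for illustration.

```r
d1 <- data.frame(
  Criteria = c("a", "a", "a", "b", "b", "a", "a", "b"),
  Order    = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
  Low      = c(0L, 11L, 21L, 0L, 14L, 5L, 0L, 0L),
  High     = c(10L, 20L, 30L, 13L, 32L, 22L, 4L, 18L),
  stringsAsFactors = FALSE
)
d2 <- data.frame(
  Criteria = c("a", "b", "a", "a"),
  Order    = c(1L, 2L, 1L, 2L),
  Final    = c(13L, 12L, 8L, 2L),
  stringsAsFactors = FALSE
)

# Count rows per join key in each table
key1 <- table(paste(d1$Criteria, d1$Order))
key2 <- table(paste(d2$Criteria, d2$Order))

# Each shared key contributes (rows in d1) * (rows in d2) to the join;
# as.numeric() avoids integer overflow on large tables.
shared <- intersect(names(key1), names(key2))
matched_rows <- sum(as.numeric(key1[shared]) * as.numeric(key2[shared]))
matched_rows  # 9 for the sample data: 3*2 + 2*1 + 1*1
```

If this number is far larger than the row count of either input, one of the non-equi approaches (SQL, data.table) is probably the safer choice.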
Three dialects, depending on your preference.
Base R
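# join on the shared columns (Criteria, Order), keeping every row of d2
a <- merge(d2, d1, all.x = TRUE)
# blank out Final wherever it falls outside [Low, High]
a <- transform(a, Final = ifelse(Low <= Final & Final <= High, Final, NA))
# drop rows that became exact duplicates
a[!duplicated(a),]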
# Criteria Order Final Low High
# 1 a 1 NA 0 10
# 2 a 1 13 11 20
# 3 a 1 NA 21 30
# 4 a 1 8 0 10
# 5 a 1 NA 11 20
# 7 a 2 NA 5 22
# 8 a 2 2 0 4
# 9 b 2 12 0 18
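The result has extra rows (NA rows for interval/value pairs that did not match) and lacks the b/1 intervals, because all.x = TRUE keeps only the key combinations present in d2. I'm still trying to make this work elegantly...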
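For a base R result with exactly one row per row of d1, one possible sketch looks up the match row by row instead of joining. This is a different technique from the merge above, not a fix of it. The names lookup_final, res, and InRange are hypothetical, and when several d2 values fall in the same interval it arbitrarily keeps the first one.

```r
d1 <- data.frame(
  Criteria = c("a", "a", "a", "b", "b", "a", "a", "b"),
  Order    = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
  Low      = c(0L, 11L, 21L, 0L, 14L, 5L, 0L, 0L),
  High     = c(10L, 20L, 30L, 13L, 32L, 22L, 4L, 18L),
  stringsAsFactors = FALSE
)
d2 <- data.frame(
  Criteria = c("a", "b", "a", "a"),
  Order    = c(1L, 2L, 1L, 2L),
  Final    = c(13L, 12L, 8L, 2L),
  stringsAsFactors = FALSE
)

# For row i of d1, return the first d2$Final with matching keys that lies
# inside [Low, High] (inclusive), or NA if there is none.
lookup_final <- function(i) {
  hit <- d2$Final[d2$Criteria == d1$Criteria[i] &
                  d2$Order    == d1$Order[i] &
                  d2$Final    >= d1$Low[i] &
                  d2$Final    <= d1$High[i]]
  if (length(hit) > 0) hit[1] else NA_integer_
}

res <- d1
res$Final   <- vapply(seq_len(nrow(d1)), lookup_final, integer(1))
res$InRange <- as.integer(!is.na(res$Final))  # optional 1/0 flag
res
```

This never builds the intermediate join, so memory stays small, but it scans d2 once per row of d1 and will be slow when both tables are large.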
dplyr
library(dplyr)
full_join(d1, d2) %>%
  # explicit comparison, vectorized over Low and High
  mutate(Final = if_else(Low <= Final & Final <= High, Final, NA_integer_)) %>%
  group_by(Criteria, Order, Low, High) %>%
  # keep the first in-range value per interval (NA if there is none)
  summarise(Final = Final[!is.na(Final)][1]) %>%
  ungroup()
# Joining, by = c("Criteria", "Order")
# # A tibble: 8 x 5
# Criteria Order Low High Final
# <chr> <int> <int> <int> <int>
# 1 a 1 0 10 8
# 2 a 1 11 20 13
# 3 a 1 21 30 NA
# 4 a 2 0 4 2
# 5 a 2 5 22 NA
# 6 b 1 0 13 NA
# 7 b 1 14 32 NA
# 8 b 2 0 18 12
data.table
library(data.table)
# non-equi join; >= and <= make the bounds inclusive, like SQL's BETWEEN
as.data.table(d2)[d1, on = .(Final >= Low, Final <= High, Criteria, Order),
                  .(Criteria, Order, Low, High, x.Final)]
# Criteria Order Low High x.Final
# 1: a 1 0 10 8
# 2: a 1 11 20 13
# 3: a 1 21 30 NA
# 4: b 1 0 13 NA
# 5: b 1 14 32 NA
# 6: a 2 5 22 NA
# 7: a 2 0 4 2
# 8: b 2 0 18 12
(There is also a solution using data.table::foverlaps, which may be faster or more memory-efficient. Please read the link; it is very helpful.)
Data:
d1 <- structure(list(Criteria = c("a", "a", "a", "b", "b", "a", "a",
"b"), Order = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L), Low = c(0L,
11L, 21L, 0L, 14L, 5L, 0L, 0L), High = c(10L, 20L, 30L, 13L,
32L, 22L, 4L, 18L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8"))
d2 <- structure(list(Criteria = c("a", "b", "a", "a"), Order = c(1L,
2L, 1L, 2L), Final = c(13L, 12L, 8L, 2L)), class = "data.frame", row.names = c("1",
"2", "3", "4"))