Question

Data frame d1:

x  y
4 10
6 20
7 30

Data frame d2:

x   z
3 100
6 200
9 300

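How can I merge d1 and d2 on "x", where d1$x should be matched against either an exact match or the next higher number in d2$x? The output should look like this:

x  y   z
4 10 200  # (4 is matched against next higher value that is 6)
6 20 200  # (6 is matched against 6)
7 30 300  # (7 is matched against next higher value that is 9)

If merge() cannot do this, is there another way? Loops are very slow.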

Answer 1

This is very simple with a rolling join in data.table:

require(data.table)   ## >= 1.9.2
setkey(setDT(d1), x)  ## convert to data.table, set key for the column to join on 
setkey(setDT(d2), x)  ##  same as above

d2[d1, roll=-Inf]

#    x   z  y
# 1: 4 200 10
# 2: 6 200 20
# 3: 7 300 30
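On more recent data.table versions the same rolling join can also be written without setting keys, by naming the join column in the on argument. The following is only a minimal sketch, assuming data.table >= 1.9.6 (where on was introduced) and the sample data from the question; dt1, dt2 and res are illustrative names, not part of the original answer.

```r
library(data.table)

# Sample data from the question (dt1/dt2 are illustrative names)
dt1 <- data.table(x = c(4, 6, 7), y = c(10, 20, 30))
dt2 <- data.table(x = c(3, 6, 9), z = c(100, 200, 300))

# For each row of dt1, look up x in dt2. Where there is no exact match,
# roll = -Inf rolls the next higher dt2 row backwards onto that x.
res <- dt2[dt1, on = "x", roll = -Inf]
print(res)
```

Unlike setkey(), this leaves the original tables unsorted and unkeyed, which may matter if they are reused elsewhere.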

Answer 2

Input data:

d1 <- data.frame(x=c(4,6,7), y=c(10,20,30))
d2 <- data.frame(x=c(3,6,9), z=c(100,200,300))

You essentially want to extend d1 with a new column, so let's copy it.

d3 <- d1

Next, I assume that d2$x is sorted in non-decreasing order and that max(d1$x) <= max(d2$x).

d3$z <- sapply(d1$x, function(x) d2$z[which(x <= d2$x)[1]])

This reads: for each x in d1$x, find the smallest value in d2$x that is not smaller than x, and take the corresponding value of d2$z.

Under these assumptions, the above can also be written as follows (and should be a bit faster):

d3$z <- sapply(d1$x, function(x) d2$z[which.max(x <= d2$x)])

As a result we get:

d3
##   x  y   z
## 1 4 10 200
## 2 6 20 200
## 3 7 30 300
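The assumption max(d1$x) <= max(d2$x) matters because the two variants behave differently when it is violated. A small sketch with a hypothetical value (10) that exceeds every value in d2$x:

```r
d2 <- data.frame(x = c(3, 6, 9), z = c(100, 200, 300))
x_too_big <- 10  # hypothetical value larger than max(d2$x)

# which() finds no match, so [1] yields NA and the lookup returns NA
z_which <- d2$z[which(x_too_big <= d2$x)[1]]

# which.max() on an all-FALSE vector returns 1, so the first z is returned
z_which_max <- d2$z[which.max(x_too_big <= d2$x)]

print(z_which)
print(z_which_max)
```

The which() variant signals the missing match with NA, whereas the which.max() variant silently returns the z of the first row, so checking the assumption beforehand is advisable when using the latter.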

EDIT1: Inspired by @MatthewLundberg's cut-based solution, here is another one using findInterval:

d3$z <- d2$z[findInterval(d1$x, d2$x+1)+1]
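The +1 shift applied to d2$x relies on the x values being whole numbers; a value such as 6.5 would be matched to 6 rather than 9. For non-integer data, one possible variant uses the left.open argument of findInterval (present in reasonably recent versions of R). This is a sketch under the same sortedness assumption, with made-up fractional data:

```r
# Made-up data with non-integer x values
d1 <- data.frame(x = c(4.5, 6, 6.5), y = c(10, 20, 30))
d2 <- data.frame(x = c(3, 6, 9), z = c(100, 200, 300))

# With left.open = TRUE the intervals are (d2$x[i], d2$x[i + 1]], so the
# result counts the d2$x values strictly below each d1$x; adding 1 gives
# the index of the first d2$x that is >= d1$x.
idx <- findInterval(d1$x, d2$x, left.open = TRUE) + 1

d3 <- d1
d3$z <- d2$z[idx]
print(d3)
```

As with the other base R variants, a d1$x larger than max(d2$x) produces an index past the end of d2$z and therefore NA.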

EDIT2: (Benchmark)

Example data:

set.seed(123)
d1 <- data.frame(x=sort(sample(1:10000, 1000)), y=sort(sample(1:10000, 1000)))
d2 <- data.frame(x=sort(c(sample(1:10000, 999), 10000)), z=sort(sample(1:10000, 1000)))

Results:

microbenchmark::microbenchmark(
{d3 <- d1; d3$z <- d2$z[findInterval(d1$x, d2$x+1)+1] },
{d3 <- d1; d3$z <- sapply(d1$x, function(x) d2$z[which(x <= d2$x)[1]]) },
{d3 <- d1; d3$z <- sapply(d1$x, function(x) d2$z[which.max(x <= d2$x)]) },
{d1$x2 <- d2$x[as.numeric(cut(d1$x, c(-Inf, d2$x, Inf)))]; merge(d1, d2, by.x='x2', by.y='x')},
{d1a <- d1; setkey(setDT(d1a), x); d2a <- d2; setkey(setDT(d2a), x); d2a[d1a, roll=-Inf] }
)
## Unit: microseconds
##         expr       min            lq    median        uq       max neval
## findInterval   221.102      1357.558  1394.246  1429.767  17810.55   100
## which        66311.738     70619.518 85170.175 87674.762 220613.09   100
## which.max    69832.069     73225.755 83347.842 89549.326 118266.20   100
## cut           8095.411      8347.841  8498.486  8798.226  25531.58   100
## data.table    1668.998      1774.442  1878.028  1954.583  17974.10   100

Answer 3

cut can be used to find, for each value in d1$x, the matching value in d2$x.

The indices of the matches are computed with cut as follows:

as.numeric(cut(d1$x, c(-Inf, d2$x, Inf)))
## [1] 2 2 3

These are the corresponding values:

d2$x[as.numeric(cut(d1$x, c(-Inf, d2$x, Inf)))]
## [1] 6 6 9

These can be added to d1, and the merge performed:

d1$x2 <- d2$x[as.numeric(cut(d1$x, c(-Inf, d2$x, Inf)))]
merge(d1, d2, by.x='x2', by.y='x')
##   x2 x  y   z
## 1  6 4 10 200
## 2  6 6 20 200
## 3  9 7 30 300

The added column can be removed afterwards if desired.
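Putting the steps together, including the removal of the helper column, might look like the sketch below. It uses the sample data from the question and passes labels = FALSE so that cut returns integer codes directly, a small variation on the as.numeric call above; res is an illustrative name.

```r
d1 <- data.frame(x = c(4, 6, 7), y = c(10, 20, 30))
d2 <- data.frame(x = c(3, 6, 9), z = c(100, 200, 300))

# labels = FALSE makes cut() return integer interval codes directly.
# Intervals are right-closed by default, so 6 falls into (3, 6].
d1$x2 <- d2$x[cut(d1$x, c(-Inf, d2$x, Inf), labels = FALSE)]

res <- merge(d1, d2, by.x = "x2", by.y = "x")
res$x2 <- NULL  # drop the helper column, leaving x, y, z
print(res)
```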

Answer 4

Try:

sapply(d1$x, function(y) d2$z[d2$x >= y][which.min(abs(y - d2$x[d2$x >= y]))])

Number comparison during merge in R