我有以下载体
> X <- c(1,1,3,4)
> a <- c(1,1,2,2)
> b <- c(2,1,4,3)
> c <- c(2,1,4,6)
我想比较X的每个元素与a,b和c的相应元素,最后我需要一个分配给X的每一行的类,例如。
X的第一个元素是1,它在相应的元素向量a中匹配,然后我需要将一个类指定为'1-1'(无论从哪个向量得到匹配)
X的第二个元素是1,它也有匹配(实际上是3)所以同样的类是'1-1'
X的第三个元素是3,它没有匹配,那么我应该寻找下一个整数值,即4,有4(在b和c中)。所以课程应该是'3-4'
X的第四个元素是4,它没有匹配项。也没有5(下一个整数)然后它应该寻找前面的整数,它是3,并且有3.所以该类应该是'4-3'
实际上我每个向量都有数千行,我必须为每一行执行此操作。有任何建议以较简单的方式进行。我更喜欢使用R的基本函数。
答案 0 :(得分:2)
根据rbatt的评论和回答,我意识到我原来的答案非常缺乏。这是重做......
match_nearest <- function( x, table )
{
dist <- x - table
tgt <- which( dist < 0, arr.ind=TRUE, useNames=F )
dist[tgt] <- abs( dist[tgt] + .5 )
table[ cbind( seq_along(x), max.col( -dist, ties.method="first" ) ) ]
}
X <- c(1,1,3,4)
a <- c(1,1,2,2)
b <- c(2,1,4,3)
c <- c(2,1,4,6)
paste(X, match_nearest(X, cbind(a,b,c) ), sep="-")
## [1] "1-1" "1-1" "3-4" "4-3"
与原始答案和 rbatt相比,我们发现两者都不正确!
set.seed(1)
X <- rbinom(n=1E4, size=10, prob=0.5)
a <- rbinom(n=1E4, size=10, prob=0.5)
b <- rbinom(n=1E4, size=10, prob=0.5)
c <- rbinom(n=1E4, size=10, prob=0.5)
T <- current_solution(X,a,b,c)
R <- rbatt_solution(X,a,b,c)
all.equal( T, R )
## [1] "195 string mismatches"
# Look at mismatched rows...
mismatch <- head( which( T != R ) )
cbind(X,a,b,c)[mismatch,]
## X a b c
## [1,] 4 6 3 3
## [2,] 5 7 4 7
## [3,] 5 8 3 9
## [4,] 5 7 7 4
## [5,] 4 6 3 7
## [6,] 5 7 4 2
T[mismatch]
## [1] "4-3" "5-4" "5-3" "5-4" "4-3" "5-4"
R[mismatch]
## [1] "4-6" "5-7" "5-8" "5-7" "4-6" "5-7"
并且不必要地放慢......
library(microbenchmark)
bm <- microbenchmark( current_solution(X,a,b,c),
previous_solution(X,a,b,c),
rbatt_solution(X,a,b,c) )
print(bm, order="median")
## Unit: milliseconds
## expr min lq median uq max neval
## current_solution(X, a, b, c) 7.088 7.298 7.996 8.268 38.25 100
## rbatt_solution(X, a, b, c) 33.920 38.236 46.524 53.441 85.50 100
## previous_solution(X, a, b, c) 83.082 93.869 101.997 115.961 135.98 100
看起来current_solution
正确;但没有预期输出 ......
这里的功能......
current_solution <- function(X,a,b,c) {
paste(X, match_nearest(X, cbind(a,b,c) ), sep="-")
}
# DO NOT USE... it is wrong!
previous_solution <- function(X,a,b,c) {
dat <- rbind(X,a,b,c)
v <- apply(dat,2, function(v) {
v2 <- v[1] - v
v2[v2<0] <- abs( v2[v2<0]) - 1
v[ which.min( v2[-1] ) + 1 ]
})
paste("X", v, sep="-")
}
# DO NOT USE... it is wrong!
rbatt_solution <- function(X,a,b,c) {
mat <- cbind(X,a,b,c)
diff.signed <- mat[,"X"]-mat[,c("a","b","c")]
diff.break <- abs(diff.signed) + sign(diff.signed)*0.5
min.ind <- apply(diff.break, 1, which.min)
ind.array <- matrix(c(1:nrow(mat),min.ind), ncol=2)
match.value <- mat[,c("a","b","c")][ind.array]
ref.class <- paste(X, match.value, sep="-")
ref.class
}
答案 1 :(得分:1)
此解决方案应提供您想要的输出。此外,它比Thell的解决方案快3倍,因为差异是矢量化的,不是用apply
逐行计算的。
我比较下面两种方法的时间。请注意,如果您希望“class”作为data.frame中的另一列,则只需取消注释我函数的最后一行。我评论它使两个答案之间的计算时间更具可比性(创建data.frame非常慢)。
# Example data from Thell, plus 1 more
X1 <- c(1,1,3,4,7,1, 5)
a1 <- c(1,1,2,2,2,2, 9)
b1 <- c(2,1,4,3,3,3, 3)
c1 <- c(2,1,4,6,6,6, 7)
# Random example data, much larger
# X1 <- rbinom(n=1E4, size=10, prob=0.5)
# a1 <- rbinom(n=1E4, size=10, prob=0.5)
# b1 <- rbinom(n=1E4, size=10, prob=0.5)
# c1 <- rbinom(n=1E4, size=10, prob=0.5)
rbTest <- function(){
mat <- cbind(X1,a1,b1,c1)
diff.signed <- mat[,"X1"]-mat[,c("a1","b1","c1")] # differences (with sign)
diff.break <- abs(diff.signed) + sign(diff.signed)*0.5 # penalize for differences that are negative by adding 0.5 to them (break ties by preferring higher integer)
min.ind <- apply(diff.break, 1, which.min) # index of smallest difference (prefer larger integers when there is a tie)
ind.array <- matrix(c(1:nrow(mat),min.ind), ncol=2) # array index format
match.value <- mat[,c("a1","b1","c1")][ind.array] # value of the smallest difference (value of the match)
ref.class <- paste(X1, match.value, sep="-") # the 'class' in the format 'ref-match'
ref.class
# data.frame(class=ref.class, mat)
}
thTest <- function(){
dat <- rbind(X1,a1,b1,c1)
apply(dat,2, function(v) {
# Get distance
v2 <- v[1] - v
# Prefer values >= v[1]
v2[v2<0] <- abs( v2[v2<0]) - 1
# Obtain and return nearest v excluding v[1]
v[ which.min( v2[-1] ) + 1 ]
})
}
# > microbenchmark(rbTest(), thTest())
# Unit: milliseconds
# expr min lq median uq max neval
# rbTest() 47.95451 52.01729 59.36161 71.94076 103.1314 100
# thTest() 167.49798 180.69627 195.02828 204.19916 315.0610 100
# > microbenchmark(rbTest(), thTest())
# Unit: microseconds
# expr min lq median uq max neval
# rbTest() 108.299 112.3550 115.4225 119.4630 146.722 100
# thTest() 147.727 152.2015 155.9005 159.3115 235.898 100
# > rbTest()
# [1] "1-1" "1-1" "3-4" "4-3" "7-6" "1-2" "5-7" "6-1"
# > thTest()
# [1] 1 1 4 3 6 2 7