I have two very large vectors that I need to concatenate with a delimiter to form unique IDs. For example:
set.seed(1)
vec1 <- sample(1:10, 10000000, replace = TRUE)
vec2 <- sample(1:1000000000, 10000000)
I am currently using paste0():
system.time({
uniq_id <- paste0(vec1, "_", vec2)
})
However, because of the size of vec1 and vec2, this is very slow. Is there a faster alternative?
Answer 0 (score: 2)
A more efficient approach is stringi::stri_c:
library(microbenchmark)
b <- microbenchmark(
paste = paste0(vec1, "_", vec2),
stringi = stringi::stri_c(vec1, vec2, sep = "_"),
times = 10
)
Results:
b
# Unit: seconds
#     expr      min       lq     mean   median       uq      max neval cld
#    paste 5.475398 5.509957 5.544477 5.542728 5.566904 5.632173    10   b
#  stringi 3.862541 3.871826 3.896242 3.897264 3.914894 3.934175    10  a
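Before switching, it may be worth confirming that both calls yield the same IDs. The following is a minimal sketch on a small sample, assuming the stringi package is installed; the names v1, v2, ids_base and ids_stringi are illustrative and do not come from the answer:

```r
library(stringi)

set.seed(1)
# Small integer inputs shaped like the vectors in the question
v1 <- sample(1:10, 1000, replace = TRUE)
v2 <- sample(1:1000000000, 1000)

# Base R: the delimiter is passed as a middle argument
ids_base <- paste0(v1, "_", v2)

# stringi: the delimiter is passed through sep
ids_stringi <- stri_c(v1, v2, sep = "_")

# Compare the two results element by element
same <- identical(ids_base, ids_stringi)
print(same)
```

If the comparison holds on a sample of your real data, the stringi call can be dropped in without changing downstream joins or lookups that rely on the IDs.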
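Answer 1 (score: 1)
Comparing paste, paste0 (R version 4.1.0), stringi::stri_c (version 1.6.2) and stringr::str_c (version 1.4.0), I could not observe a large performance difference, but this may depend on what is being concatenated. It makes a big difference whether the inputs are numeric or character, and whether the characters consist of digits or letters. When the inputs are letters only, stringi and stringr seem to be faster than paste.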
# Expressions to benchmark; the S suffix marks variants that pass the delimiter via sep
M <- alist(
paste0 = paste0(vec1, "_", vec2)
, paste = paste(vec1, "_", vec2, sep = "")
, pasteS = paste(vec1, vec2, sep = "_")
, stringi = stringi::stri_c(vec1, "_", vec2)
, stringiS = stringi::stri_c(vec1, vec2, sep = "_")
, stringr = stringr::str_c(vec1, "_", vec2)
, stringrS = stringr::str_c(vec1, vec2, sep = "_")
)
# Case 1: integer inputs
set.seed(42)
n <- 1e5
vec1 <- sample(1:10, n, TRUE)
vec2 <- sample(1:1000000000, n, TRUE)
bench::mark(exprs = M)
#   expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#   <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
# 1 paste0      62.8ms  63.9ms      15.6    2.29MB     2.23     7     1      447ms
# 2 paste       61.9ms    63ms      15.9    2.29MB     0        8     0      503ms
# 3 pasteS      57.5ms  58.6ms      17.1    2.29MB     2.13     8     1      468ms
# 4 stringi     57.1ms  57.6ms      17.2    2.29MB     0        9     0      524ms
# 5 stringiS    56.2ms  66.2ms      14.4    2.29MB     2.40     6     1      417ms
# 6 stringr     57.9ms  62.9ms      14.8    2.29MB     0        8     0      541ms
# 7 stringrS      55ms  61.4ms      15.3    2.29MB     0        8     0      523ms
# Case 2: the same values, converted to character before benchmarking
vec1 <- as.character(vec1)
vec2 <- as.character(vec2)
bench::mark(exprs = M)
#   expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#   <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
# 1 paste0      34.2ms  35.3ms      28.2     781KB     2.17    13     1      460ms
# 2 paste       35.1ms  35.7ms      27.9     781KB     0       14     0      502ms
# 3 pasteS        32ms  33.5ms      29.9     781KB     2.14    14     1      468ms
# 4 stringi     33.7ms  35.6ms      28.1     781KB     0       15     0      534ms
# 5 stringiS    32.6ms  33.9ms      29.6     781KB     2.12    14     1      472ms
# 6 stringr     34.6ms  34.9ms      28.5     781KB     0       15     0      526ms
# 7 stringrS    33.1ms  33.4ms      29.7     781KB     2.12    14     1      471ms
# Case 3: single digits stored as character
set.seed(42)
n <- 1e5
vec1 <- as.character(sample(0:9, n, TRUE))
vec2 <- as.character(sample(0:9, n, TRUE))
bench::mark(exprs = M)
#   expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#   <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
# 1 paste0      18.9ms    19ms      52.4     781KB     2.02    26     1      496ms
# 2 paste       18.9ms    19ms      52.5     781KB     0       27     0      514ms
# 3 pasteS      15.2ms  15.3ms      65.3     781KB     2.04    32     1      490ms
# 4 stringi     15.1ms  15.1ms      65.7     781KB     0       33     0      502ms
# 5 stringiS    13.5ms  13.5ms      73.7     781KB     2.05    36     1      489ms
# 6 stringr     15.1ms  15.2ms      65.7     781KB     2.05    32     1      487ms
# 7 stringrS    13.4ms  13.5ms      73.3     781KB     0       37     0      505ms
# Case 4: single letters
set.seed(42)
n <- 1e5
vec1 <- sample(letters, n, TRUE)
vec2 <- sample(LETTERS, n, TRUE)
bench::mark(exprs = M)
#   expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#   <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
# 1 paste0     15.98ms 16.02ms      61.5     781KB     2.05    30     1      488ms
# 2 paste      16.02ms 16.09ms      62.1     781KB     2.07    30     1      483ms
# 3 pasteS     11.96ms 12.03ms      83.0     781KB     2.02    41     1      494ms
# 4 stringi     7.97ms  8.07ms      123.     781KB     4.18    59     2      478ms
# 5 stringiS    6.37ms  6.43ms      154.     781KB     4.12    75     2      486ms
# 6 stringr     7.97ms  8.02ms      124.     781KB     2.04    61     1      491ms
# 7 stringrS    6.43ms  6.49ms      153.     781KB     4.09    75     2      489ms
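One possible reading of the runs above is that converting integers to character accounts for a sizeable share of the time when numeric vectors are passed in directly. The sketch below uses only base R to time the conversion separately from the concatenation. The variable names (c1, c2, id_int, id_chr, timings) are illustrative, and a single system.time() call is a rough measurement, so the numbers will vary by machine and R version:

```r
set.seed(42)
n <- 1e5
vec1 <- sample(1:10, n, TRUE)
vec2 <- sample(1:1000000000, n, TRUE)

# Time the integer-to-character conversion on its own
t_convert <- system.time({
  c1 <- as.character(vec1)
  c2 <- as.character(vec2)
})

# Time concatenation when paste0 has to convert the integers itself
t_paste_int <- system.time(id_int <- paste0(vec1, "_", vec2))

# Time concatenation on the pre-converted character vectors
t_paste_chr <- system.time(id_chr <- paste0(c1, "_", c2))

# Both routes should produce the same IDs
stopifnot(identical(id_int, id_chr))

# Elapsed seconds for each step
timings <- c(
  convert   = t_convert[["elapsed"]],
  paste_int = t_paste_int[["elapsed"]],
  paste_chr = t_paste_chr[["elapsed"]]
)
print(timings)
```

If the conversion step dominates on your data, the choice between paste0, stri_c and str_c matters less than how the inputs are stored, which would be consistent with the small differences seen in the integer case.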