performance

Question

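The gist is as follows:

I wrote a function that takes one argument, an alphanumeric string, and should return a string in which every character of the input has been swapped for its value under some 'mapping'. An MRE follows: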

library(data.table)
# Lookup table: each original character and the value it is switched to
map = data.table(mapped = c(0:35), original = c(0:9,LETTERS))
#the function that I'm using:
as_numbers <- function(string) {
  #split string unlisted
  vector_unlisted <- unlist(strsplit(string,""))
  #match the string in vector
  for (i in 1:length(vector_unlisted)) {

    vector_unlisted[i] <- subset(map, map$original==vector_unlisted[i])[[1]][1]

  }
  vector_unlisted <- paste0(vector_unlisted, collapse = "")

  return(vector_unlisted)
}

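I am trying to get rid of the for loop to improve performance. The function works correctly, but it is very slow for the number of elements I pass to it in this form: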

unlist(lapply(dat$alphanum, function(x) as_numbers(x)))

An example input string is 549300JV8KEETQJYUG13. This should produce a string like 5493001931820141429261934301613.

Supplying just one string in this case:

> as_numbers("549300JV8KEETQJYUG13")
[1] "5493001931820141429261934301613"

Answer 1

We can use base conversion:

#input and expected output
x <- "549300JV8KEETQJYUG13"
# "5493001931820141429261934301613"

#output
res <- paste0(strtoi(unlist(strsplit(x, "")), base = 36), collapse = "")

#test output
as_numbers(x) == res
# [1] TRUE
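
Why this works: strtoi() with base = 36 reads each single character as one base-36 digit, so 0-9 become 0-9 and A-Z become 10-35, which happens to be exactly the mapping in the question. A minimal sketch that also accepts a vector of strings (the helper name as_numbers_base36 is hypothetical, not from the post):

```r
# Hypothetical helper: treat every character as a single base-36 digit.
as_numbers_base36 <- function(strings) {
  vapply(strsplit(strings, ""), function(chars) {
    paste0(strtoi(chars, base = 36L), collapse = "")
  }, character(1))
}

# Single characters map to their base-36 digit values
strtoi(c("0", "9", "A", "Z"), base = 36L)
# [1]  0  9 10 35

as_numbers_base36(c("549300JV8KEETQJYUG13", "5493V8KE300J"))
# [1] "5493001931820141429261934301613" "5493318201430019"
```

Note that strtoi() also accepts lowercase letters, whereas a lookup against LETTERS would return NA for them, so the two approaches only agree on uppercase input.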

Performance

Since this post is about performance, here is a benchmark* of 3 solutions:

#input set up
map = data.table(mapped = c(0:35), original = c(0:9,LETTERS))
x <- rep(c("549300JV8KEETQJYUG13", "5493V8KE300J"), 1000)

#define functions
base_f <- function(string) {
  sapply(string, function(x) {
    paste0(strtoi(unlist(strsplit(x, "")), base = 36), collapse = "")
    })
  }

match_f <- function(string) {
  mapped <- map$mapped
  original <- map$original
  sapply(strsplit(string, ""), function(y) {
    paste0(mapped[match(y, original)], collapse= "")})
  }

reduce_f <- function(string) {
  Reduce(function(string,r) 
    gsub(map$original[r],
         map$mapped[r], string, fixed = TRUE),
    seq_len(nrow(map)), string)
  }

#test if all return same output
all(base_f(x) == match_f(x))
# [1] TRUE
all(base_f(x) == reduce_f(x))
# [1] TRUE

library(rbenchmark)
benchmark(replications = 1000,
          base_f(x),
          match_f(x),
          reduce_f(x))
#          test replications elapsed relative user.self sys.self user.child sys.child
# 1   base_f(x)         1000   22.15    4.683     22.12        0         NA        NA
# 2  match_f(x)         1000   19.18    4.055     19.11        0         NA        NA
# 3 reduce_f(x)         1000    4.73    1.000      4.72        0         NA        NA

*Note: microbenchmark() kept throwing warnings, so the rbenchmark package was used instead. Feel free to test with other libraries and update this post.

Answer 2

Using Reduce and gsub, you can define the following function:

replacer <- function(x) Reduce(function(x,r) gsub(map$original[r],
             map$mapped[r], x, fixed=TRUE), seq_len(nrow(map)),x)


# Let's test it
replacer("549300JV8KEETQJYUG13")
#[1] "5493001931820141429261934301613"
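
To see what Reduce is doing here, the intermediate results can be kept with accumulate = TRUE. The sketch below is only an illustration: it uses two plain vectors (original and mapped, names assumed here) in place of the map table so that it runs without data.table, and a short made-up input "A1Z".

```r
# Plain vectors standing in for map$original and map$mapped (assumption)
original <- c(0:9, LETTERS)
mapped   <- 0:35

# One gsub pass per mapping row; accumulate = TRUE keeps every intermediate string
steps <- Reduce(
  function(x, r) gsub(original[r], mapped[r], x, fixed = TRUE),
  seq_along(original), "A1Z", accumulate = TRUE
)

# Most passes change nothing; only the rows for "A" and "Z" alter this input
unique(steps)
# [1] "A1Z"   "101Z"  "10135"
```

The chained replacements are safe for this particular mapping because the digits map to themselves, so digits produced by an earlier letter replacement are never altered by a later pass.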

Answer 3

This looks like a merge:

map[as.data.table(unlist(strsplit(string, ""))),
    .(mapped), on = c(original = "V1")][ , paste0(mapped, collapse = "")]

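Note that both "D1" and "1V" will be mapped to "131"...

On your example the output is: "5493001931820141429261934301613"

If you actually want this to be a reversible mapping, you could use a separator such as "." (that is, collapse = "." in the paste0 call).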
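
The ambiguity and the effect of a separator can be checked with a small sketch. The encode and decode helpers below are hypothetical and use plain vectors in place of the map table; they only illustrate the idea and are not part of the answer.

```r
# Plain vectors standing in for map$original and map$mapped (assumption)
original <- c(0:9, LETTERS)
mapped   <- 0:35

# Hypothetical helper: map one string, joining the codes with `collapse`
encode <- function(string, collapse = "") {
  chars <- strsplit(string, "")[[1]]
  paste0(mapped[match(chars, original)], collapse = collapse)
}

# Hypothetical helper: invert a "."-separated encoding
decode <- function(encoded) {
  codes <- as.integer(strsplit(encoded, ".", fixed = TRUE)[[1]])
  paste0(original[match(codes, mapped)], collapse = "")
}

c(encode("D1"), encode("1V"))            # both collapse to "131"
c(encode("D1", "."), encode("1V", "."))  # "13.1" and "1.31" stay distinct
decode(encode("549300JV8KEETQJYUG13", "."))
# [1] "549300JV8KEETQJYUG13"
```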

Answer 4

I would use match:

as_numbers <- function(string) {
  lapply(strsplit(string, ""), function(y) {
    paste0(map$mapped[match(y, map$original)], collapse= "")})
}

as_numbers(c("549300JV8KEETQJYUG13", "5493V8KE300J"))
#[[1]]
#[1] "5493001931820141429261934301613"
#
#[[2]]
#[1] "5493318201430019"

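The added lapply call handles inputs of length > 1 correctly.

If you need to speed this up further, you can store map$mapped and map$original in separate vectors and use those in the match call instead of map$..., so that you don't need to subset the data.frame / data.table many times (which is quite expensive).

Since the Q is about performance, here is a benchmark of two solutions: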
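
Going one step further in the same direction, match can be called once for all characters of all strings, with the codes regrouped per string afterwards. This is only a sketch of that idea, not something benchmarked in this thread; the function name match_once and the plain vectors original and mapped are assumptions introduced for illustration.

```r
# Plain vectors standing in for map$original and map$mapped (assumption)
original <- c(0:9, LETTERS)
mapped   <- 0:35

# Hypothetical variant: a single match() call over every character
match_once <- function(strings) {
  chars <- strsplit(strings, "")
  codes <- mapped[match(unlist(chars), original)]
  # Group index per character; explicit levels keep empty strings in the result
  grp <- factor(rep(seq_along(chars), lengths(chars)),
                levels = seq_along(chars))
  unname(vapply(split(codes, grp), paste0, character(1), collapse = ""))
}

match_once(c("549300JV8KEETQJYUG13", "5493V8KE300J"))
# [1] "5493001931820141429261934301613" "5493318201430019"
```

Unlike the lapply version, this returns a character vector rather than a list, so the two are not identical() without an unlist().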

map = data.table(mapped = c(0:35), original = c(0:9,LETTERS))
x <- rep(c("549300JV8KEETQJYUG13", "5493V8KE300J"), 1000)

ascii_func <- function(string) {
  lapply(string, function(x) {
    x_ascii <- strtoi(charToRaw(x), 16)
    paste(ifelse(x_ascii >= 65 & x_ascii <= 90,
                  x_ascii - 55, x_ascii - 48),
                  collapse = "")
  })
}

match_func <- function(string) {
  mapped <- map$mapped
  original <- map$original
    lapply(strsplit(string, ""), function(y) {
      paste0(mapped[match(y, original)], collapse= "")})
}

library(microbenchmark)
microbenchmark(ascii_func(x), match_func(x), times = 25L)
#Unit: milliseconds
#          expr   min    lq  mean median     uq    max neval
# ascii_func(x) 83.47 92.55 96.91  96.82 103.06 112.07    25
# match_func(x) 24.30 24.74 26.86  26.11  28.67  31.55    25

identical(ascii_func(x), match_func(x))
#[1] TRUE

Improving performance by moving away from a for loop

4 answers:
