Question

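What is the preferred approach?

Until recently, I thought that using `[` with a named key vector to recode another vector was a robust and preferred R idiom for a common task. Is there a better way to do this?

Details of the task: I have a character vector of length about 1e6, in which each element is a single-character string. I want to convert this vector to numeric so that the characters ("B", "H", "K", "M"), which are abbreviations for orders of magnitude, become numbers (H = 100, M = 1e6, and so on). Any other character outside this set of four, as well as any NA, should become 1.

After a lot of trial and error, I have traced the problem to the fact that NAs in the subsetting vector slow the operation down enormously. I find this fundamentally confusing, because it seems to me that subsetting with NA should be faster: it does not even need to search the vector being subset, it only has to return an NA.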

y <-  c("B", "H", "K", "M")
without_NA <- sample(rep_len(y, 1e6))
with_NA <- sample(rep_len(c(y, NA), 1e6))

convert_exponent_char_to_numeric <- function(exponent) {
  exponent_key <- 10^c(2, 3*1:3)
  names(exponent_key) <- c("H", "K", "M", "B")

  out <- exponent_key[exponent]
  out[is.na(out)] <- 1
  out
}

system.time(convert_exponent_char_to_numeric(without_NA))
   user  system elapsed 
  0.136   0.011   0.147 
system.time(convert_exponent_char_to_numeric(with_NA))
   user  system elapsed 
303.342   0.691 304.237
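
To see how the cost scales without waiting five minutes, the same lookup can be timed on much smaller inputs. The following is only a sketch: the helper `time_lookup` and the chosen sizes are assumptions for illustration, and the actual timings depend on the R version and the machine.

```r
# Hypothetical helper: time a name-based lookup on n elements,
# with or without NA among the subsetting names.
time_lookup <- function(n, with_na) {
  y    <- c("B", "H", "K", "M")
  pool <- if (with_na) c(y, NA) else y
  x    <- sample(rep_len(pool, n))
  key  <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  system.time(key[x])[["elapsed"]]
}

sizes <- c(1e4, 2e4, 4e4)
timings <- data.frame(
  n          = sizes,
  without_NA = vapply(sizes, time_lookup, numeric(1), with_na = FALSE),
  with_NA    = vapply(sizes, time_lookup, numeric(1), with_na = TRUE)
)
timings
```

If the `with_NA` column grows much faster than `n` while `without_NA` stays close to zero, that is consistent with the slowdown being tied to the number of NAs rather than to the length of the vector.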

Answer 1

Here is a workaround that avoids the slowdown caused by the extra code invoked when NAs are detected:

y          <-  c("B", "H", "K", "M")
without_NA <- sample(rep_len(y, 1e6))
with_NA    <- sample(rep_len(c(y, NA), 1e6))
with_NA[is.na(with_NA)] <- "NA"

convert_exponent_char_to_numeric <- function(exponent) {
  exponent_key <- 10^c(2, 3*1:3)
  exponent_key <- c(exponent_key, 1)
  names(exponent_key) <- c("H", "K", "M", "B", "NA")

  out <- exponent_key[exponent]
  out
}

system.time(convert_exponent_char_to_numeric(without_NA))

   user  system elapsed 
   0.03    0.01    0.04

system.time(convert_exponent_char_to_numeric(with_NA))

   user  system elapsed 
   0.04    0.01    0.05

Now both run in well under 1 second. The extra 1/100 of a second for the with_NA version is simply because there are 5 levels to match instead of 4.
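
As an alternative sketch (not part of the answer above), the recoding can also go through `match()`, which converts the names to integer positions first, so `[` is never given character NAs. The function name `convert_with_match` is hypothetical. It assumes, as the question states, that anything outside the four letters, including NA, should map to 1.

```r
# Hypothetical helper: recode via integer positions instead of names.
convert_with_match <- function(exponent) {
  exponent_key <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

  # match() returns NA_integer_ for NA input and for unknown characters
  idx <- match(exponent, names(exponent_key))

  out <- unname(exponent_key[idx])  # integer subsetting; NA positions give NA
  out[is.na(idx)] <- 1              # NA and unrecognised characters become 1
  out
}

converted <- convert_with_match(c("H", "K", NA, "M", "B", "Z"))
converted
```

For this small input the result is 100, 1000, 1, 1e6, 1e9, 1. Unlike the sentinel-string approach, this version does not modify the input vector and also covers characters outside the key.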

Why is subsetting with NA using `[` so slow?
