Question

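I'm trying to write a function that quantifies numeric input by order of magnitude, using SI prefixes and an optional suffix. For example: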

1        => 1
23.4     => 23.4
1001     => 1.0k
12345678 => 12.3MB (i.e., "B" is the suffix)

My first attempt mostly works:

format.quantified <- function(n, base = 1000, suffix = "", multiplier = c("k", "M", "G", "T", "P"), threshold = 0.8, sep = "") {
  # Return n, quantified by order of magnitude (relative to base,
  # defaulting to SI prefixes) to one decimal place (or exactly, for
  # non-quantified integers) with an optional suffix for units
  n <- as.numeric(n)
  exponent <- trunc(log(n, base = base))

  is.decimal <- n != trunc(n)

  # Move up to the next multiplier if we're close enough
  # FIXME This condition only applies to the first element, but the
  # increment will be applied to everything
  # if (n / (base ^ (exponent + 1)) >= threshold) { exponent <- exponent + 1 }

  paste(
    ifelse(exponent | is.decimal, sprintf("%.1f", n / (base ^ exponent)), n),
    paste(ifelse(exponent, multiplier[exponent], ""), suffix, sep = ""),
    sep = sep)
}

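However, it has two problems:

I want to provide a threshold which, when crossed, quantifies the value at the next order of magnitude. That logic is commented out above because it doesn't work: I'm new to R, and I was surprised to find that my function is applied to the entire input at once (rather than row by row).

With this newfound knowledge of vectorisation, it also appears to suffer from a subtle reordering bug whose cause I can't pin down: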

> format.quantified(c(1,1000,1000000,1000000000))
[1] "1"    "1.0M" "1.0G" "1.0k"
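
One possible cause, offered as a sketch rather than a confirmed diagnosis: R silently drops a 0 when it is used as an index, so `multiplier[exponent]` comes back one element short whenever an exponent is 0, and `ifelse` then recycles the shortened vector. The `padded` lookup below is a hypothetical workaround, not part of the original function.

```r
multiplier <- c("k", "M", "G", "T", "P")
exponent   <- c(0, 1, 2, 3)

# Indexing with 0 selects nothing, so only three elements come back
picked <- multiplier[exponent]
print(picked)

# Hypothetical workaround: prepend "" and shift the index by one,
# so that exponent 0 maps to the empty prefix
padded <- c("", multiplier)[exponent + 1]
print(padded)
```

With a three-element `yes` argument and a four-element test, `ifelse` wraps around to the start of `yes`, which would account for a "k" appearing in the last position.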

Interestingly, when applied to a data frame with dplyr's mutate function (e.g., mutate(data, foo = format.quantified(foo)))

I tried to solve this by embracing the vectorised input and handling everything accordingly:

library(dplyr)  # provides filter, mutate, bind_rows and %>%

format.quantified <- (function() {
  # Use a closure to define the default SI magnitude prefixes
  prefix.default <- data.frame(exponent = c(0,  1,   2,   3,   4,   5,   6),
                               prefix   = c("", "k", "M", "G", "T", "P", "E"))

  is.prefix  <- function(x) { is.data.frame(x) && all(c("exponent", "prefix") %in% colnames(x)) }
  is.decimal <- function(x) { x != trunc(x) }

  function(n, suffix = "", threshold = 0.8, base = 1000, prefix.alternative = NA, sep = "") {
    # Return n, quantified by order of magnitude (relative to base)
    # to one decimal place (or exactly, for non-quantified integers)
    # with an optional suffix for units
    prefix <- prefix.default
    if (is.prefix(prefix.alternative)) {
      prefix <- filter(prefix, !exponent %in% prefix.alternative$exponent) %>%
                bind_rows(prefix.alternative)
    }

    q <- data.frame(n = as.numeric(n), exponent = trunc(log(n, base = base))) %>%
         mutate(quantified = n / (base ^ exponent)) %>%
         merge(prefix, by = "exponent", all.x = TRUE)

    # TODO Threshold logic using filter and recombining with bind_rows

    paste(
      ifelse(q$exponent | is.decimal(q$n), sprintf("%.1f", q$quantified), q$n),
      paste(q$prefix, suffix, sep = ""),
      sep = sep)
  }
})()

This appears to fix the reordering bug from the original version:

> format.quantified(c(1,1000,1000000,1000000000))
[1] "1"    "1.0k" "1.0M" "1.0G"
> format.quantified(123)
[1] "123"

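However, before I tried implementing the threshold logic, I noticed that the ordering of the output was completely messed up when the function was applied to a data frame with mutate. On closer inspection, it turns out that stability is lost whenever the input isn't in numerical order: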

> format.quantified(c(1000000,1,1000,1000000,1000000000))
[1] "1e+06" "1.0k"  "1.0M"  "1.0M"  "1.0G"
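
A small sketch of one likely contributor, assuming base R's documented default of `sort = TRUE` in `merge`: the merged rows come back ordered by the `by` column rather than in input order. The toy data frames and the `match`-based lookup below are illustrative assumptions, not part of the function above.

```r
# Toy prefix table and input, with exponents written out by hand
prefix <- data.frame(exponent = c(0, 1, 2, 3),
                     prefix   = c("", "k", "M", "G"),
                     stringsAsFactors = FALSE)
q <- data.frame(n = c(1e6, 1, 1e3), exponent = c(2, 0, 1))

# merge() sorts on the key by default, so the rows no longer
# line up with the original input
merged <- merge(q, prefix, by = "exponent", all.x = TRUE)
print(merged$n)

# Hypothetical alternative: look the prefix up with match(),
# which returns one position per input element, in input order
looked.up <- prefix$prefix[match(q$exponent, prefix$exponent)]
print(looked.up)
```

Because `match` returns a position for every element of its first argument, the lookup never reorders anything, which matters when the result is assigned back to a column via mutate.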

What am I doing wrong?

Edit: FWIW, the threshold logic for my second version of the function is as follows:

q <- data.frame(n = as.numeric(n), exponent = trunc(log(n, base = base))) %>%
     mutate(quantified = n / (base ^ exponent))

q.below <- filter(q, quantified <  base * threshold)
q.above <- filter(q, quantified >= base * threshold) %>%
           mutate(exponent = exponent + 1, quantified = quantified / base)

Q <- bind_rows(q.below, q.above) %>%
     merge(prefix, by = "exponent", all.x = TRUE)

Needless to say, this doesn't make the order stability problem any better!
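
For comparison, a minimal sketch of threshold logic that never splits the data, assuming an element-wise `ifelse` is acceptable in place of the filter/bind_rows approach. The input values here are made up for illustration.

```r
# Hypothetical inputs and parameters, chosen only for illustration
n         <- c(1, 900, 1000, 850000)
base      <- 1000
threshold <- 0.8

exponent <- trunc(log(n, base = base))

# Element-wise test: is each value at least `threshold` of the
# next order of magnitude?
bump <- n / base^(exponent + 1) >= threshold

# ifelse() chooses per element, so only the flagged values move up
# and every result stays in its original position
exponent   <- ifelse(bump, exponent + 1, exponent)
quantified <- n / base^exponent

print(data.frame(n, exponent, quantified))
```

Since nothing is filtered out and recombined, the rows cannot change order; 900 and 850000 are promoted to the next magnitude while 1 and 1000 are left alone.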
