Vectorized function

Date: 2018-05-02 10:10:49

Tags: r dataframe vector formatting dplyr

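I am trying to write a function that quantifies numeric input by order of magnitude, using SI prefixes and an optional suffix. For example: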

1        => 1
23.4     => 23.4
1001     => 1.0k
12345678 => 12.3MB (i.e., "B" is the suffix)

My first attempt mostly works:

format.quantified <- function(n, base = 1000, suffix = "", multiplier = c("k", "M", "G", "T", "P"), threshold = 0.8, sep = "") {
  # Return n, quantified by order of magnitude (relative to base,
  # defaulting to SI prefixes) to one decimal place (or exactly, for
  # non-quantified integers) with an optional suffix for units
  n <- as.numeric(n)
  exponent <- trunc(log(n, base = base))

  is.decimal <- n != trunc(n)

  # Move up to the next multiplier if we're close enough
  # FIXME This condition only applies to the first element, but the
  # increment will be applied to everything
  # if (n / (base ^ (exponent + 1)) >= threshold) { exponent <- exponent + 1 }

  paste(
    ifelse(exponent | is.decimal, sprintf("%.1f", n / (base ^ exponent)), n),
    paste(ifelse(exponent, multiplier[exponent], ""), suffix, sep = ""),
    sep = sep)
}

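However, it has two problems:

  1. I want to provide a threshold that, when crossed, quantifies the value at the next order of magnitude. This doesn't work in the code above because, being new to R, I was surprised to discover that my function is applied to the entire input at once (rather than row by row).

  2. With this newfound knowledge of vectorisation, it appears to suffer from a subtle reordering bug whose cause I can't determine: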

    > format.quantified(c(1,1000,1000000,1000000000))
    [1] "1"    "1.0M" "1.0G" "1.0k"
    

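    Interestingly, when applied to a data frame with dplyr's mutate function (e.g., mutate(data, foo = format.quantified(foo)))

    I tried to resolve this by embracing the vectorised input and processing everything accordingly: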
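One plausible cause of the reordering shown above, offered as a sketch rather than a confirmed diagnosis: in R, a zero index selects nothing, so `multiplier[exponent]` comes back one element short whenever an exponent is 0, and `ifelse()` then recycles that shorter vector positionally. The values below are hypothetical and only mirror the names used in the first version of the function.

```r
# Hypothetical illustration; `multiplier` and `exponent` mirror the names
# used in the first version of format.quantified.
multiplier <- c("k", "M", "G", "T", "P")
exponent   <- c(0, 1, 2, 3)

# A zero index selects nothing, so only three elements come back.
picked <- multiplier[exponent]
print(picked)

# ifelse() recycles `yes` to the length of `test` and reads it by position,
# so every prefix after the zero exponent is shifted by one slot.
shifted <- ifelse(exponent, picked, "")
print(shifted)

# One possible alternative (an assumption, not the author's code): prepend an
# empty prefix and index with exponent + 1 so each input keeps its own slot.
aligned <- c("", multiplier)[exponent + 1]
print(aligned)
```

With these inputs, `shifted` pairs the exponents 1, 2, 3 with "M", "G", "k", which is the same pattern as the output above, while `aligned` pairs them with "k", "M", "G".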

    library(dplyr)  # provides filter, mutate, bind_rows and %>%

    format.quantified <- (function() {
      # Use a closure to define the default SI magnitude prefixes
      prefix.default <- data.frame(exponent = c(0,  1,   2,   3,   4,   5,   6),
                                   prefix   = c("", "k", "M", "G", "T", "P", "E"))
    
      is.prefix  <- function(x) { is.data.frame(x) && all(c("exponent", "prefix") %in% colnames(x)) }
      is.decimal <- function(x) { x != trunc(x) }
    
      function(n, suffix = "", threshold = 0.8, base = 1000, prefix.alternative = NA, sep = "") {
        # Return n, quantified by order of magnitude (relative to base)
        # to one decimal place (or exactly, for non-quantified integers)
        # with an optional suffix for units
        prefix <- prefix.default
        if (is.prefix(prefix.alternative)) {
          prefix <- filter(prefix, !exponent %in% prefix.alternative$exponent) %>%
                    bind_rows(prefix.alternative)
        }
    
        q <- data.frame(n = as.numeric(n), exponent = trunc(log(n, base = base))) %>%
             mutate(quantified = n / (base ^ exponent)) %>%
             merge(prefix, by = "exponent", all.x = TRUE)
    
        # TODO Threshold logic using filter and recombining with bind_rows
    
        paste(
          ifelse(q$exponent | is.decimal(q$n), sprintf("%.1f", q$quantified), q$n),
          paste(q$prefix, suffix, sep = ""),
          sep = sep)
      }
    })()
    

    This appears to fix the reordering bug in the original version:

    > format.quantified(c(1,1000,1000000,1000000000))
    [1] "1"    "1.0k" "1.0M" "1.0G"
    > format.quantified(123)
    [1] "123"
    

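    However, before I could try implementing the threshold logic, I noticed that, when the function is applied to a data frame with mutate, the ordering of the output is completely messed up. On closer inspection, it turns out that the input order is not preserved whenever the input isn't already in numerical order: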

    > format.quantified(c(1000000,1,1000,1000000,1000000000))
    [1] "1e+06" "1.0k"  "1.0M"  "1.0M"  "1.0G" 
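A likely contributor to the lost ordering, sketched under assumptions rather than confirmed: base R's `merge()` sorts its result on the key columns by default (`sort = TRUE`), so rows come back ordered by `exponent` instead of by input position. The data frames below are hypothetical stand-ins that reuse the column names from the second version; the `id` column is an illustrative addition.

```r
# Hypothetical data; the column names follow the second version of the function.
prefix <- data.frame(exponent = c(0, 1, 2, 3),
                     prefix   = c("", "k", "M", "G"),
                     stringsAsFactors = FALSE)
q <- data.frame(n = c(1e6, 1, 1e3), exponent = c(2, 0, 1))

# merge() sorts on the key by default, so the rows are now ordered
# 1, 1e3, 1e6 rather than the input order 1e6, 1, 1e3.
merged <- merge(q, prefix, by = "exponent", all.x = TRUE)
print(merged$n)

# Option 1: look the prefix up with match(), which never reorders `q`.
q$prefix <- prefix$prefix[match(q$exponent, prefix$exponent)]
print(q$prefix)

# Option 2: carry a row id through the merge and restore the order afterwards.
q2 <- data.frame(id = seq_along(q$n), n = q$n, exponent = q$exponent)
restored <- merge(q2, prefix, by = "exponent", all.x = TRUE)
restored <- restored[order(restored$id), ]
print(restored$prefix)
```

Either option keeps each prefix next to the value it was computed for. As far as I know, dplyr's `left_join()` also keeps the rows of its first argument in order, which may fit better with the pipeline already in use.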
    

    What am I doing wrong?

    EDIT: FWIW, the threshold logic for the second version of my function is as follows:

    q <- data.frame(n = as.numeric(n), exponent = trunc(log(n, base = base))) %>%
         mutate(quantified = n / (base ^ exponent))
    
    q.below <- filter(q, quantified <  base * threshold)
    q.above <- filter(q, quantified >= base * threshold) %>%
               mutate(exponent = exponent + 1, quantified = quantified / base)
    
    Q <- bind_rows(q.below, q.above) %>%
         merge(prefix, by = "exponent", all.x = TRUE)
    

    Needless to say, this doesn't make the ordering problem any better!
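For the threshold itself, one way to avoid both the scalar `if` and the split-and-recombine step is to compute a logical vector and add it to the exponent elementwise. The following is a minimal sketch, not the author's code: `quantify_sketch` and its `prefixes` argument are hypothetical names, and it assumes every input is a positive number of at least 1.

```r
# Hypothetical helper, not the author's final code: quantify_sketch() applies
# the threshold elementwise instead of with a scalar `if`.
# Assumes all inputs are positive and at least 1.
quantify_sketch <- function(n, base = 1000, suffix = "", threshold = 0.8,
                            prefixes = c("", "k", "M", "G", "T", "P", "E")) {
  n <- as.numeric(n)
  exponent <- trunc(log(n, base = base))

  # Logical vector: TRUE where the value is close enough to the next magnitude.
  bump <- n / base^(exponent + 1) >= threshold
  exponent <- exponent + bump  # TRUE counts as 1, FALSE as 0

  scaled <- n / base^exponent
  is_decimal <- n != trunc(n)

  # One decimal place for quantified or fractional values, exact otherwise.
  number <- ifelse(exponent > 0 | is_decimal,
                   sprintf("%.1f", scaled),
                   as.character(n))

  # exponent + 1 keeps the zero exponent aligned with the empty prefix.
  paste0(number, prefixes[exponent + 1], suffix)
}

result <- quantify_sketch(c(1, 23.4, 1001, 12345678, 900), suffix = "B")
print(result)

# Input that is not in numerical order keeps its positions.
unordered <- quantify_sketch(c(12345678, 1, 1001))
print(unordered)
```

Because nothing is filtered, merged, or rebound, each output element stays at the position of its input, and 900 is promoted to "0.9kB" by the default threshold of 0.8.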

0 answers:

No answers