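I'm trying to write a function that quantifies numeric input by order of magnitude, using SI prefixes and an optional suffix. For example: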
1 => 1
23.4 => 23.4
1001 => 1.0k
12345678 => 12.3MB (i.e., "B" is the suffix)
My first attempt mostly works:
    format.quantified <- function(n, base = 1000, suffix = "",
                                  multiplier = c("k", "M", "G", "T", "P"),
                                  threshold = 0.8, sep = "") {
        # Return n, quantified by order of magnitude (relative to base,
        # defaulting to SI prefixes) to one decimal place (or exactly, for
        # non-quantified integers) with an optional suffix for units
        n <- as.numeric(n)
        exponent <- trunc(log(n, base = base))
        is.decimal <- n != trunc(n)

        # Move up to the next multiplier if we're close enough
        # FIXME This condition only applies to the first element, but the
        # increment will be applied to everything
        # if (n / (base ^ (exponent + 1)) >= threshold) { exponent <- exponent + 1 }

        paste(
            ifelse(exponent | is.decimal, sprintf("%.1f", n / (base ^ exponent)), n),
            paste(ifelse(exponent, multiplier[exponent], ""), suffix, sep = ""),
            sep = sep)
    }
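However, it has two problems:
I'd like to provide a threshold that, when crossed, quantifies the value at the next order of magnitude up. This doesn't work in the code above: I'm new to R, and I was surprised to discover that my function is applied to the whole input at once (rather than row by row).
Even with this newfound knowledge of vectorisation, it seems to suffer from a subtle reordering bug whose cause I can't pin down: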
> format.quantified(c(1,1000,1000000,1000000000))
[1] "1" "1.0M" "1.0G" "1.0k"
Interestingly, when applied to a data frame with dplyr's mutate function (e.g., mutate(data, foo = format.quantified(foo)))
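The shifted prefixes shown above are consistent with how R treats a zero index: indexing a vector with 0 selects nothing, so the result is shorter than the index vector and ifelse() then recycles it. The following is a minimal sketch that mirrors the multiplier and exponent names from the function; the values, and the names picked, shifted and aligned, are chosen purely for illustration, and the workaround at the end is only one possible approach.

```r
# Hypothetical minimal reproduction; names mirror the first attempt.
multiplier <- c("k", "M", "G", "T", "P")
exponent <- c(0, 1, 2, 3)

# Indexing with 0 selects nothing, so only three elements come back.
picked <- multiplier[exponent]
print(picked)

# ifelse() recycles the shorter vector, so each prefix lands one slot late.
shifted <- ifelse(exponent, picked, "")
print(shifted)

# One possible workaround: prepend an empty prefix and index with exponent + 1,
# which keeps the lookup the same length as the input.
aligned <- c("", multiplier)[exponent + 1]
print(aligned)
```

With this input, shifted reproduces the "M", "G", "k" pattern from the output above, while aligned keeps one prefix per input element.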
I tried to address this by embracing the vectorised input and processing everything accordingly:
    library(dplyr)  # filter(), mutate(), bind_rows() and %>% come from dplyr

    format.quantified <- (function() {
        # Use a closure to define the default SI magnitude prefixes
        prefix.default <- data.frame(exponent = c(0, 1, 2, 3, 4, 5, 6),
                                     prefix = c("", "k", "M", "G", "T", "P", "E"))
        is.prefix <- function(x) {
            is.data.frame(x) && all(c("exponent", "prefix") %in% colnames(x))
        }
        is.decimal <- function(x) { x != trunc(x) }

        function(n, suffix = "", threshold = 0.8, base = 1000,
                 prefix.alternative = NA, sep = "") {
            # Return n, quantified by order of magnitude (relative to base)
            # to one decimal place (or exactly, for non-quantified integers)
            # with an optional suffix for units
            prefix <- prefix.default
            if (is.prefix(prefix.alternative)) {
                prefix <- filter(prefix, !exponent %in% prefix.alternative$exponent) %>%
                    bind_rows(prefix.alternative)
            }

            q <- data.frame(n = as.numeric(n), exponent = trunc(log(n, base = base))) %>%
                mutate(quantified = n / (base ^ exponent)) %>%
                merge(prefix, by = "exponent", all.x = TRUE)

            # TODO Threshold logic using filter and recombining with bind_rows

            paste(
                ifelse(q$exponent | is.decimal(q$n), sprintf("%.1f", q$quantified), q$n),
                paste(q$prefix, suffix, sep = ""),
                sep = sep)
        }
    })()
This appears to fix the reordering bug from the original version:
> format.quantified(c(1,1000,1000000,1000000000))
[1] "1" "1.0k" "1.0M" "1.0G"
> format.quantified(123)
[1] "123"
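However, before I got round to implementing the threshold logic, I noticed that when the function is applied to a data frame with mutate, the ordering of the output is completely messed up. On closer inspection, it turns out that the ordering breaks whenever the input is not in numerical order: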
> format.quantified(c(1000000,1,1000,1000000,1000000000))
[1] "1e+06" "1.0k" "1.0M" "1.0M" "1.0G"
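This loss of input order is consistent with merge() sorting its result on the by column, which it does by default (sort = TRUE). The sketch below uses small made-up q and prefix tables (hypothetical data, not taken from the function above) to show that effect, followed by one order-preserving lookup based on match(). It illustrates the mechanism under those assumptions and is not a drop-in fix.

```r
# Hypothetical data; the column names mirror the second attempt.
q <- data.frame(n = c(1e6, 1, 1e3), exponent = c(2, 0, 1))
prefix <- data.frame(exponent = c(0, 1, 2), prefix = c("", "k", "M"),
                     stringsAsFactors = FALSE)

# merge() sorts the result on the "by" column by default,
# so the rows no longer line up with the original input.
merged <- merge(q, prefix, by = "exponent", all.x = TRUE)
print(merged$n)

# One order-preserving alternative: look each exponent up with match(),
# which returns positions in the same order as its first argument.
q$prefix <- prefix$prefix[match(q$exponent, prefix$exponent)]
print(q$prefix)
```

Here merged$n comes back in ascending exponent order, whereas the match() lookup returns one prefix per row in the original row order.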
What am I doing wrong?
EDIT: FWIW, the threshold logic for the second version of my function is as follows:
    q <- data.frame(n = as.numeric(n), exponent = trunc(log(n, base = base))) %>%
        mutate(quantified = n / (base ^ exponent))

    q.below <- filter(q, quantified < base * threshold)
    q.above <- filter(q, quantified >= base * threshold) %>%
        mutate(exponent = exponent + 1, quantified = quantified / base)

    Q <- bind_rows(q.below, q.above) %>%
        merge(prefix, by = "exponent", all.x = TRUE)
Needless to say, this doesn't make the ordering stability problem any better!
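For comparison, the threshold test can be written as a single vectorised comparison, which avoids splitting and recombining rows and so leaves the input order untouched. This is a minimal sketch, not the function above: the helper name quantify.exponent is hypothetical, and using floor() with a lower bound of 0 (instead of trunc()) is an assumption made for illustration.

```r
# Hypothetical helper: returns the exponent to use for each element of n.
quantify.exponent <- function(n, base = 1000, threshold = 0.8) {
    # Per-element order of magnitude; values below base get exponent 0.
    exponent <- pmax(floor(log(n, base = base)), 0)
    # Logical vector: TRUE where n is close enough to the next magnitude.
    bump <- n / base ^ (exponent + 1) >= threshold
    # TRUE/FALSE coerce to 1/0, so each element is bumped independently.
    exponent + bump
}

# Deliberately unsorted input: the result keeps the same element order.
result <- quantify.exponent(c(9e5, 1, 900, 1000, 2e6))
print(result)
```

Because the comparison yields one TRUE or FALSE per element, no if statement is needed and nothing is reordered.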