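I am trying to round a column in such a way that the running sum of the rounded values matches the running sum of the original values within a group.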
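The sample data for the task has three columns: numbers (the values to round), ids (which define the order of the values), and group (the group within which the numbers are rounded).
Here is a data sample, already sorted by id within each group: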
numbers ids group
35.07209 1 1
27.50931 2 1
70.62019 3 1
99.55451 6 1
34.40472 8 1
17.58864 10 1
93.66178 4 3
83.21700 5 3
63.89058 7 3
88.96561 9 3
To generate sample data for testing, I use the following code:
# Make data sample.
x.size <- 10^6
x <- list("numbers" = runif(x.size) * 100, "ids" = 1:x.size, "group" = ifelse(runif(x.size) > 0.2 ,1, ifelse(runif(x.size) > 0.8, 2, 3)))
x<- data.frame(x)
x <- x[order(x$group), ]
I wrote a function that keeps the rounding state within a group, to make sure the total of the rounded values is correct:
makeRunRound <- function() {
  # Data must be sorted by id.
  cumDiff <- 0
  savedId <- 0
  function(x, id) {
    # id here represents the group.
    if (id != savedId) {
      cumDiff <<- 0
      savedId <<- id
    }
    xInt <- floor(x)
    cumDiff <<- x - xInt + cumDiff
    if (cumDiff > 1) {
      xInt <- xInt + round(cumDiff)
      cumDiff <<- cumDiff - round(cumDiff)
    }
    return(xInt)
  }
}
runRound <- makeRunRound()
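As a minimal usage sketch, the closure can be applied row by row with mapply, provided the rows are visited in sorted order. The function is repeated here only to keep the snippet self-contained; the input is the ten sample rows shown earlier, and the variable names numbers, group and rounded are just for illustration.

```r
# Stateful running rounding, repeated from the question.
makeRunRound <- function() {
  cumDiff <- 0   # fractional remainder carried within the current group
  savedId <- 0   # group seen on the previous call
  function(x, id) {
    if (id != savedId) {   # new group: reset the carried remainder
      cumDiff <<- 0
      savedId <<- id
    }
    xInt <- floor(x)
    cumDiff <<- x - xInt + cumDiff
    if (cumDiff > 1) {
      xInt <- xInt + round(cumDiff)
      cumDiff <<- cumDiff - round(cumDiff)
    }
    xInt
  }
}
runRound <- makeRunRound()

# The ten sample rows, already sorted by id within group.
numbers <- c(35.07209, 27.50931, 70.62019, 99.55451, 34.40472, 17.58864,
             93.66178, 83.21700, 63.89058, 88.96561)
group   <- c(1, 1, 1, 1, 1, 1, 3, 3, 3, 3)

# mapply calls runRound once per row, in order, so the state carries forward.
rounded <- mapply(FUN = runRound, numbers, group)
print(rounded)
```

Because the closure keeps its state between calls, it is probably safest to create a fresh one with makeRunRound() before each independent pass over the data.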
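This approach works, and I would be happy with it if it were not for the speed.
It takes 2-3 seconds to complete the running rounding on a sample of 1 million records.
That is too long for me, and there is another way, explained in this question, that is six times faster. I kept the code as given in josliber's answer: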
smartRound <- function(x) {
  y <- floor(x)
  indices <- tail(order(x - y), round(sum(x)) - sum(y))
  y[indices] <- y[indices] + 1
  y
}
Benchmarking with the sample data generated by the code above:
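# Code to benchmark speed.
library(microbenchmark)
library(data.table)
# x.dt is used below but is not defined in the post; it is assumed here
# to be a data.table copy of x.
x.dt <- data.table(x)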
res <- microbenchmark(
"run.df" = x$mrounded <- mapply(FUN=runRound, x$numbers, x$group),
"run.dt" = u <- x.dt[, .(rounded = runRound(numbers, group)), by = .(group, ids)],
"smart.df" = x$smart.round <- smartRound(x$numbers),
"smart.dt"= smart.round.dt <- x.dt[, .(rounded = smartRound(numbers)), by = .(group)],
"silly" = x$silly.round <- round(x$numbers),
times = 50
)
print(res)
boxplot(res)
This gives the following results:
Unit: milliseconds
expr min lq mean median uq max neval
run.df 3475.69545 3827.13649 3994.09184 3967.27759 4179.67702 4472.18679 50
run.dt 2449.05820 2633.52337 2895.51040 2881.87608 3119.42219 3617.67113 50
smart.df 488.70854 537.03179 576.57704 567.63077 611.81271 861.76436 50
smart.dt 390.35646 414.96749 468.95317 457.85820 507.54395 631.17081 50
silly 13.72486 15.82744 19.41796 17.19057 18.85385 88.06329 50
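So the time goes from roughly 20 ms for cell-level rounding to 2.6 s for the method that takes the running total of the rounded values within a group into account.
I have included a comparison of the data.frame and data.table based calculations to show that there is no major difference, even though data.table is slightly faster.
I really appreciate the simplicity and speed of smartRound, but it does not respect the order of the items, so the result will differ from what I need.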
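To illustrate the difference, here is a small sketch that applies smartRound to the six group 1 values from the sample. The names v and s are only for illustration. The total is preserved, but the running total is not kept within 0.5 of the original at every position.

```r
# smartRound, repeated from josliber's answer as quoted in the question.
smartRound <- function(x) {
  y <- floor(x)
  indices <- tail(order(x - y), round(sum(x)) - sum(y))
  y[indices] <- y[indices] + 1
  y
}

# Group 1 values from the sample, in id order.
v <- c(35.07209, 27.50931, 70.62019, 99.55451, 34.40472, 17.58864)
s <- smartRound(v)

print(s)                                 # rounded values
print(sum(s) == round(sum(v)))           # the overall total is preserved
print(max(abs(cumsum(s) - cumsum(v))))   # largest drift of the running total
```

Here the extra units go to the three values with the largest fractional parts regardless of position, so after the second value the running total is 62 while the original running total is about 62.58.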
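Is there a way to either:
modify smartRound so that it gives the same results as runRound without losing efficiency, or
modify runRound to improve its efficiency?
Edit: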
dww's answer gives the fastest solution:
diffRound <- function(x) {
  diff(c(0, round(cumsum(x))))
}
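As a quick sanity check, here is a sketch that runs diffRound on the six group 1 values from the sample. The names v and r are only for illustration. Since each output is the difference between consecutive rounded running totals, the running total of the result equals the rounded running total of the input.

```r
# diffRound, as defined above.
diffRound <- function(x) {
  diff(c(0, round(cumsum(x))))
}

# Group 1 values from the sample, in id order.
v <- c(35.07209, 27.50931, 70.62019, 99.55451, 34.40472, 17.58864)
r <- diffRound(v)

print(r)                                  # 35 28 70 100 34 18
print(all(cumsum(r) == round(cumsum(v)))) # running totals match after rounding
print(max(abs(cumsum(r) - cumsum(v))))    # never more than 0.5
```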
I have reduced the test to four options:
res <- microbenchmark(
"silly" = x$silly.round <- round(x$numbers),
"diff(dww)" = smart.round.dt <- x.dt[, .(rounded = diffRound(numbers)), by = .(group)] ,
"smart.dt"= smart.round.dt <- x.dt[, .(rounded = smartRound(numbers)), by = .(group)],
"run.dt" = u <- x.dt[, .(rounded = runRound(numbers, group)), by = .(group, ids)],
times = 50
)
New results:
Unit: milliseconds
expr min lq mean median uq max neval
silly 14.67823 16.64882 17.31416 16.83338 17.67497 22.48689 50
diff(dww) 54.57762 70.11553 76.67135 71.37325 76.83717 139.18745 50
smart.dt 392.83240 408.65768 456.46592 441.33212 492.67824 592.57723 50
run.dt 2564.02724 2651.13994 2751.80516 2708.45317 2830.44553 3101.71005 50
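Thanks to dww, I get a 6x performance gain without losing precision.
Answer 0 (score: 2)
I would do it like this, using simple vectorised base functions:
First compute the running total of the original numbers, and the rounded value of that running total. Then use diff() to find a list of numbers that add up to this rounded running total, by looking at how much larger each rounded sum is than the previous one.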
cum.sum <- cumsum(x$numbers)
cum.sum.rounded <- round(cum.sum)
numbers.round <- diff(cum.sum.rounded)
numbers.round <- c(cum.sum.rounded[1], numbers.round)
Check that everything is as you want it:
check.cs <- cumsum(numbers.round)
all( abs(check.cs - cum.sum) <=1 )
#TRUE
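The snippet above works on the whole column at once. If the running total should restart in each group, as in the question, one possible base R sketch uses ave() to apply the same idea per group. The data frame below holds the sample rows from the question, and the names diffRound, rounded and err are assumptions for illustration, not part of the answer's code.

```r
# Sample rows from the question (ids omitted, rows already in id order).
x <- data.frame(
  numbers = c(35.07209, 27.50931, 70.62019, 99.55451, 34.40472, 17.58864,
              93.66178, 83.21700, 63.89058, 88.96561),
  group   = c(1, 1, 1, 1, 1, 1, 3, 3, 3, 3)
)

# Difference of the rounded running total, including the first element.
diffRound <- function(v) diff(c(0, round(cumsum(v))))

# ave() applies FUN within each group and returns values in the original row order.
x$rounded <- ave(x$numbers, x$group, FUN = diffRound)

# Per-group check: the running totals should differ by at most 0.5.
err <- ave(x$rounded, x$group, FUN = cumsum) - ave(x$numbers, x$group, FUN = cumsum)
print(x)
print(all(abs(err) <= 0.5))
```

With data.table, the equivalent would be a by = group call, as in the benchmark in the question.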