Running rounding

Time: 2016-05-01 17:50:15

Tags: r rounding running-total

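I am trying to round a column so that, within each group, the running total of the rounded values matches the running total of the original values.

The sample data for the task has three columns:

  • numbers - the values I need to round;
  • ids - defines the order of the values; for time series data this could be a date;
  • group - defines the group within which I need to round the numbers.

Here is a data sample, already sorted by ids within each group: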

       numbers  ids group
       35.07209 1   1
       27.50931 2   1
       70.62019 3   1
       99.55451 6   1
       34.40472 8   1
       17.58864 10  1
       93.66178 4   3
       83.21700 5   3
       63.89058 7   3
       88.96561 9   3

To generate sample data for testing, I use the following code:

  # Make data sample.
  x.size <- 10^6
  x <- list("numbers" = runif(x.size) * 100, "ids" = 1:x.size, "group" = ifelse(runif(x.size) > 0.2, 1, ifelse(runif(x.size) > 0.8, 2, 3)))
  x <- data.frame(x)
  x <- x[order(x$group), ]

I wrote a function that keeps the rounding state within a group, so that the total of the rounded values stays correct:

makeRunRound <- function() {
  # Data must be sorted by id.
  cumDiff <- 0
  savedId <- 0

  function(x, id) {
    # id here represents the group.

    if(id != savedId) {
      cumDiff <<- 0
      savedId <<- id
    }

    xInt <- floor(x)
    cumDiff <<- x - xInt + cumDiff

    if(cumDiff > 1) {
      xInt <- xInt + round(cumDiff)
      cumDiff <<- cumDiff - round(cumDiff)
    }
    return (xInt)
  }
}

runRound <- makeRunRound()

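This approach works, and I would be happy with it if it weren't for the speed.

It takes 2-3 seconds to complete the running rounding on a sample of 1 million records.

That is too slow for me, and there is another way explained in this question that is six times faster. I keep the code as given in josliber's answer: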

smartRound <- function(x) {
  y <- floor(x)
  indices <- tail(order(x-y), round(sum(x)) - sum(y))
  y[indices] <- y[indices] + 1
  y
}

Using the sample data generated by the code above, the benchmark:

# Code to benchmark speed.
library(microbenchmark)
library(data.table)
# x.dt is not defined anywhere in the post; it is assumed here to be
# a data.table copy of the sample data x.
x.dt <- as.data.table(x)
res <- microbenchmark(
  "run.df" = x$mrounded <- mapply(FUN=runRound, x$numbers, x$group),
  "run.dt" = u <- x.dt[, .(rounded = runRound(numbers, group)), by = .(group, ids)],
  "smart.df" = x$smart.round <- smartRound(x$numbers),
  "smart.dt"= smart.round.dt <- x.dt[, .(rounded = smartRound(numbers)), by = .(group)],
  "silly" = x$silly.round <- round(x$numbers),
  times = 50
)
print(res)
boxplot(res)

produces the following results:

[Figure: Benchmark for different rounding methods]

Unit: milliseconds
     expr        min         lq       mean     median         uq        max neval
   run.df 3475.69545 3827.13649 3994.09184 3967.27759 4179.67702 4472.18679    50
   run.dt 2449.05820 2633.52337 2895.51040 2881.87608 3119.42219 3617.67113    50
 smart.df  488.70854  537.03179  576.57704  567.63077  611.81271  861.76436    50
 smart.dt  390.35646  414.96749  468.95317  457.85820  507.54395  631.17081    50
    silly   13.72486   15.82744   19.41796   17.19057   18.85385   88.06329    50

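So the time goes from about 20 ms for plain cell-level rounding to about 2.6 s for the method that takes the running total of the rounded values within a group into account.

I have included a comparison of data.frame-based and data.table-based calculations to show that there is no major difference, even though data.table is slightly faster.

I really appreciate the simplicity and speed of smartRound, but it does not respect the order of the items, so the result differs from what I need.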
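To illustrate the ordering problem, here is a minimal sketch on a made-up vector (x.small, rounded and drift are hypothetical names, not part of my data). smartRound is repeated from above so the snippet runs on its own. The grand total comes out right, but the running total drifts by more than 1 early in the sequence, because the +1 adjustments go to the largest fractional parts regardless of position.

```r
# smartRound as given above, repeated so this sketch is self-contained.
smartRound <- function(x) {
  y <- floor(x)
  indices <- tail(order(x - y), round(sum(x)) - sum(y))
  y[indices] <- y[indices] + 1
  y
}

# Made-up values: small fractional parts first, large ones last.
x.small <- c(0.55, 0.56, 0.57, 0.58, 0.94, 0.95, 0.96, 0.97)

rounded <- smartRound(x.small)

# Deviation of the rounded running total from the original running total.
drift <- cumsum(rounded) - cumsum(x.small)

print(sum(rounded))      # overall total matches round(sum(x.small))
print(max(abs(drift)))   # but the running total is off by more than 1
```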

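Is there a way to:

  • either modify smartRound so that it gives the same results as runRound without losing its speed?
  • or modify runRound to improve its performance?
  • or is there another, better solution altogether?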

Edit

dww's answer gives the fastest solution:

diffRound <- function(x) { 
  diff(c(0, round(cumsum(x)))) 
}
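As a small worked example, the sketch below applies diffRound to the group 1 rows of the data sample shown earlier (group1 is just an illustrative name). Each rounded value is the step between consecutive rounded running totals, so the running total of the result equals the rounded running total of the input.

```r
# diffRound as given above, repeated so this sketch is self-contained.
diffRound <- function(x) {
  diff(c(0, round(cumsum(x))))
}

# Group 1 rows from the data sample, in ids order.
group1 <- c(35.07209, 27.50931, 70.62019, 99.55451, 34.40472, 17.58864)

rounded <- diffRound(group1)
print(rounded)
# 35 28 70 100 34 18

# The running total of the rounded values is the rounded running total.
print(all(cumsum(rounded) == round(cumsum(group1))))
```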

I have reduced the test to four options:

res <- microbenchmark(
  "silly" = x$silly.round <- round(x$numbers),
  "diff(dww)" = smart.round.dt <- x.dt[, .(rounded = diffRound(numbers)), by = .(group)] ,
  "smart.dt"= smart.round.dt <- x.dt[, .(rounded = smartRound(numbers)), by = .(group)],
  "run.dt" = u <- x.dt[, .(rounded = runRound(numbers, group)), by = .(group, ids)],
  times = 50
)

New results:

[Figure: Updated Benchmark]

Unit: milliseconds
      expr        min         lq       mean     median         uq        max neval
     silly   14.67823   16.64882   17.31416   16.83338   17.67497   22.48689    50
 diff(dww)   54.57762   70.11553   76.67135   71.37325   76.83717  139.18745    50
  smart.dt  392.83240  408.65768  456.46592  441.33212  492.67824  592.57723    50
    run.dt 2564.02724 2651.13994 2751.80516 2708.45317 2830.44553 3101.71005    50

Thanks to dww, I get a further 6x performance gain without losing precision.

1 answer:

Answer 0 (score: 2)

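I would do it this way, using simple vectorised base functions:

First calculate the running total of the original numbers, and the rounded value of that running total. Then find a list of numbers that adds up to this rounded running total, using diff() to see how much larger each rounded sum is than the previous one.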

cum.sum <- cumsum(x$numbers)
cum.sum.rounded <- round(cum.sum)
numbers.round <- diff(cum.sum.rounded)
numbers.round <- c(cum.sum.rounded[1], numbers.round)

Check that everything is as you wanted:

check.cs <- cumsum(numbers.round)
all( abs(check.cs - cum.sum) <=1 )
#TRUE
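The snippet above runs over the whole column at once. If the running total has to restart in each group, one possible sketch uses base R's ave() to apply the same steps per group. It assumes the rows are already sorted by ids within each group; sample.df and diffRound are illustrative names, with diffRound simply wrapping the cumsum/round/diff steps.

```r
# Wrap the cumsum / round / diff steps in one function (illustrative name).
diffRound <- function(x) {
  diff(c(0, round(cumsum(x))))
}

# The ten-row data sample from the question, already sorted by ids within group.
sample.df <- data.frame(
  numbers = c(35.07209, 27.50931, 70.62019, 99.55451, 34.40472,
              17.58864, 93.66178, 83.21700, 63.89058, 88.96561),
  ids     = c(1, 2, 3, 6, 8, 10, 4, 5, 7, 9),
  group   = c(1, 1, 1, 1, 1, 1, 3, 3, 3, 3)
)

# ave() applies FUN within each level of group and keeps the row order.
sample.df$rounded <- ave(sample.df$numbers, sample.df$group, FUN = diffRound)
print(sample.df$rounded)
# 35 28 70 100 34 18 94 83 64 89

# Per-group check: the rounded running total stays within 0.5 of the original.
drift <- ave(sample.df$rounded - sample.df$numbers, sample.df$group, FUN = cumsum)
print(all(abs(drift) <= 0.5))
```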