R: using data.table for calculations that depend on previous rows

Time: 2018-04-19 21:06:38

Tags: r performance for-loop data.table rcpp

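I have multiple products, each with daily sales. I want to forecast the expected daily sales of these products based on each product's cumulative sales and the total sales I expect over a period of time.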

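The first table ("key") holds the expected total sales for each product and the quantity I forecast to sell each day, which depends on how much has already been sold. For example, if cumulative sales of product A are 650, I have sold 43% of the 1500 total, so I expect to sell 75 the next day, because 40% <= 43% < 60%.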
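
To make the lookup rule concrete, here is a minimal sketch of the product A example using base R's findInterval(). It assumes each Percent value in "key" is the upper bound of its band, which is how the 43% example reads; the variable names are illustrative only and not part of the original data.

```r
# Product A's rows from the question's `key` table
total_sales <- 1500
percent  <- seq(0.2, 1, 0.2)     # assumed upper bound of each band: 20%, 40%, ..., 100%
forecast <- seq(125, 25, -25)    # daily forecast inside each band

cum_sales <- 650
pct  <- cum_sales / total_sales            # about 0.43
band <- findInterval(pct, c(0, percent))   # 3, i.e. the 40%-60% band
next_day <- forecast[band]                 # 75
next_day
```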

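I want to update the cumulative sales for each product in the second table ("data") based on the forecast sales. The forecast quantity depends on the cumulative sales as of the previous period, which means I cannot compute each column independently, so I think I need to use a loop.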

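However, my database has more than 500,000 rows, and my best attempt with a for loop is too slow to be practical. Any thoughts? I think an Rcpp implementation could be a potential solution, but I have not used that package or C++ before. The desired final result is shown below ("final").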

library(data.table)
key <- data.table(
    Product = c(rep("A", 5), rep("B", 5)),
    TotalSales = c(rep(1500, 5), rep(750, 5)),
    Percent = rep(seq(0.2, 1, 0.2), 2),
    Forecast = c(seq(125, 25, -25), seq(75, 15, -15)))

data <- data.table(
    Date = rep(seq(1, 9, 1), 2),
    Product = rep(c("A", "B"), each = 9L),
    Time = rep(c(rep("Past", 4), rep("Future", 5)), 2),
    Sales = c(190, 165, 133, 120, 0, 0, 0, 0, 0,
              72, 58, 63, 51, 0, 0, 0, 0, 0))

final <- data.table(data,
    Cum = c(190, 355, 488, 608, 683, 758, 833, 908, 958,
            72, 130, 193, 244, 304, 349, 394, 439, 484),
    Percent.Actual = c(0.13, 0.24, 0.33, 0.41, 0.46, 0.51, 0.56, 0.61, 0.64,
                       0.10, 0.17, 0.26, 0.33, 0.41, 0.47, 0.53, 0.59, 0.65),
    Forecast = c(0, 0, 0, 0, 75, 75, 75, 75, 50,
                 0, 0, 0, 0, 60, 45, 45, 45, 45))

1 answer:

Answer 0 (score: 1)

Not sure if this is really going to help with your actual dataset given the size.

library(data.table)

#convert key into a list for fast lookup
keyLs <- lapply(split(key, by="Product"), 
    function(x) list(TotalSales=x[,TotalSales[1L]], 
                     Percent=x[,Percent], 
                     Forecast=x[,Forecast]))

#for each product, fold over the future periods with Reduce(): each step looks up the
#forecast for the current cumulative percentage and adds it to the running total
futureSales <- data[, {
        k <- keyLs[[as.character(.BY)]]
        list(Date=Date[Time=="Future"], 
            Cum=Reduce(function(x, y) {
                pct <- x / k$TotalSales
                #rightmost.closed=TRUE keeps pct == 1 in the last band instead of indexing past it
                res <- x + k$Forecast[findInterval(pct, c(0, k$Percent), rightmost.closed=TRUE)]
                #cap cumulative sales at the expected total
                if (res >= k$TotalSales) return(k$TotalSales)
                res
            },
            x=rep(0L, sum(Time=="Future")),
            init=sum(Sales[Time=="Past"]),
            accumulate=TRUE)[-1])
    },
    by=.(Product)]
futureSales 

#calculate other sales stats
futureSales[data, on=.(Date, Product)][,
    Cum := ifelse(is.na(Cum), cumsum(Sales), Cum),
    by=.(Product)][,
        ':=' (
            Percent.Actual = Cum / keyLs[[as.character(.BY)]]$TotalSales,
            Forecast = ifelse(Sales > 0, 0, c(0, diff(Cum)))
        ), by=.(Product)][]
#     Product Date Cum   Time Sales Percent.Actual Forecast
#  1:       A    1 190   Past   190      0.1266667        0
#  2:       A    2 355   Past   165      0.2366667        0
#  3:       A    3 488   Past   133      0.3253333        0
#  4:       A    4 608   Past   120      0.4053333        0
#  5:       A    5 683 Future     0      0.4553333       75
#  6:       A    6 758 Future     0      0.5053333       75
#  7:       A    7 833 Future     0      0.5553333       75
#  8:       A    8 908 Future     0      0.6053333       75
#  9:       A    9 958 Future     0      0.6386667       50
# 10:       B    1  72   Past    72      0.0960000        0
# 11:       B    2 130   Past    58      0.1733333        0
# 12:       B    3 193   Past    63      0.2573333        0
# 13:       B    4 244   Past    51      0.3253333        0
# 14:       B    5 304 Future     0      0.4053333       60
# 15:       B    6 349 Future     0      0.4653333       45
# 16:       B    7 394 Future     0      0.5253333       45
# 17:       B    8 439 Future     0      0.5853333       45
# 18:       B    9 484 Future     0      0.6453333       45
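
In case the Reduce() call above is hard to read, here is a much smaller toy sketch of the same pattern. The step function and the numbers are made up for illustration. The point is that with accumulate=TRUE every intermediate running value is kept, the vector passed as x only controls how many steps are taken, and [-1] drops the starting value.

```r
# Toy step: add 10 to the running value, capped at 35.
# The second argument is ignored, as in the answer's function(x, y).
step <- function(acc, unused) min(acc + 10, 35)

# Start from 5 and take four steps, keeping every intermediate value.
all_values <- Reduce(step, x = rep(0L, 4), init = 5, accumulate = TRUE)
out <- all_values[-1]   # drop the starting value 5
out
```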

You might also want to consider running your calculation in parallel by product.
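
A rough sketch of what running by product in parallel could look like with the base "parallel" package. This is not benchmarked. forecast_cum is a hypothetical helper name, the per-product inputs are copied by hand from the sample data, the 2-worker cluster is an arbitrary choice, and rightmost.closed = TRUE is my own assumption for handling a product that reaches 100% of its total.

```r
library(parallel)

# Hypothetical helper: roll cumulative sales forward n_future periods for one product.
forecast_cum <- function(start, n_future, total, percent, forecast) {
  cum <- numeric(n_future)
  current <- start
  for (i in seq_len(n_future)) {
    # band lookup on the share sold so far; percent holds the assumed upper bounds
    band <- findInterval(current / total, c(0, percent), rightmost.closed = TRUE)
    current <- min(current + forecast[band], total)  # cap at the expected total
    cum[i] <- current
  }
  cum
}

# Per-product inputs taken from the question's sample data
# (start = cumulative past sales, n_future = number of "Future" rows)
products <- list(
  A = list(start = 608, n_future = 5, total = 1500,
           percent = seq(0.2, 1, 0.2), forecast = seq(125, 25, -25)),
  B = list(start = 244, n_future = 5, total = 750,
           percent = seq(0.2, 1, 0.2), forecast = seq(75, 15, -15))
)

# Serial version, for comparison
serial <- lapply(products, function(p) do.call(forecast_cum, p))

# Parallel version: one task per product
cl <- makeCluster(2)
res <- parLapply(cl, products, function(p, f) do.call(f, p), f = forecast_cum)
stopCluster(cl)

res$A
res$B
```

With only two products the cluster start-up cost will dominate, so this would only be worth trying when there are many products and each one has a long future horizon.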