R data.table: How to optimize the calculation of value differences between two data tables for each corresponding group?

Time: 2016-02-20 17:40:35

Tags: r data.table

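I have a lot of booking data (millions of rows) and want to calculate the change (difference = subtraction) in booking amounts between the same groups in different years, which are stored in two separate data tables.

I can do this with the great data.table package, as shown in the code below. But how can I optimize the code for performance and memory consumption? I am copying and modifying the tables, and there are several calculation steps that could possibly be done in one go.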

# Calculate value differences for the same group of data in two different data.tables
cur <- data.table(company=c("A", "B", "New"), booking.date=seq(from=as.Date("2011/01/01"), by="week", length.out=12), sales.amount = 201:212, vat.amount = 11:22)
cur

prev <- data.table(company=c("A", "B"), booking.date=seq(from=as.Date("2010/01/01"), by="month", length.out=10), sales.amount = 101:110, vat.amount = 1:10)
prev

diff <- copy(prev)   # copy to keep the original data.table unchanged
diff[, `:=`(sales.amount = -sales.amount, vat.amount = -vat.amount)]   # negate the amounts so that the sum will be the difference
diff <- rbind(diff, cur)  # combine negative previous amounts with positive current amounts so that the sum will be difference
diff  # show raw data
diff[, .(last.booking.date=max(booking.date), sales.amount.diff=sum(sales.amount), vat.amount.diff=sum(vat.amount)), by=company] # calculate the difference

# Look at company "A" to verify the result:
cur[company=="A",]
prev[company=="A",]

The sample data and the expected output look like this:

Data table 1: bookings of the current year:

> cur
    company booking.date sales.amount vat.amount
 1:       A   2011-01-01          201         11
 2:       B   2011-01-08          202         12
 3:     New   2011-01-15          203         13
 4:       A   2011-01-22          204         14
 5:       B   2011-01-29          205         15
 6:     New   2011-02-05          206         16
 7:       A   2011-02-12          207         17
 8:       B   2011-02-19          208         18
 9:     New   2011-02-26          209         19
10:       A   2011-03-05          210         20
11:       B   2011-03-12          211         21
12:     New   2011-03-19          212         22

Data table 2: bookings of the previous year:

> prev
    company booking.date sales.amount vat.amount
 1:       A   2010-01-01          101          1
 2:       B   2010-02-01          102          2
 3:       A   2010-03-01          103          3
 4:       B   2010-04-01          104          4
 5:       A   2010-05-01          105          5
 6:       B   2010-06-01          106          6
 7:       A   2010-07-01          107          7
 8:       B   2010-08-01          108          8
 9:       A   2010-09-01          109          9
10:       B   2010-10-01          110         10

Expected result (difference per company between the two booking years):

   company last.booking.date sales.amount.diff vat.amount.diff
1:       A        2011-03-05               297              37
2:       B        2011-03-12               296              36
3:     New        2011-03-19               830              70

2 answers:

Answer 0 (score: 5)

Nice approach by @Jaap.

An alternative to binding the original tables together could be:

# aggregate tables by company
cur_co <- cur[, .(last.booking.date = max(booking.date),
                  sales.amount = sum(sales.amount),
                  vat.amount   = sum(vat.amount)),
              by=company]

prev_co <- prev[, .(sales.amount = sum(sales.amount),
                    vat.amount = sum(vat.amount)),
                by=company]


# join & get difference
cur_co[prev_co, c("sales.amount.diff", "vat.amount.diff") :=
           .(sales.amount - i.sales.amount, vat.amount - i.vat.amount),
       on="company"]

# fill NAs (companies missing in the previous year)
cur_co[is.na(sales.amount.diff),
         c("sales.amount.diff", "vat.amount.diff") :=
           .(sales.amount, vat.amount)]

# drop unused columns
cur_co[, c("sales.amount", "vat.amount") := NULL]

This gives exactly the same output:

   company last.booking.date sales.amount.diff vat.amount.diff
1:       A        2011-03-05               297              37
2:       B        2011-03-12               296              36
3:     New        2011-03-19               830              70
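
As a hedged variation on the join above (a sketch, not part of the original answer), the separate NA-fill and column-drop steps can be folded into a single right join. It assumes the amounts are integers, as in the sample data, so that `fcoalesce(..., 0L)` matches the column type; `res` is a name introduced here for illustration only.

```r
library(data.table)

# Sample data from the question
cur <- data.table(company = c("A", "B", "New"),
                  booking.date = seq(from = as.Date("2011/01/01"), by = "week", length.out = 12),
                  sales.amount = 201:212, vat.amount = 11:22)
prev <- data.table(company = c("A", "B"),
                   booking.date = seq(from = as.Date("2010/01/01"), by = "month", length.out = 10),
                   sales.amount = 101:110, vat.amount = 1:10)

# Aggregate both tables by company, as in the answer
cur_co <- cur[, .(last.booking.date = max(booking.date),
                  sales.amount = sum(sales.amount),
                  vat.amount   = sum(vat.amount)),
              by = company]
prev_co <- prev[, .(sales.amount = sum(sales.amount),
                    vat.amount   = sum(vat.amount)),
                by = company]

# Right join: every row of cur_co is kept. Inside j, unprefixed names are
# prev_co's columns (NA for companies without a previous year) and the
# i.* names are cur_co's columns. fcoalesce() turns a missing previous
# amount into 0 (0L assumes integer amounts).
res <- prev_co[cur_co,
               .(company           = i.company,
                 last.booking.date = i.last.booking.date,
                 sales.amount.diff = i.sales.amount - fcoalesce(sales.amount, 0L),
                 vat.amount.diff   = i.vat.amount - fcoalesce(vat.amount, 0L)),
               on = "company"]
res
```

Unlike the update join, this leaves `cur_co` unchanged and returns a new table. If the real amounts are doubles, the fill value would need to be `0` instead of `0L`.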

Answer 1 (score: 4)

It is probably easiest to bind the original data tables together and then do the calculation:

# bind the data.table's together into one
dt.all <- rbindlist(list(cur,prev))
# set the key to 'company' and 'booking.date'
# the data.table is now also ordered by these two columns
setkey(dt.all, company, booking.date)

dt.all[, .(last.booking.date = booking.date[.N],
           sales.amount.diff = sum(sales.amount[year(booking.date)==2011]) - sum(sales.amount[year(booking.date)==2010]),
           vat.amount.diff = sum(vat.amount[year(booking.date)==2011]) - sum(vat.amount[year(booking.date)==2010])),
       company]

which gives:

   company last.booking.date sales.amount.diff vat.amount.diff
1:       A        2011-03-05               297              37
2:       B        2011-03-12               296              36
3:     New        2011-03-19               830              70
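
The code above hard-codes the years 2010 and 2011. A minimal sketch of one way around that, assuming it is acceptable to tag each row with its source table: the `idcol` argument of `rbindlist` adds such a tag. The column name `period` and the labels `cur` and `prev` are chosen here for illustration and are not part of the original answer.

```r
library(data.table)

# Sample data from the question
cur <- data.table(company = c("A", "B", "New"),
                  booking.date = seq(from = as.Date("2011/01/01"), by = "week", length.out = 12),
                  sales.amount = 201:212, vat.amount = 11:22)
prev <- data.table(company = c("A", "B"),
                   booking.date = seq(from = as.Date("2010/01/01"), by = "month", length.out = 10),
                   sales.amount = 101:110, vat.amount = 1:10)

# Bind the tables; the list names end up in the new "period" column
dt.all <- rbindlist(list(cur = cur, prev = prev), idcol = "period")

# Current sum minus previous sum per company; the sum of an empty
# selection is 0, so companies without previous bookings keep their
# full current amounts
res <- dt.all[, .(last.booking.date = max(booking.date),
                  sales.amount.diff = sum(sales.amount[period == "cur"]) -
                                      sum(sales.amount[period == "prev"]),
                  vat.amount.diff   = sum(vat.amount[period == "cur"]) -
                                      sum(vat.amount[period == "prev"])),
              by = company]
res
```

This should work for any pair of periods, since the grouping no longer depends on the calendar year of `booking.date`.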

When you have multiple years, a better approach might be:

dt.all[, .(last.booking.date = booking.date[.N],
           sum.sales = sum(sales.amount),
           sum.vat = sum(vat.amount)),
       .(company, year(booking.date))
       ][, `:=` (last.booking.date = last.booking.date[.N],
                 sales.amount.diff = sum.sales - shift(sum.sales),
                 vat.amount.diff = sum.vat - shift(sum.vat)),
         company][]

which gives:

   company year last.booking.date sum.sales sum.vat sales.amount.diff vat.amount.diff
1:       A 2010        2011-03-05       525      25                NA              NA
2:       A 2011        2011-03-05       822      62               297              37
3:       B 2010        2011-03-12       530      30                NA              NA
4:       B 2011        2011-03-12       826      66               296              36
5:     New 2011        2011-03-19       830      70                NA              NA

Adding fill = 0 to the shift arguments will result in:

   company year last.booking.date sum.sales sum.vat sales.amount.diff vat.amount.diff
1:       A 2010        2011-03-05       525      25               525              25
2:       A 2011        2011-03-05       822      62               297              37
3:       B 2010        2011-03-12       530      30               530              30
4:       B 2011        2011-03-12       826      66               296              36
5:     New 2011        2011-03-19       830      70               830              70
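
To illustrate what `shift` and its `fill` argument do here, a small sketch using the yearly sales sums of company "A" from the output above (the vector is typed in by hand for illustration):

```r
library(data.table)

# Yearly sales sums for company "A" (2010, 2011), taken from the output above
sum.sales <- c(525, 822)

# shift() lags the vector by one position; the first element has no
# predecessor and is NA by default
lag.na <- shift(sum.sales)

# With fill = 0 the missing predecessor is treated as 0 instead
lag.zero <- shift(sum.sales, fill = 0)

# Year-over-year differences under both settings
diff.na   <- sum.sales - lag.na
diff.zero <- sum.sales - lag.zero
```

With the default fill, the first year of each company has no difference (NA). With `fill = 0` the first year's difference equals its own sum, which is why the 2010 rows and the "New" row show their full amounts.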