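I have a lot of booking data (several million rows) and want to calculate the change (difference = subtraction) in booking amounts between the same groups in two different years, which are stored in two separate data tables.
I can do this with the great data.table package, as shown in the code below, but how can I optimize the code (in terms of performance and memory consumption)? I am copying and modifying the data (tables), and there are several calculation steps that could probably be done in one go.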
# Calculate value differences for the same group of data in two different data.tables
cur <- data.table(company=c("A", "B", "New"), booking.date=seq(from=as.Date("2011/01/01"), by="week", length.out=12), sales.amount = 201:212, vat.amount = 11:22)
cur
prev <- data.table(company=c("A", "B"), booking.date=seq(from=as.Date("2010/01/01"), by="month", length.out=10), sales.amount = 101:110, vat.amount = 1:10)
prev
diff <- copy(prev) # copy to keep the original data.table unchanged
diff[, `:=`(sales.amount = -sales.amount, vat.amount = -vat.amount)] # negate the amounts so that the sum will be the difference
diff <- rbind(diff, cur) # combine negative previous amounts with positive current amounts so that the sum will be difference
diff # show raw data
diff[, .(last.booking.date=max(booking.date), sales.amount.diff=sum(sales.amount), vat.amount.diff=sum(vat.amount)), by=company] # calculate the difference
# Look at company "A" to verify the result:
cur[company=="A",]
prev[company=="A",]
The example data and the expected output look like this:
Data table 1: bookings of the current year:
> cur
company booking.date sales.amount vat.amount
1: A 2011-01-01 201 11
2: B 2011-01-08 202 12
3: New 2011-01-15 203 13
4: A 2011-01-22 204 14
5: B 2011-01-29 205 15
6: New 2011-02-05 206 16
7: A 2011-02-12 207 17
8: B 2011-02-19 208 18
9: New 2011-02-26 209 19
10: A 2011-03-05 210 20
11: B 2011-03-12 211 21
12: New 2011-03-19 212 22
Data table 2: bookings of the previous year:
> prev
company booking.date sales.amount vat.amount
1: A 2010-01-01 101 1
2: B 2010-02-01 102 2
3: A 2010-03-01 103 3
4: B 2010-04-01 104 4
5: A 2010-05-01 105 5
6: B 2010-06-01 106 6
7: A 2010-07-01 107 7
8: B 2010-08-01 108 8
9: A 2010-09-01 109 9
10: B 2010-10-01 110 10
Expected result (differences per company between the two booking years):
company last.booking.date sales.amount.diff vat.amount.diff
1: A 2011-03-05 297 37
2: B 2011-03-12 296 36
3: New 2011-03-19 830 70
Answer 0 (score: 5)
Nice approach from @Jaap.
An alternative to binding the original tables together could be:
# aggregate tables by company
cur_co <- cur[, .(last.booking.date = max(booking.date),
sales.amount = sum(sales.amount),
vat.amount = sum(vat.amount)),
by=company]
prev_co <- prev[, .(sales.amount = sum(sales.amount),
vat.amount = sum(vat.amount)),
by=company]
# join & get difference
cur_co[prev_co, c("sales.amount.diff", "vat.amount.diff") :=
.(sales.amount - i.sales.amount, vat.amount - i.vat.amount),
on="company"]
# fill NAs (companies missing in the previous year)
cur_co[is.na(sales.amount.diff),
c("sales.amount.diff", "vat.amount.diff") :=
.(sales.amount, vat.amount)]
# drop unused columns
cur_co[, c("sales.amount", "vat.amount") := NULL]
which gives exactly the same output:
company last.booking.date sales.amount.diff vat.amount.diff
1: A 2011-03-05 297 37
2: B 2011-03-12 296 36
3: New 2011-03-19 830 70
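The join and the NA fill above can also be folded into a single difference formula. The following is only a sketch of that variant, not a benchmarked improvement: it uses a left merge instead of the update join, and the column names prev.sales and prev.vat as well as the result name res are made up for illustration.

```r
library(data.table)

# Same sample data as in the question (rep() makes the recycling explicit)
cur <- data.table(company = rep(c("A", "B", "New"), 4),
                  booking.date = seq(as.Date("2011-01-01"), by = "week", length.out = 12),
                  sales.amount = 201:212, vat.amount = 11:22)
prev <- data.table(company = rep(c("A", "B"), 5),
                   booking.date = seq(as.Date("2010-01-01"), by = "month", length.out = 10),
                   sales.amount = 101:110, vat.amount = 1:10)

# aggregate both tables by company
cur_co <- cur[, .(last.booking.date = max(booking.date),
                  sales.amount = sum(sales.amount),
                  vat.amount = sum(vat.amount)),
              by = company]
# prev.sales / prev.vat are hypothetical names chosen for this sketch
prev_co <- prev[, .(prev.sales = sum(sales.amount),
                    prev.vat = sum(vat.amount)),
                by = company]

# left join: keep every company of the current year
res <- merge(cur_co, prev_co, by = "company", all.x = TRUE)

# companies missing in the previous year count as 0
res[is.na(prev.sales), prev.sales := 0L]
res[is.na(prev.vat), prev.vat := 0L]

# one formula for all rows, then drop the helper columns
res[, `:=`(sales.amount.diff = sales.amount - prev.sales,
           vat.amount.diff = vat.amount - prev.vat)]
res[, c("sales.amount", "vat.amount", "prev.sales", "prev.vat") := NULL]
res
```

Note that a left join (like the update join above) drops companies that appear only in prev. With all = TRUE instead of all.x = TRUE, merge would keep them as well, but their current totals and last.booking.date would then be NA and would need the same kind of fill.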
Answer 1 (score: 4)
Binding the original data tables together and then doing the calculation is probably the simplest approach:
# bind the data.table's together into one
dt.all <- rbindlist(list(cur,prev))
# set the key to 'company' and 'booking.date'
# the data.table is now also ordered by these two columns
setkey(dt.all, company, booking.date)
dt.all[, .(last.booking.date = booking.date[.N],
sales.amount.diff = sum(sales.amount[year(booking.date)==2011]) - sum(sales.amount[year(booking.date)==2010]),
vat.amount.diff = sum(vat.amount[year(booking.date)==2011]) - sum(vat.amount[year(booking.date)==2010])),
company]
which gives:
company last.booking.date sales.amount.diff vat.amount.diff
1: A 2011-03-05 297 37
2: B 2011-03-12 296 36
3: New 2011-03-19 830 70
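The code above hard-codes the years 2010 and 2011 and calls year() several times per group. If the two tables already define what "current" and "previous" mean, the source table can be tagged while binding instead. This is a minimal sketch under that assumption; the column name period and its labels "cur" and "prev" are chosen here for illustration only.

```r
library(data.table)

# Same sample data as in the question (rep() makes the recycling explicit)
cur <- data.table(company = rep(c("A", "B", "New"), 4),
                  booking.date = seq(as.Date("2011-01-01"), by = "week", length.out = 12),
                  sales.amount = 201:212, vat.amount = 11:22)
prev <- data.table(company = rep(c("A", "B"), 5),
                   booking.date = seq(as.Date("2010-01-01"), by = "month", length.out = 10),
                   sales.amount = 101:110, vat.amount = 1:10)

# idcol adds a column holding the list names, so every row remembers its source table
dt.all <- rbindlist(list(cur = cur, prev = prev), idcol = "period")

res <- dt.all[, .(last.booking.date = max(booking.date),
                  sales.amount.diff = sum(sales.amount[period == "cur"]) -
                                      sum(sales.amount[period == "prev"]),
                  vat.amount.diff = sum(vat.amount[period == "cur"]) -
                                    sum(vat.amount[period == "prev"])),
              by = company]
res
```

Because max(booking.date) is used instead of booking.date[.N], this variant does not depend on the table being keyed or sorted first.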
When you have multiple years, a better approach might be:
dt.all[, .(last.booking.date = booking.date[.N],
sum.sales = sum(sales.amount),
sum.vat = sum(vat.amount)),
.(company, year(booking.date))
][, `:=` (last.booking.date = last.booking.date[.N],
sales.amount.diff = sum.sales - shift(sum.sales),
vat.amount.diff = sum.vat - shift(sum.vat)),
company][]
which gives:
company year last.booking.date sum.sales sum.vat sales.amount.diff vat.amount.diff
1: A 2010 2011-03-05 525 25 NA NA
2: A 2011 2011-03-05 822 62 297 37
3: B 2010 2011-03-12 530 30 NA NA
4: B 2011 2011-03-12 826 66 296 36
5: New 2011 2011-03-19 830 70 NA NA
Adding fill = 0 to the arguments of shift results in:
company year last.booking.date sum.sales sum.vat sales.amount.diff vat.amount.diff
1: A 2010 2011-03-05 525 25 525 25
2: A 2011 2011-03-05 822 62 297 37
3: B 2010 2011-03-12 530 30 530 30
4: B 2011 2011-03-12 826 66 296 36
5: New 2011 2011-03-19 830 70 830 70
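One thing to keep in mind with shift: it compares each row with the previous row of the same company, whatever year that row belongs to. If a company has no bookings in some year, the difference would silently span the gap. The sketch below shows one possible guard; the yearly totals are invented for illustration (company "A" has no 2011 row), and prev.year is a hypothetical helper column.

```r
library(data.table)

# Hypothetical yearly totals; company "A" has no bookings in 2011
yearly <- data.table(company = c("A", "A", "A", "B", "B"),
                     year = c(2010L, 2012L, 2013L, 2010L, 2011L),
                     sum.sales = c(100L, 150L, 180L, 200L, 260L))

# shift() relies on row order, so sort explicitly
setorder(yearly, company, year)

yearly[, `:=`(prev.year = shift(year),
              sales.amount.diff = sum.sales - shift(sum.sales)),
       by = company]

# blank out differences that do not refer to the immediately preceding year
yearly[!is.na(prev.year) & year - prev.year != 1L, sales.amount.diff := NA_integer_]
yearly[, prev.year := NULL]
yearly
```

Here the 2012 row of company "A" ends up with NA rather than a difference against 2010, while the 2013 row and company "B" keep their year-over-year differences.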