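I have a list of customers from whom we have earned revenue over the past few financial years through two channels (Online and Offline). I want a variable that shows, for each customer, the previous year's total revenue (online + offline).
Sample data is shown below, with the required variable highlighted in yellow. The calculation is shown in the adjacent column.
I tried grouping by CustomerID and Fin Year, computing the sum of revenue, and using the lag() function to get the previous year's total revenue, but it did not work.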
df %>%
  group_by(CustomerID, FinYear) %>%
  mutate(yearly_totalRevenue = sum(Revenue)) %>%
  mutate(lastyear_totalRevenue = lag(yearly_totalRevenue)) %>%
  ungroup()
Note: since the data size is in the 10M range, memory-efficient code (preferably using data.table) would be highly appreciated.
Thanks.
Edit1: Added dput() of the sample data.
structure(list(CustomerID = c("Cust1", "Cust2", "Cust3", "Cust4",
"Cust5", "Cust1", "Cust2", "Cust3", "Cust4", "Cust5"), `Fin Year` = c("2010/11",
"2011/12", "2012/13", "2013/14", "2014/15", "2010/11", "2011/12",
"2012/13", "2013/14", "2014/15"), Channel = c("Online", "Online",
"Online", "Online", "Online", "Offline", "Offline", "Offline",
"Offline", "Offline"), Revenue = c(858, 733, 248, 541, 222, 316,
412, 167, 385, 654)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
Answer 0 (score: 1)
You can try:
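library(data.table)
# Assumes the `Fin Year` column has been renamed to FinYear, as in the output below.
setDT(df)[, yearly_totalRevenue := sum(Revenue), .(CustomerID, FinYear)][,
  lastyear_totalRevenue := shift(yearly_totalRevenue), .(rowid(CustomerID))]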
Output:
    CustomerID FinYear Channel Revenue yearly_totalRevenue lastyear_totalRevenue
 1:      Cust1 2010/11  Online     858                1174                    NA
 2:      Cust2 2011/12  Online     733                1145                  1174
 3:      Cust3 2012/13  Online     248                 415                  1145
 4:      Cust4 2013/14  Online     541                 926                   415
 5:      Cust5 2014/15  Online     222                 876                   926
 6:      Cust1 2010/11 Offline     316                1174                    NA
 7:      Cust2 2011/12 Offline     412                1145                  1174
 8:      Cust3 2012/13 Offline     167                 415                  1145
 9:      Cust4 2013/14 Offline     385                 926                   415
10:      Cust5 2014/15 Offline     654                 876                   926
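
Note that in the sample data every customer appears in only one financial year, so the result above takes the previous row's total within each rowid(CustomerID) group, which belongs to a different customer. The dplyr attempt in the question fails for a related reason: lag() runs inside each (CustomerID, FinYear) group, so it never sees an earlier year. If the real data has several years per customer, a per-customer lag may be closer to what is wanted. Below is a minimal sketch of that idea, not a tested solution for the 10M data: the table dt and its values are hypothetical, the year column is assumed to be named FinYear, and FinYear strings are assumed to sort chronologically.

```r
library(data.table)

# Hypothetical data: two customers, two financial years, two channels each.
dt <- data.table(
  CustomerID = rep(c("Cust1", "Cust2"), each = 4),
  FinYear    = rep(rep(c("2010/11", "2011/12"), each = 2), times = 2),
  Channel    = rep(c("Online", "Offline"), times = 4),
  Revenue    = c(858, 316, 733, 412, 248, 167, 541, 385)
)

# 1. Collapse to one row per customer and year.
yearly <- dt[, .(yearly_totalRevenue = sum(Revenue)), by = .(CustomerID, FinYear)]

# 2. Sort so shift() sees the years in order, then lag within each customer.
setorder(yearly, CustomerID, FinYear)
yearly[, lastyear_totalRevenue := shift(yearly_totalRevenue), by = CustomerID]

# 3. Update join: copy both totals back onto the original rows by reference.
dt[yearly, on = .(CustomerID, FinYear),
   `:=`(yearly_totalRevenue   = i.yearly_totalRevenue,
        lastyear_totalRevenue = i.lastyear_totalRevenue)]
```

With this hypothetical input, lastyear_totalRevenue is NA for each customer's 2010/11 rows, 1174 for Cust1's 2011/12 rows and 415 for Cust2's 2011/12 rows. The lag is taken on the small per-year table and joined back with :=, so the large table is modified in place rather than copied. One caveat: shift() returns the previous available year for a customer, so if a customer skips a year, the value comes from the last year they had revenue and not strictly from the preceding financial year.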