Tips on differencing values by group in R data

Time: 2013-01-29 19:18:16

Tags: r dataframe

  

Possible duplicate:
  Beginner tips on using plyr to calculate year-over-year change across groups

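What is a good way to calculate the year-over-year difference (as a new variable) of an existing data frame variable (i.e. Sales) within groups defined by several variables (e.g. region and food type)?

Here is an example of the data frame structure: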

Date              Region    Type    Sales

1/1/2001    East    Food    120
1/1/2001    West    Housing 130
1/1/2001    North   Food    130
1/2/2001    East    Food    133
1/3/2001    West    Housing 140
1/4/2001    North   Food    150
….
….
1/29/2013   East    Food    125
1/29/2013   West    Housing 137
1/29/2013   North   Food    1350

In addition to differencing the data, I would also like to calculate a trailing (say 7-day) moving average.

Any guidance is much appreciated.

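1 answer:

Answer 0: (score: 3)

Here is something to get you started. data.table is a great package for this kind of task, because it offers a concise, easy-to-use syntax for grouped operations (once you get past the learning curve).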

library(data.table)

Create a reproducible example

set.seed(128)
regions = c("East", "West", "North", "South")
types = c("Food", "Housing")
dates <- seq(as.Date('2009-01-01'), as.Date('2011-12-31'), by = 1)
n <- length(dates)
dt <- data.table(Date = dates, 
                 Region = sample(regions, n, replace = TRUE),
                 Type = sample(types, n, replace = TRUE),
                 Sales = round(rnorm(n, mean = 100, sd = 10)))

Add a year column

dt[, Year := year(Date)]

> dt
            Date Region    Type Sales Year
   1: 2009-01-01   West    Food   119 2009
   2: 2009-01-02  North Housing   102 2009
   3: 2009-01-03  North Housing   102 2009
   4: 2009-01-04  North    Food   101 2009
   5: 2009-01-05   West    Food   101 2009
  ---
1091: 2011-12-27   East Housing   122 2011
1092: 2011-12-28   East Housing    88 2011
1093: 2011-12-29  North    Food   115 2011
1094: 2011-12-30   West Housing    96 2011
1095: 2011-12-31   East    Food   101 2011
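
The question also asks for a trailing 7-day moving average, which the steps here do not cover. As a rough sketch only, and assuming a data.table version that provides frollmean (1.12.0 or later), a grouped rolling mean could look like the following. The table daily, its values, and the column name Sales.MA7 are made up for illustration and are not part of the example above.

```r
library(data.table)

# Hypothetical toy data: exactly one Sales value per day for each Region
dates <- seq(as.Date("2009-01-01"), as.Date("2009-01-10"), by = 1)
daily <- data.table(Date   = rep(dates, times = 2),
                    Region = rep(c("East", "West"), each = length(dates)),
                    Type   = "Food",
                    Sales  = c(1:10, seq(10, 100, by = 10)))

# Sort by date within each group so the window always looks backwards in time
setorder(daily, Region, Type, Date)

# Mean of the current row and the 6 rows before it, computed per group;
# the first 6 rows of each group are NA because the window is incomplete
daily[, Sales.MA7 := frollmean(Sales, n = 7, align = "right"),
      by = .(Region, Type)]
```

Note that the window here counts rows, not calendar days, so it only equals a 7-day average when every group has one row per day. In the sampled dt above each date belongs to a single Region/Type pair, so the data would first need to be completed or aggregated to daily values per group.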

Compute the yearly summary

summary <- dt[, list(Sales = sum(Sales)), by = 'Year,Region,Type']
setkey(summary, 'Year')

> head(summary)
   Year Region    Type Sales
1: 2009   West    Food  4791
2: 2009  North Housing  3517
3: 2009  North    Food  6774
4: 2009  South Housing  4380
5: 2009   East    Food  4144
6: 2009   West Housing  4275

A function to compute the year-on-year difference for each region/product combination.

YoYdiff <- function(dt) {
  # Calculate year-on-year difference for Sales column
  data.table(Sales.Diff = diff(dt$Sales), Year = dt$Year[-1])
}

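Compute the year-on-year difference for each Region/Type group. This works for my example because setkey(summary, 'Year') sorts the summary table by year, so diff() always subtracts consecutive years. If your data is missing some years for some products/regions, you will have to be more careful, because diff() would then silently compare years that are not adjacent.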

> summary[, YoYdiff(.SD), by = 'Region,Type']
    Region    Type Sales.Diff Year
 1:   West    Food       -412 2010
 2:   West    Food        121 2011
 3:  North Housing       1907 2010
 4:  North Housing      -1457 2011
 5:  North    Food      -3087 2010
 6:  North    Food        369 2011
 7:  South Housing       -539 2010
 8:  South Housing        575 2011
 9:   East    Food       1264 2010
10:   East    Food      -1732 2011
11:   West Housing        298 2010
12:   West Housing       -410 2011
13:  South    Food       -889 2010
14:  South    Food       1045 2011
15:   East Housing       1146 2010
16:   East Housing       1169 2011
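
One possible way to handle the missing-year case mentioned above, sketched under assumptions rather than taken from the original answer: expand the summary to every Year/Region/Type combination with CJ(), join the totals onto that grid, and difference with shift(), so that a gap yields NA instead of a difference across non-adjacent years. The table totals and its numbers are hypothetical.

```r
library(data.table)

# Hypothetical yearly totals in which East/Food has no 2010 row
totals <- data.table(Year   = c(2009L, 2010L, 2011L, 2009L, 2011L),
                     Region = c("West", "West", "West", "East", "East"),
                     Type   = "Food",
                     Sales  = c(100, 120, 90, 200, 260))

# Every Year/Region/Type combination; joining the totals onto this grid
# turns a missing year into a row with Sales = NA
grid <- CJ(Year = 2009:2011, Region = c("East", "West"), Type = "Food")
full <- totals[grid, on = .(Year, Region, Type)]

# Difference against the previous year within each group;
# the result is NA whenever either of the two years is missing
setorder(full, Region, Type, Year)
full[, Sales.Diff := Sales - shift(Sales), by = .(Region, Type)]
```

With this layout the West rows get differences of 20 and -30, while every East difference is NA because 2010 is absent. Whether NA or some imputed value is the right answer for a gap depends on the analysis.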