Aggregating a data.table with sum, length and grep

Time: 2017-05-15 09:55:59

Tags: r grep data.table aggregate

Let's create a data.table:

library(data.table)

dt <- data.table(x.1=1:8, x.2=1:8, x.3=2:9, vessel=rep(letters[1:2], each=4), Year=rep(2012:2015, 2))
dt
   x.1 x.2 x.3 vessel Year
1:   1   1   2      a 2012
2:   2   2   3      a 2013
3:   3   3   4      a 2014
4:   4   4   5      a 2015
5:   5   5   6      b 2012
6:   6   6   7      b 2013
7:   7   7   8      b 2014
8:   8   8   9      b 2015

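I can aggregate it with the functions sum and length to get the sum of each x column per year and the number of unique vessels per year, like this: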

dt[, list(x.1 = sum(x.1),
          x.2 = sum(x.2),
          x.3 = sum(x.3),
          vessels = length(unique(vessel))),
   by = list(Year = Year)]

   Year x.1 x.2 x.3 vessels
1: 2012   6   6   8       2
2: 2013   8   8  10       2
3: 2014  10  10  12       2
4: 2015  12  12  14       2

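That is what I want, but my real data has many columns, so I would like to use grep or %like% and I can't get it to work. I was thinking of something along these lines: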

dt[, grep("x", colnames(dt)), with = FALSE]

But how do I combine that with the aggregation?

4 Answers:

Answer 0: (score: 5)

You can use lapply to apply a function over all columns (.SD) or over a subset of columns (selected with .SDcols):

dt[, lapply(.SD, sum), by=Year, .SDcols=c("x.1","x.2")]

The following also works; it selects all columns whose names start with "x" and adds the number of unique vessels per year:

dt[, c(lapply(.SD, sum), vessel=uniqueN(vessel)),
    by=Year,
    .SDcols=grepl("^x", names(dt))
]
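
As a minimal sketch of the same idea, the selected columns can be collected first as a character vector with grep(value = TRUE) and then passed to .SDcols. The helper name x_cols and the stricter pattern "^x\\." are assumptions made for this illustration and are not part of the answer above.

```r
library(data.table)

dt <- data.table(x.1 = 1:8, x.2 = 1:8, x.3 = 2:9,
                 vessel = rep(letters[1:2], each = 4),
                 Year = rep(2012:2015, 2))

# Hypothetical helper: names of all columns that start with "x."
x_cols <- grep("^x\\.", names(dt), value = TRUE)

# Sum the selected columns and count the distinct vessels within each year
res <- dt[, c(lapply(.SD, sum), list(vessels = uniqueN(vessel))),
          by = Year, .SDcols = x_cols]
res
```

Keeping the column names in a variable makes it easy to inspect what the pattern matched before aggregating. Newer data.table releases (1.12.0 and later, as far as I know) also accept .SDcols = patterns("^x") directly.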

Answer 1: (score: 1)

If you are aggregating many columns, it may be worth using melt() to reshape the data from wide to long format and then dcast() to aggregate:

molten <- melt(dt, id.vars = c("Year", "vessel"))

molten
#    Year vessel variable value
# 1: 2012      a      x.1     1
# 2: 2013      a      x.1     2
# 3: 2014      a      x.1     3
# 4: 2015      a      x.1     4
# 5: 2012      b      x.1     5
# ...
#19: 2014      a      x.3     4
#20: 2015      a      x.3     5
#21: 2012      b      x.3     6
#22: 2013      b      x.3     7
#23: 2014      b      x.3     8
#24: 2015      b      x.3     9
#    Year vessel variable value

dcast(molten, Year ~ variable, sum)
#   Year x.1 x.2 x.3
#1: 2012   6   6   8
#2: 2013   8   8  10
#3: 2014  10  10  12
#4: 2015  12  12  14 

Now, the number of vessels per year:

dt[, .(vessels = uniqueN(vessel)), Year]
#   Year vessels
#1: 2012       2
#2: 2013       2
#3: 2014       2
#4: 2015       2

Finally, the two results need to be combined with a join:

dcast(molten, Year ~ variable, sum)[dt[, .(vessels = uniqueN(vessel)), Year], on = "Year"]
#   Year x.1 x.2 x.3 vessels
#1: 2012   6   6   8       2
#2: 2013   8   8  10       2
#3: 2014  10  10  12       2
#4: 2015  12  12  14       2

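Hints

  • The measure.vars argument of melt() lets you define, select or restrict the relevant measure columns.
  • The subset argument of dcast() lets you select or exclude specific measure variables.
  • You can use several aggregation functions in dcast().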

This allows some fancy things, such as:

dcast(molten, Year ~ variable, list(mean, sum, max), subset = .(variable == "x.2")
      )[dt[, .(vessels = uniqueN(vessel)), Year], on = "Year"]
#   Year value_mean_x.2 value_sum_x.2 value_max_x.2 vessels
#1: 2012              3             6             5       2
#2: 2013              4             8             6       2
#3: 2014              5            10             7       2
#4: 2015              6            12             8       2
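
The measure.vars hint can be tied back to the grep idea from the question. The following is only a sketch under assumptions: the helper name x_cols and the pattern "^x\\." are introduced here for illustration. Because vessel is listed neither as an id column nor as a measure column, melt() drops it, so this sketch produces only the sums; the vessel count would still be joined on as shown earlier.

```r
library(data.table)

dt <- data.table(x.1 = 1:8, x.2 = 1:8, x.3 = 2:9,
                 vessel = rep(letters[1:2], each = 4),
                 Year = rep(2012:2015, 2))

# Hypothetical helper: measure columns chosen by name pattern
x_cols <- grep("^x\\.", names(dt), value = TRUE)

# Reshape only the selected columns to long format
molten_x <- melt(dt, id.vars = "Year", measure.vars = x_cols)

# Aggregate back to one row per year, one column per measure variable
wide <- dcast(molten_x, Year ~ variable,
              fun.aggregate = sum, value.var = "value")
wide
```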

Answer 2: (score: 0)

If you really need to do this for the sake of efficiency:

dt[, .SD
     ][, .N, .(vessel, Year)
     ][, .N, .(Year)
     ][, copy(dt)[.SD, vessels := i.N, on='Year']
     ][, vessel := NULL
     ][, melt(.SD, id.vars=c('Year', 'vessels'))
     ][, .(value=sum(value)), .(Year, vessels, variable)
     ][, dcast(.SD, ... ~ variable, value.var='value')
     ][, setcolorder(.SD, c(setdiff(colnames(.SD), 'vessels'), 'vessels'))
     ][order(Year)
     ]

   Year x.1 x.2 x.3 vessels
1: 2012   6   6   8       2
2: 2013   8   8  10       2
3: 2014  10  10  12       2
4: 2015  12  12  14       2

Answer 3: (score: -1)

I don't fully understand your problem, but what you want to do with grep can be done like this:

dt <- data.frame(x.1=1:8, x.2=1:8, x.3=2:9, vessel=rep(letters[1:2], each=4), Year=rep(2012:2015, 2))
dt
# grepl() is vectorized, so it can be applied to all column names at once
dt[grepl("x", colnames(dt))]

Then you can do whatever you want on the filtered data set.
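
One possible way to continue from the filtered data.frame, sketched here with base R's aggregate() and merge(), is shown below. This is not part of the answer above; the names df, x_cols, sums, vessels and result are assumptions chosen for the illustration.

```r
df <- data.frame(x.1 = 1:8, x.2 = 1:8, x.3 = 2:9,
                 vessel = rep(letters[1:2], each = 4),
                 Year = rep(2012:2015, 2))

# Hypothetical helper: names of the columns that start with "x."
x_cols <- grep("^x\\.", names(df), value = TRUE)

# Sum the selected columns within each year
sums <- aggregate(df[x_cols], by = list(Year = df$Year), FUN = sum)

# Count the distinct vessels within each year
vessels <- aggregate(vessel ~ Year, data = df,
                     FUN = function(v) length(unique(v)))

# Combine both summaries into one table keyed by Year
result <- merge(sums, vessels, by = "Year")
result
```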