Let's create a data.table:
library(data.table)
dt <- data.table(x.1=1:8, x.2=1:8, x.3=2:9, vessel=rep(letters[1:2], each=4), Year=rep(2012:2015, 2))
dt
x.1 x.2 x.3 vessel Year
1: 1 1 2 a 2012
2: 2 2 3 a 2013
3: 3 3 4 a 2014
4: 4 4 5 a 2015
5: 5 5 6 b 2012
6: 6 6 7 b 2013
7: 7 7 8 b 2014
8: 8 8 9 b 2015
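I can summarise it with the functions length and sum to get the sum of every x column per year and the number of unique vessels per year, like this: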
dt[,
list(
x.1=sum(x.1),
x.2=sum(x.2),
x.3=sum(x.3),
vessels=length(unique(vessel))),
by=list(Year=Year)]
Year x.1 x.2 x.3 vessels
1: 2012 6 6 8 2
2: 2013 8 8 10 2
3: 2014 10 10 12 2
4: 2015 12 12 14 2
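This is what I want, but my real data has many columns, so I would like to use grep or %like%, and I can't get it to work. I was thinking of something along these lines:
dt[, grep("x", colnames(dt)), with = FALSE]
But how do I combine that with the aggregation?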
Answer 0 (score: 5)
You can use lapply to apply a function over all columns (.SD) or over a selection of columns (chosen with .SDcols):
dt[, lapply(.SD, sum), by=Year, .SDcols=c("x.1","x.2")]
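The following can also be used to select all columns whose names start with "x", while counting the unique vessels in the same call: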
dt[, c(lapply(.SD, sum), vessel=uniqueN(vessel)),
by=Year,
.SDcols=grepl("^x", names(dt))
]
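As a minimal sketch of the same idea, the column names can be computed up front with grep(value = TRUE) and passed to .SDcols as a character vector. The names x_cols and res are illustrative only, and the data is the example table from the question:

```r
library(data.table)

dt <- data.table(x.1 = 1:8, x.2 = 1:8, x.3 = 2:9,
                 vessel = rep(letters[1:2], each = 4),
                 Year = rep(2012:2015, 2))

# Names of all columns that start with "x" (illustrative helper variable).
x_cols <- grep("^x", names(dt), value = TRUE)

# Sum every selected column per year; vessel is not in .SDcols,
# but it can still be referenced by name inside j.
res <- dt[, c(lapply(.SD, sum), list(vessels = uniqueN(vessel))),
          by = Year, .SDcols = x_cols]
res
```

Keeping the selection in a separate variable makes it easy to inspect which columns matched before aggregating.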
Answer 1 (score: 1)
If you are aggregating many columns, it may be worth using melt() to reshape the data from wide to long format and dcast() to aggregate:
molten <- melt(dt, id.vars = c("Year", "vessel"))
molten
# Year vessel variable value
# 1: 2012 a x.1 1
# 2: 2013 a x.1 2
# 3: 2014 a x.1 3
# 4: 2015 a x.1 4
# 5: 2012 b x.1 5
# ...
#19: 2014 a x.3 4
#20: 2015 a x.3 5
#21: 2012 b x.3 6
#22: 2013 b x.3 7
#23: 2014 b x.3 8
#24: 2015 b x.3 9
# Year vessel variable value
dcast(molten, Year ~ variable, sum)
# Year x.1 x.2 x.3
#1: 2012 6 6 8
#2: 2013 8 8 10
#3: 2014 10 10 12
#4: 2015 12 12 14
Now, the number of vessels per year:
dt[, .(vessels = uniqueN(vessel)), Year]
# Year vessels
#1: 2012 2
#2: 2013 2
#3: 2014 2
#4: 2015 2
Finally, the two results need to be combined with a join:
dcast(molten, Year ~ variable, sum)[dt[, .(vessels = uniqueN(vessel)), Year], on = "Year"]
# Year x.1 x.2 x.3 vessels
#1: 2012 6 6 8 2
#2: 2013 8 8 10 2
#3: 2014 10 10 12 2
#4: 2015 12 12 14 2
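The measure.vars parameter of melt() lets you define, select, or restrict the relevant measure columns. The subset parameter of dcast() lets you select particular measure variables or exclude others.
This allows for fancy things like: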
dcast(molten, Year ~ variable, list(mean, sum, max), subset = .(variable == "x.2")
)[dt[, .(vessels = uniqueN(vessel)), Year], on = "Year"]
# Year value_mean_x.2 value_sum_x.2 value_max_x.2 vessels
#1: 2012 3 6 5 2
#2: 2013 4 8 6 2
#3: 2014 5 10 7 2
#4: 2015 6 12 8 2
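The measure.vars route mentioned above could look roughly like this. It is only a sketch: the pattern "^x\\.[13]$" and the names measure_cols, molten_sel, and res are made up for illustration, and the data is the example table from the question.

```r
library(data.table)

dt <- data.table(x.1 = 1:8, x.2 = 1:8, x.3 = 2:9,
                 vessel = rep(letters[1:2], each = 4),
                 Year = rep(2012:2015, 2))

# Pick the measure columns by name pattern (here: only x.1 and x.3).
measure_cols <- grep("^x\\.[13]$", names(dt), value = TRUE)

# Columns that are neither id.vars nor measure.vars (x.2, vessel) are dropped.
molten_sel <- melt(dt, id.vars = "Year", measure.vars = measure_cols)

# Sum the remaining measure columns per year.
res <- dcast(molten_sel, Year ~ variable, fun.aggregate = sum, value.var = "value")
res
```

Restricting the columns in melt() keeps the long table small, whereas subset in dcast() filters an already molten table.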
Answer 2 (score: 0)
If you really need to do this for the sake of efficiency:
dt[, .SD
][, .N, .(vessel, Year)
][, .N, .(Year)
][, copy(dt)[.SD, vessels := i.N, on='Year']
][, vessel := NULL
][, melt(.SD, id.vars=c('Year', 'vessels'))
][, .(value=sum(value)), .(Year, vessels, variable)
][, dcast(.SD, ... ~ variable, value.var='value')
][, setcolorder(.SD, c(setdiff(colnames(.SD), 'vessels'), 'vessels'))
][order(Year)
]
Year x.1 x.2 x.3 vessels
1: 2012 6 6 8 2
2: 2013 8 8 10 2
3: 2014 10 10 12 2
4: 2015 12 12 14 2
Answer 3 (score: -1)
I can't fully solve your problem, but the column selection you wanted to do with grep can be done like this:
dt <- data.frame(x.1=1:8, x.2=1:8, x.3=2:9, vessel=rep(letters[1:2], each=4), Year=rep(2012:2015, 2))
dt
# grepl() is vectorised over the column names, so no lapply() is needed
dt[grepl("x", colnames(dt))]
Then you can perform whatever operation you want on the filtered data frame.
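As one possible way to carry on from the filtered data frame, here is a sketch in base R that sums the selected columns per year with aggregate(). The names df, x_cols, and res are illustrative, and this covers only the sums, not the vessel count:

```r
df <- data.frame(x.1 = 1:8, x.2 = 1:8, x.3 = 2:9,
                 vessel = rep(letters[1:2], each = 4),
                 Year = rep(2012:2015, 2))

# Logical vector marking the columns whose names start with "x".
x_cols <- grepl("^x", colnames(df))

# Sum the selected columns within each year.
res <- aggregate(df[x_cols], by = list(Year = df$Year), FUN = sum)
res
```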