I have a data frame like this:
d
  year pos days  sal
1 2009   A   31 2000
2 2009   B   60 4000
3 2009   C   10  600
4 2010   B   10 1000
5 2010   D   90 7000
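For anyone who wants to run the answers below, the data frame can be recreated roughly as follows. This is only a sketch: the column types (integer for year, days and sal, character for pos) are an assumption taken from the tibble header printed in the dplyr answer, not something the question states.

```r
# Recreate the example data frame `d` from the printed output.
# Assumed types: integer year/days/sal, character pos.
d <- data.frame(
  year = c(2009L, 2009L, 2009L, 2010L, 2010L),
  pos  = c("A", "B", "C", "B", "D"),
  days = c(31L, 60L, 10L, 10L, 90L),
  sal  = c(2000L, 4000L, 600L, 1000L, 7000L),
  stringsAsFactors = FALSE  # keep pos as character on older R versions
)
d
```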
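I would like to group the data by year, sum days and sal within each group, and select the pos whose days value is the largest in the group.
The result should be:
  year pos days  sal
1 2009   B  101 6600
2 2010   D  100 8000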
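I can handle the numeric columns such as days and sal with functions such as tapply(d$days, d$year, sum).
However, I do not know how to pick the pos that satisfies the days condition and assign it to the group.
Any comments would be greatly appreciated!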
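Answer 0 (score: 2)
We can use dplyr. After grouping by 'year', take the 'pos' at the position where 'days' is largest (which.max(days)), together with the sum of 'days' and the sum of 'sal'.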
library(dplyr)
d %>%
group_by(year) %>%
summarise(pos = pos[which.max(days)], days = sum(days), sal = sum(sal))
# # A tibble: 2 × 4
# year pos days sal
# <int> <chr> <int> <int>
#1 2009 B 101 6600
#2 2010 D 100 8000
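One detail that matters if a group has tied maxima: which.max returns the index of the first maximum only, so the earliest row in the group wins. A small illustration in base R, using made-up values that are not part of the question's data:

```r
# Hypothetical values with a tie for the largest `days`.
days <- c(10L, 60L, 60L)
pos  <- c("A", "B", "C")

first_max <- which.max(days)      # index of the FIRST maximum: 2
pos[first_max]                    # "B"

# If every tied position is wanted, compare against max() instead.
all_max <- pos[days == max(days)] # "B" "C"
all_max
```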
Answer 1 (score: 1)
A base R solution:
m1 <- d[as.logical(with(d, ave(days, year, FUN = function(x) seq_along(x) == which.max(x)) )), c('year','pos')]
m2 <- aggregate(cbind(days, sal) ~ year, d, sum)
merge(m1, m2, by = 'year')
Or using the data.table package:
library(data.table)
setDT(d)[order(days), .(pos = pos[.N], days = sum(days), sal = sum(sal)), by = year]
The resulting data.frame / data.table:
  year pos days  sal
1 2009   B  101 6600
2 2010   D  100 8000
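The same split-by-group logic can also be written out step by step in base R, which may be easier to follow than the ave/merge combination above. This is an alternative sketch, not the answer's own method. It rebuilds `d` with assumed column types so that it runs on its own.

```r
# Example data (types assumed: integer numbers, character pos).
d <- data.frame(
  year = c(2009L, 2009L, 2009L, 2010L, 2010L),
  pos  = c("A", "B", "C", "B", "D"),
  days = c(31L, 60L, 10L, 10L, 90L),
  sal  = c(2000L, 4000L, 600L, 1000L, 7000L),
  stringsAsFactors = FALSE
)

# Split into one data frame per year, summarise each piece, then stack.
pieces <- lapply(split(d, d$year), function(g) {
  data.frame(
    year = g$year[1],
    pos  = g$pos[which.max(g$days)],  # pos of the row with the largest days
    days = sum(g$days),
    sal  = sum(g$sal),
    stringsAsFactors = FALSE
  )
})
res <- do.call(rbind, pieces)
rownames(res) <- NULL  # drop the "2009"/"2010" row names left by split()
res
```

On ties, this version keeps the first row with the largest days (which.max), whereas the data.table version sorts by days and takes the last row of each group, so the two can pick different pos values when a group has tied maxima.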
Answer 2 (score: 0)
Using sqldf:
library(sqldf)
cbind.data.frame(sqldf('select year, sum(days) as days, sum(sal) as sal
from d group by year'),
sqldf('select pos from d group by year having days=max(days)'))
  year days  sal pos
1 2009  101 6600   B
2 2010  100 8000   D