I have a data frame with a structure similar to this:
Year <- c("2000", "2001", "2002" ,"2003", "2004", "2005" ,"2006", "2007", "2008", "2009", "2010", "2011" ,"2012", "2013", "2014", "2015")
Sales <- c(2000,4800,6700,5000,7000,8000,3070,2000,1800,7100,6600,5000,6000,4200,1200,5700)
salesDF <- data.frame(Year,Sales)
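The Year column is a factor variable. I want to add a new column that groups the observations in the Year column into 5-year intervals. Ultimately, I want to plot the trend in Sales at multiples of 5 years.
I would like my legend to show the intervals "2000", "2005", "2010", "2015".
How can I achieve this?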
Answer 0 (score: 6)
Here is a simple way to form the groups using cumsum and the modulo operator (%%):
library(dplyr)
salesDF %>%
  # Start a new group at every year that is a multiple of 5
  mutate(Group = cumsum(as.numeric(as.character(Year)) %% 5 == 0)) %>%
  group_by(Group) %>%
  summarize(Year = first(Year), Mean = mean(Sales), Sum = sum(Sales))
# A tibble: 4 x 4
Group Year Mean Sum
<int> <fct> <dbl> <dbl>
1 1 2000 5100 25500
2 2 2005 4394 21970
3 3 2010 4600 23000
4 4 2015 5700 5700
Or as new columns, without summarizing:
salesDF %>%
  mutate(Group = cumsum(as.numeric(as.character(Year)) %% 5 == 0)) %>%
  group_by(Group) %>%
  # mutate() keeps all 16 rows and repeats the group statistics on each row
  mutate(Mean = mean(Sales), Sum = sum(Sales))
# A tibble: 16 x 5
# Groups: Group [4]
Year Sales Group Mean Sum
<fct> <dbl> <int> <dbl> <dbl>
1 2000 2000 1 5100 25500
2 2001 4800 1 5100 25500
3 2002 6700 1 5100 25500
...
14 2013 4200 3 4600 23000
15 2014 1200 3 4600 23000
16 2015 5700 4 5700 5700
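To see why cumsum combined with %% produces the groups, it may help to look at the intermediate vectors on their own. The following is a minimal base R sketch; the names `years`, `starts`, and `group` are illustrative and do not appear in the answer.

```r
# Years from the question, as plain numbers
years <- 2000:2015

# TRUE wherever the year is a multiple of 5 (2000, 2005, 2010, 2015)
starts <- years %% 5 == 0

# cumsum() counts TRUE as 1, so the counter steps up by one
# at each multiple of 5 and stays constant in between
group <- cumsum(starts)

# Inspect the three vectors side by side
print(data.frame(years, starts, group))
```

Note that this relies on the first year being a multiple of 5. If the data started at, say, 2001, the leading years would fall into a shorter group numbered 0.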
Answer 1 (score: 3)
You can use cut / findInterval to split the data into 5-year groups.
library(dplyr)
# Requires a numeric Year column (see the Data note at the end of this answer)
salesDF %>%
  group_by(grp = findInterval(Year, seq(min(Year), max(Year), 5))) %>%
  summarise(Year = first(Year), Sales = sum(Sales)) %>%
  ungroup() %>%
  select(-grp)
# A tibble: 4 x 2
# Year Sales
# <dbl> <dbl>
#1 2000 25500
#2 2005 21970
#3 2010 23000
#4 2015 5700
Or with data.table:
library(data.table)
setDT(salesDF)[, .(Year = first(Year), Sales = sum(Sales)),
.(findInterval(Year, seq(min(Year), max(Year), 5)))]
Data
Convert the Year column to numeric:
salesDF$Year <- as.numeric(as.character(salesDF$Year))
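The answer mentions cut but only demonstrates findInterval. As a rough sketch of the cut variant, the following uses base R only and rebuilds the question's data so that it runs on its own. The names `breaks`, `grp`, and `totals` are illustrative, and the sketch assumes the smallest year is the desired start of the first interval.

```r
# Data from the question, with Year already numeric
Year  <- 2000:2015
Sales <- c(2000, 4800, 6700, 5000, 7000, 8000, 3070, 2000,
           1800, 7100, 6600, 5000, 6000, 4200, 1200, 5700)

# Break points every 5 years, with one extra break past the last year
breaks <- seq(min(Year), max(Year) + 5, by = 5)

# right = FALSE makes the intervals [2000, 2005), [2005, 2010), ...
# Each interval is labelled with its starting year
grp <- cut(Year, breaks = breaks, right = FALSE,
           labels = as.character(head(breaks, -1)))

# Total sales per 5-year interval
totals <- tapply(Sales, grp, sum)
print(totals)
```

Because `grp` is a factor whose levels are the starting years, it could also serve directly as a grouping or legend variable.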
Answer 2 (score: -1)
Base R solution:
# Round each year down to the start of its 5-year interval, then aggregate
agg_sales <- data.frame(do.call("cbind", aggregate(
  . ~ Year,
  within(salesDF, {Year <- (as.numeric(as.character(Year)) %/% 5) * 5}),
  FUN = function(x) {
    c(total_sales = sum(x, na.rm = TRUE), avg_sales = mean(x, na.rm = TRUE))
  })))
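The core of this approach is the integer division: `(year %/% 5) * 5` rounds each year down to the nearest multiple of 5. A simplified sketch of just that step, with a plain sum per interval, might look like the following. The names `bucket`, `df`, and `agg` are illustrative and not part of the answer.

```r
# Data from the question, with Year stored as character
Year  <- as.character(2000:2015)
Sales <- c(2000, 4800, 6700, 5000, 7000, 8000, 3070, 2000,
           1800, 7100, 6600, 5000, 6000, 4200, 1200, 5700)

# Integer division by 5, then multiplication by 5, maps
# 2000..2004 -> 2000, 2005..2009 -> 2005, and so on
bucket <- (as.numeric(Year) %/% 5) * 5

# Sum the sales within each 5-year bucket
df  <- data.frame(bucket, Sales)
agg <- aggregate(Sales ~ bucket, data = df, FUN = sum)
print(agg)
```

Unlike the cumsum approach, this does not depend on where the data starts: each year is assigned to its interval independently of the rows before it.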