Grouping factor observations in R

Time: 2020-05-26 23:24:01

Tags: r dplyr

I have a data frame with a structure similar to this:

Year <- c("2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007",
          "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015")
Sales <- c(2000, 4800, 6700, 5000, 7000, 8000, 3070, 2000,
           1800, 7100, 6600, 5000, 6000, 4200, 1200, 5700)
# stringsAsFactors = TRUE keeps Year a factor on R >= 4.0, where the default is FALSE
salesDF <- data.frame(Year, Sales, stringsAsFactors = TRUE)

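The Year column is a factor variable. I want to mutate a new column that groups the observations in the Year column into 5-year intervals. Ultimately, I want to plot the sales trend by 5-year interval.

I would like my legend to show the intervals "2000", "2005", "2010", "2015".

How can I achieve this?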

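3 answers:

Answer 0: (score: 6)

Here is a simple approach that builds the groups with cumsum and the modulo operator (%%). It assumes the rows are sorted by year: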

library(dplyr)

salesDF %>% 
  # a new group starts at every year divisible by 5
  mutate(Group = cumsum(as.numeric(as.character(Year)) %% 5 == 0)) %>%
  group_by(Group) %>%
  summarize(Year = first(Year), Mean = mean(Sales), Sum = sum(Sales))
# A tibble: 4 x 4
  Group Year   Mean   Sum
  <int> <fct> <dbl> <dbl>
1     1 2000   5100 25500
2     2 2005   4394 21970
3     3 2010   4600 23000
4     4 2015   5700  5700

Or as new columns, without summarizing:

salesDF %>% 
  mutate(Group = cumsum(as.numeric(as.character(Year)) %% 5 == 0)) %>%
  group_by(Group) %>%
  mutate(Mean = mean(Sales), Sum = sum(Sales))
# A tibble: 16 x 5
# Groups:   Group [4]
   Year  Sales Group  Mean   Sum
   <fct> <dbl> <int> <dbl> <dbl>
 1 2000   2000     1  5100 25500
 2 2001   4800     1  5100 25500
 3 2002   6700     1  5100 25500
...
14 2013   4200     3  4600 23000
15 2014   1200     3  4600 23000
16 2015   5700     4  5700  5700
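
To see why the cumsum/modulo trick produces the groups, it may help to look at the intermediate vectors. This is only a sketch: it assumes the years are sorted in ascending order, and `yr`, `starts`, and `grp` are names introduced here for illustration.

```r
# Intermediate steps of the cumsum/modulo grouping (illustrative names)
Year <- factor(as.character(2000:2015))

yr     <- as.numeric(as.character(Year))  # factor -> character -> numeric
starts <- yr %% 5 == 0                    # TRUE at 2000, 2005, 2010, 2015
grp    <- cumsum(starts)                  # counter increases by 1 at each TRUE

# Number of years that fall into each group
table(grp)
```

If the first year were not a multiple of 5, the leading rows would land in group 0, so the labels taken with first(Year) would need checking in that case.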

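Answer 1: (score: 3)

You can use cut / findInterval to split the data into 5-year groups. The code below assumes Year has been converted to numeric (see Data at the end of this answer).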

library(dplyr)

salesDF %>%
  group_by(grp = findInterval(Year, seq(min(Year), max(Year), 5))) %>%
  summarise(Year = first(Year), Sales = sum(Sales)) %>%
  ungroup() %>%
  select(-grp)

# A tibble: 4 x 2
#  Year  Sales
#  <chr> <dbl>
#1 2000  25500
#2 2005  21970
#3 2010  23000
#4 2015   5700
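
The answer names cut but only demonstrates findInterval. A minimal sketch of the cut variant follows, assuming numeric years and left-closed intervals; `breaks`, `Period`, and `totals` are illustrative names, and the upper break of 2020 is chosen here only so that 2015 gets its own interval.

```r
# Sketch: 5-year bins with cut(), labelled by the first year of each bin
Year  <- 2000:2015
Sales <- c(2000, 4800, 6700, 5000, 7000, 8000, 3070, 2000,
           1800, 7100, 6600, 5000, 6000, 4200, 1200, 5700)

breaks <- seq(2000, 2020, by = 5)   # 2000, 2005, 2010, 2015, 2020
Period <- cut(Year,
              breaks = breaks,
              right  = FALSE,       # intervals are [2000, 2005), [2005, 2010), ...
              labels = as.character(head(breaks, -1)))

# Total sales per 5-year period
totals <- tapply(Sales, Period, sum)
totals
```

Because Period is a factor whose levels are "2000", "2005", "2010", "2015", it could serve directly as the legend variable the question asks about.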

Or with data.table:

library(data.table)
setDT(salesDF)[, .(Year = first(Year), Sales = sum(Sales)), 
                  .(findInterval(Year, seq(min(Year), max(Year), 5)))]

Data

The Year column was converted to numeric:

salesDF$Year <- as.numeric(as.character(salesDF$Year))
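
The detour through as.character matters because calling as.numeric directly on a factor returns its internal level codes rather than the year values. A small illustration, where `f` is a hypothetical example vector:

```r
# Why the factor is converted via as.character first
f <- factor(c("2000", "2005", "2010"))

codes <- as.numeric(f)                # internal level codes: 1 2 3
years <- as.numeric(as.character(f))  # actual values: 2000 2005 2010

codes
years
```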

Answer 2: (score: -1)

Base R solution:

agg_sales <- data.frame(do.call("cbind", (aggregate(. ~ Year, 
                       within(salesDF, {Year <- floor(as.numeric(as.character(Year)) %/% 5) * 5}),
                       FUN = function(x) {
                         c(total_sales = sum(x, na.rm = TRUE), avg_sales = mean(x, na.rm = TRUE))}))))
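
The core of this approach is flooring each year to the nearest lower multiple of 5 with integer division. A simplified sketch of the same idea follows; it computes only the total, not the mean, and the `Period` column and `agg` name are introduced here for illustration.

```r
# Sketch: floor each year to a multiple of 5, then aggregate in base R
Year  <- factor(as.character(2000:2015))
Sales <- c(2000, 4800, 6700, 5000, 7000, 8000, 3070, 2000,
           1800, 7100, 6600, 5000, 6000, 4200, 1200, 5700)
salesDF <- data.frame(Year, Sales)

# %/% is integer division, so 2003 %/% 5 * 5 gives 2000
salesDF$Period <- (as.numeric(as.character(salesDF$Year)) %/% 5) * 5

# Total sales per 5-year period
agg <- aggregate(Sales ~ Period, data = salesDF, FUN = sum)
agg
```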