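R: cumulative sum by factor, with the sum 'resetting'

Date: 2014-10-28 13:11:25

Tags: r cumulative-sum

My problem is that I am trying to find the cumulative rainfall by season (DJF, MAM, JJA, SON) and year (1926 - 2000), with the sum resetting to zero at the end of each season.

I have managed to do this year by year using the code
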
rainfall$yearly.cumsum=unlist(tapply(rainfall$RR, rainfall$year, FUN=cumsum))

and tried to adapt it to seasons using

rainfall$seasonal.cumsum=unlist(tapply(rainfall$RR, .(season,year), transform, FUN=cumsum))

This returns the error

Error in unique.default(x, nmax = nmax) : 
unique() applies only to vectors

I have also tried this:

rainfall$seasonal.cumsum=unlist(tapply(rainfall$RR, rainfall$season, FUN=cumsum))

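This is more promising, since it does accumulate by season, but it does not reset when the season changes. That is, I think the code sums DJF across all years, then moves on to MAM across all years, then JJA and finally SON, rather than DJF for one year, reset, MAM for the same year, reset, and so on.

Here is part of the data frame. Note that yearly.cumsum is summing the values in the RR column, but seasonal.cumsum is not.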

    DATE  year   month season RR   yearly.cumsum   seasonal.cumsum
 19260529 1926    05    MAM    0          2347            2518
 19260530 1926    05    MAM    0          2347            2518
 19260531 1926    05    MAM    9          2356            2530
 19260601 1926    06    JJA    0          2356            2530
 19260602 1926    06    JJA    3          2359            2530
 19260603 1926    06    JJA   71          2430            2530
 19260604 1926    06    JJA    0          2430            2530
 19260605 1926    06    JJA   48          2478            2534

I hope my question is clear enough!

Thanks.

3 answers:

Answer 0 (score: 2)

You could try dplyr

library(dplyr)
rainfall %>% 
         group_by(season, year) %>%
         mutate(seasonal.cumsum=cumsum(RR))

#          DATE year month season RR yearly.cumsum seasonal.cumsum
#1 19260529 1926     5    MAM  0          2347               0
#2 19260530 1926     5    MAM  0          2347               0
#3 19260531 1926     5    MAM  9          2356               9
#4 19260601 1926     6    JJA  0          2356               0
#5 19260602 1926     6    JJA  3          2359               3
#6 19260603 1926     6    JJA 71          2430              74
#7 19260604 1926     6    JJA  0          2430              74
#8 19260605 1926     6    JJA 48          2478             122

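Update

Regarding grouping consecutive months that span a year boundary, you could try this (here the reset is on March 1, which starts a new year)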

 indx <- rainfall2$year-min(rainfall2$year) + rainfall2$month %in% c(1,2,12)
 indx1 <- cumsum(c(TRUE,diff(indx) <0))
 rainfall2$year2 <- indx1+ (min(rainfall2$year))

 res <-  rainfall2 %>%
                   group_by(season, year2) %>%
                   mutate(seasonal.cumsum=cumsum(RR))

 do.call(rbind,lapply(split(res, res$year2), head,2))
 #       DATE month year season  RR year2 seasonal.cumsum
 #1 19260504     5 1926    MAM  50  1927              50
 #2 19260505     5 1926    MAM  84  1927             134
 #3 19270301     3 1927    MAM  98  1928              98
 #4 19270302     3 1927    MAM 112  1928             210
 #5 19280301     3 1928    MAM  91  1929              91
 #6 19280302     3 1928    MAM  85  1929             176
 #7 19290301     3 1929    MAM  18  1930              18
 #8 19290302     3 1929    MAM 111  1930             129

Update 2

If you need the year to reset on December 1

 indx <- rainfall2$year-min(rainfall2$year) + !rainfall2$month %in% c(1,2,12)
 indx1 <- cumsum(c(TRUE,diff(indx) <0))
 rainfall2$year2 <- indx1+ (min(rainfall2$year)-1)      

 res2 <- rainfall2 %>%
        group_by(season, year2) %>%
        mutate(seasonal.cumsum=cumsum(RR))

  do.call(rbind,lapply(split(res2, res2$year2), head,2))
  #        DATE month year season  RR year2 seasonal.cumsum
  #1 19260504     5 1926    MAM  50  1926              50
  #2 19260505     5 1926    MAM  84  1926             134
  #3 19261201    12 1926    DJF 120  1927             120
  #4 19261202    12 1926    DJF  26  1927             146
  #5 19271201    12 1927    DJF 112  1928             112
  #6 19271202    12 1927    DJF  78  1928             190
  #7 19281201    12 1928    DJF  96  1929              96
  #8 19281202    12 1928    DJF  26  1929             122
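
A more direct way to get the same December-based grouping is to count each December towards the following year and group on that. The following is only a sketch, not the method used above: the column name `season_year` and the toy data are assumptions made for illustration, and base R's `ave()` is swapped in for `group_by()`/`mutate()`.

```r
# Toy data: one value per month over two years (values are made up)
toy <- data.frame(
  year  = rep(1926:1927, each = 12),
  month = rep(1:12, times = 2),
  RR    = 1
)
toy$season <- ifelse(toy$month %in% c(12, 1, 2), "DJF",
              ifelse(toy$month %in% 3:5, "MAM",
              ifelse(toy$month %in% 6:8, "JJA", "SON")))

# Hypothetical helper column: December counts towards the next year's DJF
toy$season_year <- toy$year + (toy$month == 12)

# Cumulative sum within each season/season_year group, in original row order
toy$seasonal.cumsum <- ave(toy$RR, toy$season, toy$season_year, FUN = cumsum)

# Dec 1926, Jan 1927 and Feb 1927 now form one DJF block
toy[toy$season == "DJF", c("year", "month", "season_year", "seasonal.cumsum")]
```

With this grouping the running total carries over from December into January and February, and restarts at the next December.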

Explanation

I think it is best to create a small dataset to make this easier to understand

 set.seed(24)
 df <- data.frame(month=rep(rep(1:12,each=4),3), year=rep(1926:1928, each=12*4))

First, we use %in% to check which elements of the df$month column are in c(1,2,12). It returns a logical vector that is TRUE for the elements equal to 1, 2 or 12. Negating it with ! turns TRUE into FALSE and vice versa, so here we are picking out the months that are not 1, 2 or 12.

 head(!df$month %in% c(1,2,12), 15)
 # [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
 #[13] TRUE TRUE TRUE

Next, we subtract the minimum year in the dataset from the year column to get values starting at 0.

 df$year-min(df$year)
 #[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 #[38] 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 #[75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 #[112] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

If we add the two, the TRUE/FALSE values of the first are coerced to the integers 1/0, and we get

 indx <- df$year-min(df$year) + !df$month %in% c(1,2,12)
 indx
 #[1] 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 #[38] 1 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 #[75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3
 #[112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2

In the second step, we first use diff to take the differences between adjacent elements of indx, which returns a vector one element shorter than indx. Then we check where the result is < 0. To make the lengths equal, we can prepend a value with c(TRUE,..)

 head(diff(indx),55)
 #[1] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 #[26] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -1 0 0 0 1 0 0
 #[51] 0 0 0 0 0

 head(c(TRUE,diff(indx) <0), 55)
 #[1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 #[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 #[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 #[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
 #[49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE

 head(cumsum(c(TRUE,diff(indx) <0)), 55)
 #[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 #[39] 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2

 indx1 <- cumsum(c(TRUE, diff(indx) <0))

From the previous step we get indx1, and then we add the minimum year

 head( indx1+ (min(df$year)),55)
 #[1] 1927 1927 1927 1927 1927 1927 1927 1927 1927 1927 1927 1927 1927 1927 1927
 #[16] 1927 1927 1927 1927 1927 1927 1927 1927 1927 1927 1927 1927 1927 1927 1927
 #[31] 1927 1927 1927 1927 1927 1927 1927 1927 1927 1927 1927 1927 1927 1927 1928
 #[46] 1928 1928 1928 1928 1928 1928 1928 1928 1928 1928

 indx2 <- indx1+ (min(df$year))
 split(df, indx2) #to check the results
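
As a quick way to see what the grouping index does, the steps above can be rerun on the small df and the months that land in each group listed. This is a sketch for checking only; `months_by_group` is a name introduced here for illustration.

```r
# Same small dataset as in the explanation
df <- data.frame(month = rep(rep(1:12, each = 4), 3),
                 year  = rep(1926:1928, each = 12 * 4))

# The three steps from the explanation
indx  <- df$year - min(df$year) + !df$month %in% c(1, 2, 12)
indx1 <- cumsum(c(TRUE, diff(indx) < 0))
indx2 <- indx1 + min(df$year)

# Months that fall into each group, in order of appearance
months_by_group <- tapply(df$month, indx2, unique)
months_by_group
```

Every group after the first starts at month 12, which shows that a new group begins each December.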

Data

rainfall <- structure(list(DATE = c(19260529L, 19260530L, 19260531L, 19260601L, 
 19260602L, 19260603L, 19260604L, 19260605L), year = c(1926L, 
 1926L, 1926L, 1926L, 1926L, 1926L, 1926L, 1926L), month = c(5L, 
 5L, 5L, 6L, 6L, 6L, 6L, 6L), season = c("MAM", "MAM", "MAM", 
 "JJA", "JJA", "JJA", "JJA", "JJA"), RR = c(0L, 0L, 9L, 0L, 3L, 
 71L, 0L, 48L), yearly.cumsum = c(2347L, 2347L, 2356L, 2356L, 
 2359L, 2430L, 2430L, 2478L), seasonal.cumsum = c(2518L, 2518L, 
 2530L, 2530L, 2530L, 2530L, 2530L, 2534L)), .Names = c("DATE", 
 "year", "month", "season", "RR", "yearly.cumsum", "seasonal.cumsum"
 ), class = "data.frame", row.names = c(NA, -8L))

New data

 DATE= format(seq(as.Date("1926-05-04"), length.out=1200, by='1 day'), '%Y%m%d')
 month <- as.numeric(substr(DATE,5,6))
 year <- as.numeric(substr(DATE,1,4))
 season <- ifelse(month %in% c(12,1,2), 'DJF', 
         ifelse(month %in% 3:5, 'MAM', ifelse(month %in% 6:8, 'JJA','SON')))
 set.seed(25)
 RR <- sample(0:120, 1200, replace=TRUE)

 rainfall2 <- data.frame(DATE, month, year, season, RR, stringsAsFactors=FALSE)

Answer 1 (score: 2)

Try data.table:

> library(data.table)
> ddt = data.table(rainfall)
> ddt[,scumsum:=cumsum(RR),by=list(season,year)]
> ddt
       DATE year month season RR yearly.cumsum seasonal.cumsum scumsum
1: 19260529 1926     5    MAM  0          2347            2518       0
2: 19260530 1926     5    MAM  0          2347            2518       0
3: 19260531 1926     5    MAM  9          2356            2530       9
4: 19260601 1926     6    JJA  0          2356            2530       0
5: 19260602 1926     6    JJA  3          2359            2530       3
6: 19260603 1926     6    JJA 71          2430            2530      74
7: 19260604 1926     6    JJA  0          2430            2530      74
8: 19260605 1926     6    JJA 48          2478            2534     122

Answer 2 (score: 1)

You can actually do it with tapply without creating yearly.cumsum (although I agree that tapply behaves a bit awkwardly by reversing the order)

transform(rainfall, 
          seasonal.cumsum = 
          unlist(rev(tapply(RR, list(season, year), FUN = cumsum))))
#       DATE year month season RR yearly.cumsum seasonal.cumsum
# 1 19260529 1926     5    MAM  0          2347               0
# 2 19260530 1926     5    MAM  0          2347               0
# 3 19260531 1926     5    MAM  9          2356               9
# 4 19260601 1926     6    JJA  0          2356               0
# 5 19260602 1926     6    JJA  3          2359               3
# 6 19260603 1926     6    JJA 71          2430              74
# 7 19260604 1926     6    JJA  0          2430              74
# 8 19260605 1926     6    JJA 48          2478             122
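
Because `unlist(tapply(...))` returns the values ordered by group rather than by row, the `rev()` step above relies on how the two groups happen to be laid out in this eight-row sample. A base R alternative that keeps the original row order is `ave()`. The sketch below assumes the question's column names and uses a small made-up data frame called `rain`.

```r
# Made-up sample with the question's column names; rows are in date order
rain <- data.frame(
  year   = c(1926, 1926, 1926, 1926, 1927, 1927),
  season = c("MAM", "MAM", "JJA", "JJA", "MAM", "MAM"),
  RR     = c(0, 9, 3, 71, 5, 2)
)

# ave() applies FUN within each season/year combination and returns the
# results in the original row order, so no reordering step is needed
rain$seasonal.cumsum <- ave(rain$RR, rain$season, rain$year, FUN = cumsum)
rain
```

The total restarts when the season changes within 1926 and again when MAM comes round in 1927.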