Rolling conditional count with dplyr

Asked: 2018-11-20 12:51:10

Tags: r dplyr

My data frame looks like this:

df <- data.frame(
Item=c("A","A","A","A","A","B","B","B","B","B"),
Date=c("2018-1-1","2018-2-1","2018-3-1","2018-4-1","2018-5-1","2018-1-1","2018-2-1",
      "2018-3-1","2018-4-1","2018-5-1"),
Value=rnorm(10))

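I want to mutate a new column, grouped by Item, that counts how many values are greater than 0 within a window of 3 (or any other integer I specify).

I am familiar with the tidyverse, so a dplyr solution would be most welcome.

3 answers:

Answer 0 (score: 3)

If you want to roll anything, consider using the zoo package.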

# Computed over the whole column, so windows ignore the Item groups
df$new <- zoo::rollsum(df$Value > 0, 3, fill = NA)

#   Item     Date      Value new
#1     A 2018-1-1  0.5852699  NA
#2     A 2018-2-1 -0.7383377   1
#3     A 2018-3-1 -0.3157693   1
#4     A 2018-4-1  1.2475237   1
#5     A 2018-5-1 -1.5479757   1
#6     B 2018-1-1 -0.6913331   0
#7     B 2018-2-1 -0.2423809   0
#8     B 2018-3-1 -1.6363024   0
#9     B 2018-4-1 -0.3256263   1
#10    B 2018-5-1  0.3563144  NA

You can choose where the window sits relative to the current row. Take a look at the argument align = c("center", "left", "right").
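To make the effect of align concrete, here is a minimal sketch on a small hand-written logical vector. The vector `pos` is made up for illustration and is not taken from the data above; the sketch assumes the zoo package is installed.

```r
library(zoo)

# Hypothetical indicator vector: TRUE where a value would be > 0
pos <- c(TRUE, FALSE, FALSE, TRUE, TRUE)

# Centered window (the default): previous, current and next element
centered <- rollsum(pos, 3, fill = NA, align = "center")
# NA 1 1 2 NA

# Right-aligned window: current element and the two before it
trailing <- rollsum(pos, 3, fill = NA, align = "right")
# NA NA 1 1 2

# Left-aligned window: current element and the two after it
leading <- rollsum(pos, 3, fill = NA, align = "left")
# 1 1 2 NA NA
```

For time-ordered data, a right-aligned window is often what people mean by "the last 3 observations", since it never looks ahead.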


As a dplyr chain, which respects the grouping by Item:

df %>% group_by(Item) %>% dplyr::mutate( new = zoo::rollsum( Value > 0, 3, fill = NA ))
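Since the question asks for a configurable window size, the chain can be wrapped in a small helper. This is only a sketch: `count_positive`, its arguments, and the `toy` data frame are names and values invented here for illustration, and a fixed data set is used so the result is reproducible.

```r
library(dplyr)

# Hypothetical helper: rolling count of values > 0 over `width` elements
count_positive <- function(x, width = 3, align = "center") {
  zoo::rollsum(x > 0, width, fill = NA, align = align)
}

# Made-up data with the same shape as the question's data frame
toy <- data.frame(
  Item  = rep(c("A", "B"), each = 5),
  Value = c(1, -1, 2, 3, -2, -1, -1, 4, 5, 6)
)

result <- toy %>%
  group_by(Item) %>%                               # windows restart for each Item
  mutate(new = count_positive(Value, width = 3)) %>%
  ungroup()

result$new
# NA 2 2 2 NA NA 1 2 3 NA
```

Changing `width` or `align` in the call is then enough to try a different window without touching the rest of the pipeline.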

Answer 1 (score: 1)

You can use the RcppRoll package.

library(RcppRoll)
df$new <- RcppRoll::roll_sum(df$Value > 0, 3, fill = NA)

Using the tidyverse:

df %>% 
  group_by(Item) %>% 
  dplyr::mutate(new = RcppRoll::roll_sum(Value > 0, 3, fill = NA))

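In terms of speed, it is faster than the zoo package. In the benchmark below, the median time was about 2.8 ms for RcppRoll versus 12.4 ms for zoo: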

n <- 10000
df <- data.frame(
  Item = sample(LETTERS, n, replace = TRUE),
  Value = rnorm(n))

df_grouped <- df %>% 
  group_by(Item)
microbenchmark::microbenchmark(
  RcppRoll = df_grouped <- df_grouped %>% dplyr::mutate(new_RcppRoll = RcppRoll::roll_sum(Value > 0, 3, fill = NA)),
  zoo = df_grouped <- df_grouped %>% dplyr::mutate(new_zoo = zoo::rollsum( Value > 0, 3, fill = NA ))
)

Results:

Unit: milliseconds
     expr       min        lq      mean   median        uq       max neval
 RcppRoll  2.509003  2.741993  2.929227  2.83913  2.983726  5.832962   100
      zoo 11.172920 11.785113 13.288970 12.43320 13.607826 25.879754   100

all.equal(df_grouped$new_RcppRoll, df_grouped$new_zoo)
TRUE

Answer 2 (score: 0)

  Item  Date       Value
   <fct> <date>     <int>
 1 A     2018-01-01     3
 2 B     2018-01-01     2
 3 B     2018-02-01    -5
 4 A     2018-02-01    -3
 5 A     2018-03-01     4
 6 B     2018-03-01    -2
 7 A     2018-04-01     5
 8 B     2018-04-01     0
 9 A     2018-05-01     1
10 B     2018-05-01    -4

For clarity, I replaced the rnorm example with sample(-5:5):

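df <- df %>%
  mutate(greater_than = (Value > 0) * Value) %>%  # keep positive values, zero out the rest
  group_by(Item) %>%
  arrange(Date) %>%
  # right-aligned window of 3; partial = TRUE allows shorter windows at the start
  mutate(greater_than = zoo::rollapplyr(greater_than, 3, sum, partial = TRUE))
df %>% arrange(Item) %>% head(10)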

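The result should look like this (note that greater_than here is a rolling sum of the positive values, not a count of them):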

 1 A     2018-01-01     3            3
 2 A     2018-02-01    -3            3
 3 A     2018-03-01     4            7
 4 A     2018-04-01     5            9
 5 A     2018-05-01     1           10
 6 B     2018-01-01     2            2
 7 B     2018-02-01    -5            2
 8 B     2018-03-01    -2            2
 9 B     2018-04-01     0            0
10 B     2018-05-01    -4            0
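The code in this answer sums the positive values in each window. If a count is wanted instead, as the question asks, one possible variant is to roll over the logical vector Value > 0 directly. The sketch below re-enters the printed example values by hand; `toy`, `counts`, and `n_positive` are names chosen for illustration.

```r
library(dplyr)

# Same values as the printed example in this answer, typed in by hand
toy <- data.frame(
  Item  = c("A", "B", "B", "A", "A", "B", "A", "B", "A", "B"),
  Date  = as.Date(c("2018-01-01", "2018-01-01", "2018-02-01", "2018-02-01",
                    "2018-03-01", "2018-03-01", "2018-04-01", "2018-04-01",
                    "2018-05-01", "2018-05-01")),
  Value = c(3L, 2L, -5L, -3L, 4L, -2L, 5L, 0L, 1L, -4L)
)

counts <- toy %>%
  arrange(Item, Date) %>%          # order rows within each Item by date
  group_by(Item) %>%
  # right-aligned window of 3; partial = TRUE uses shorter windows at the start
  mutate(n_positive = zoo::rollapplyr(Value > 0, 3, sum, partial = TRUE)) %>%
  ungroup()

counts$n_positive
# 1 1 2 2 3 1 1 1 0 0
```

With partial = TRUE the first two rows of each group are counted over windows of 1 and 2 elements rather than being filled with NA, which may or may not be what is wanted.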