Count and group by month and year using dplyr and lubridate

Asked: 2021-01-04 01:55:26

Tags: r dplyr group-by count lubridate

I have a data frame in which each row represents one event that occurred in a city. It holds the city name and the date of the event, like this:

df <- data.frame(city = c("Seattle", "Seattle", "Seattle", "Seattle", "Seattle", "NYC", "NYC", "NYC", "Chicago",
                         "Chicago", "Chicago", "Chicago", "Chicago"),
                     date_of_event = c("01/13/2011", "01/17/2011", "03/15/2011", "05/21/2011", "05/23/2011",
                                      "01/20/2011", "01/22/2011", "03/23/2011", "01/18/2011", "02/24/2011",
                                       "02/26/2011", "04/30/2011", "06/18/2011"),
                     stringsAsFactors = FALSE)

df$date_of_event <- as.Date(df$date_of_event, "%m/%d/%Y")

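The above is just an example; my actual data is in a CSV with thousands of rows, many cities, many dates, and so on. What I want is a new data frame with one row for every combination of city and month/year represented in the dataset, plus a count column showing how many events occurred in each city in each month of the original data frame. The second data frame would look like this: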

df2 <- data.frame(city = c("Seattle", "Seattle", "Seattle", "Seattle", "Seattle", "Seattle", "NYC", "NYC", "NYC", "NYC",
                           "NYC", "NYC", "Chicago", "Chicago", "Chicago", "Chicago", "Chicago", "Chicago"),
                     month_year = c("01/01/2011", "02/01/2011", "03/01/2011", "04/01/2011", "05/01/2011", "06/01/2011",
                                    "01/01/2011", "02/01/2011", "03/01/2011", "04/01/2011", "05/01/2011", "06/01/2011",
                                    "01/01/2011", "02/01/2011", "03/01/2011", "04/01/2011", "05/01/2011", "06/01/2011"),
                  count = c(2, 0, 1, 0, 2, 0, 2, 0, 1, 0, 0, 0, 1, 2, 0, 1, 0, 1),
                     stringsAsFactors = FALSE)

df2$month_year <- as.Date(df2$month_year, "%m/%d/%Y")

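I know you can use count from dplyr, and that lubridate can round dates down to the first day of each month, but I have tried and failed to get the grouping and counting right to produce the second data frame I want.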

1 Answer:

Answer 0: (score: 1)

You can try this:

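library(tidyverse)
library(lubridate)

# round each event date down to the first day of its month
df3 <- df %>% mutate(new_date = floor_date(date_of_event, "month"))
# cross-tabulate city by month (column 2, date_of_event, is dropped); empty combinations get Freq = 0
tt <- as.data.frame(table(df3[-2]))
# cities in reverse alphabetical order, months ascending within each city
tt[order(desc(tt$city), tt$new_date),]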

      city   new_date Freq
   Seattle 2011-01-01    2
   Seattle 2011-02-01    0
   Seattle 2011-03-01    1
   Seattle 2011-04-01    0
   Seattle 2011-05-01    2
   Seattle 2011-06-01    0
       NYC 2011-01-01    2
       NYC 2011-02-01    0
       NYC 2011-03-01    1
       NYC 2011-04-01    0
       NYC 2011-05-01    0
       NYC 2011-06-01    0
   Chicago 2011-01-01    1
   Chicago 2011-02-01    2
   Chicago 2011-03-01    0
   Chicago 2011-04-01    1
   Chicago 2011-05-01    0
   Chicago 2011-06-01    1
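Since the question mentions count from dplyr, here is a rough sketch of the same result built with count plus complete from tidyr. This is not part of the original answer; the column names month_year and count are taken from the desired output in the question, and the month range is assumed to run from the earliest to the latest month present in the data.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# sample data from the question
df <- data.frame(
  city = c(rep("Seattle", 5), rep("NYC", 3), rep("Chicago", 5)),
  date_of_event = c("01/13/2011", "01/17/2011", "03/15/2011", "05/21/2011", "05/23/2011",
                    "01/20/2011", "01/22/2011", "03/23/2011", "01/18/2011", "02/24/2011",
                    "02/26/2011", "04/30/2011", "06/18/2011"),
  stringsAsFactors = FALSE
)
df$date_of_event <- as.Date(df$date_of_event, "%m/%d/%Y")

counts <- df %>%
  # round each date down to the first day of its month
  mutate(month_year = floor_date(date_of_event, "month")) %>%
  # one row per city/month that actually has events
  count(city, month_year, name = "count") %>%
  # add the missing city/month combinations with a count of 0
  # (assumption: the range spans the first to the last month in the data)
  complete(city,
           month_year = seq(min(month_year), max(month_year), by = "month"),
           fill = list(count = 0L))
```

Unlike the table approach, month_year stays a Date column here, which may be convenient for later plotting or filtering.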

To include an extended period with zero counts, you can try the following:

# assign a name to the output obtained previously
df4 <- tt[order(desc(tt$city), tt$new_date),]

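a <- mdy("01/01/11") # starting period (2011-01-01)
b <- a + months(0:92)  # 93 consecutive month starts, Jan 2011 to Sep 2018

# every city/month combination over the extended period
df5 <- expand.grid(city = c("Chicago", "Seattle", "NYC"), new_date = as.factor(b))

# combinations that have no row in df4 yet
df6 <- setdiff(df5, df4[-3])
df6$Freq <- 0 # assign zero count

df7 <- rbind(df4, df6)

# sort by city, then by month
df8 <- df7[order(df7$city, df7$new_date), ]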
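A possibly shorter route to the extended range, sketched here as an alternative rather than as the answer's own method: table() counts every level of a factor, including unused ones, so giving the month column all 93 months as levels yields the zero rows directly. The names all_months and month_start are made up for this illustration, and the 93-month span is assumed to match the sequence above.

```r
library(lubridate)

# sample data from the question
df <- data.frame(
  city = c(rep("Seattle", 5), rep("NYC", 3), rep("Chicago", 5)),
  date_of_event = c("01/13/2011", "01/17/2011", "03/15/2011", "05/21/2011", "05/23/2011",
                    "01/20/2011", "01/22/2011", "03/23/2011", "01/18/2011", "02/24/2011",
                    "02/26/2011", "04/30/2011", "06/18/2011"),
  stringsAsFactors = FALSE
)
df$date_of_event <- as.Date(df$date_of_event, "%m/%d/%Y")

# first day of the month for each event
month_start <- floor_date(df$date_of_event, "month")

# hypothetical extended range: 93 month starts beginning 2011-01-01
all_months <- seq(as.Date("2011-01-01"), by = "month", length.out = 93)

# a factor carrying every month as a level, so table() emits zero counts too
month_factor <- factor(as.character(month_start), levels = as.character(all_months))

tt <- as.data.frame(table(city = df$city, new_date = month_factor),
                    stringsAsFactors = FALSE)

# table() returns the months as text; convert them back to Date
tt$new_date <- as.Date(tt$new_date)
tt <- tt[order(tt$city, tt$new_date), ]
```

This avoids the setdiff and rbind steps, and leaves new_date as a Date instead of a factor.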
