Aggregating hourly counts in dplyr

Time: 2018-03-08 17:12:37

Tags: r

I have the following data frame in R:

Date                      ID      
01-01-2017 12:39:00       CDF
01-01-2017 01:39:00       WED
01-01-2017 02:39:00       QWE
01-01-2017 05:39:00       TYU
01-01-2017 17:39:00       ERT
02-01-2017 02:30:34       DEF   

I want to count the number of IDs per hour. My desired data frame is:

Date           hours               Count
01-01-2017     00:00 - 01:00       1
01-01-2017     01:00 - 02:00       1
01-01-2017     02:00 - 03:00       1
01-01-2017     03:00 - 04:00       0
01-01-2017     04:00 - 05:00       0
01-01-2017     05:00 - 06:00       1
.
01-01-2017     23:00 - 00:00       0 
.
02-01-2017     12:00 - 01:00       0 
02-01-2017     01:00 - 02:00       0
02-01-2017     02:00 - 03:00       1

If no ID falls into an hour, I want that hourly bucket to be zero. Every date should contain all 24 hourly buckets.

How can I achieve this in R?

2 answers:

Answer 0: (score: 1)

Here is one approach using lubridate and base R.

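In the data set you provided, the first observation is 01-01-2017 12:39:00, but your desired output has a count for 00:00 - 01:00. In the code below, 12:39:00 would be treated as 12:39 PM, so I am assuming you meant 00:39:00. Let me know if that is not the case.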

library(lubridate)
# the data
txt <- "Date,ID
01-01-2017 00:39:00,CDF
01-01-2017 01:39:00,WED
01-01-2017 02:39:00,QWE
01-01-2017 05:39:00,TYU
01-01-2017 17:39:00,ERT
02-01-2017 02:30:34,DEF"

df <- read.table(text = txt,sep = ",", header = TRUE)
# transforming the date strings into dates
dates <- as.POSIXct(strptime(df$Date, "%d-%m-%Y %H:%M:%S"))
# creating an hourly time sequence from start to end
total_time <- seq(from = floor_date(min(dates), "hour"),
                  to = ceiling_date(max(dates), "hour"),
                  by = "hour")

# in case there is more than one occurrence per interval  
count <-  sapply(total_time, function(x) {       
          sum(floor_date(dates,"hour") %in% x) })

data.frame(Date = strftime(total_time, format = "%d-%m-%Y"),
           hours = paste(strftime(total_time, format = "%H:%M"), 
                    strftime(total_time + 60*60, format="%H:%M"),         
                    sep = " - "),
           Count = count)

#          Date         hours Count
# 1  01-01-2017 00:00 - 01:00     1
# 2  01-01-2017 01:00 - 02:00     1
# 3  01-01-2017 02:00 - 03:00     1
# 4  01-01-2017 03:00 - 04:00     0
# 5  01-01-2017 04:00 - 05:00     0
# 6  01-01-2017 05:00 - 06:00     1
# 7  01-01-2017 06:00 - 07:00     0
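
Note that the sequence above runs only from the first to the last observed hour, so the last day does not get all 24 buckets. If you need every date padded to a full day, something along these lines might work. This is only a sketch in base R: the names `stamps`, `grid`, `key_obs` and `key_grid` are made up for illustration, and parsing in UTC is my assumption to sidestep daylight-saving gaps.

```r
# Timestamps from the question (first one taken as 00:39:00, as assumed above)
stamps <- c("01-01-2017 00:39:00", "01-01-2017 01:39:00", "01-01-2017 02:39:00",
            "01-01-2017 05:39:00", "01-01-2017 17:39:00", "02-01-2017 02:30:34")

# Parse in UTC (assumption) so no hour is skipped or repeated
dates <- as.POSIXct(stamps, format = "%d-%m-%Y %H:%M:%S", tz = "UTC")

# Every calendar day between the first and last observation
days <- seq(as.Date(min(dates)), as.Date(max(dates)), by = "day")

# One row per day/hour combination: 24 buckets for every day
grid <- expand.grid(hour = 0:23, day = days)

# Build a "day hour" key for observations and for grid cells,
# then count how many observations fall on each grid key
key_obs  <- paste(format(dates, "%Y-%m-%d"), as.integer(format(dates, "%H")))
key_grid <- paste(format(grid$day), grid$hour)
grid$Count <- as.vector(table(factor(key_obs, levels = key_grid)))

# Format to the layout requested in the question
result <- data.frame(
  Date  = format(grid$day, "%d-%m-%Y"),
  hours = sprintf("%02d:00 - %02d:00", grid$hour, (grid$hour + 1L) %% 24L),
  Count = grid$Count
)
```

`result` then has 24 rows per date, with zero counts for empty hours, including the `23:00 - 00:00` bucket.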

Answer 1: (score: 1)

The tidyverse provides some useful functions for this, such as count / tally and complete.

library(tidyverse)
library(lubridate)

dat <- read_csv('Date, ID      
  01-01-2017 12:39:00, CDF
  01-01-2017 01:39:00, WED
  01-01-2017 02:39:00, QWE
  01-01-2017 05:39:00, TYU
  01-01-2017 17:39:00, ERT
  02-01-2017 02:30:34, DEF'
) 

dat %>% 
   mutate(
       Date = dmy_hms(Date),
       day = floor_date(Date, 'day'), 
       hour = hour(Date)
   ) %>%
   group_by(day, hour) %>%
   tally %>%
   complete(day, hour = 0:23, fill = list('n' = 0))


## A tibble: 48 x 3
## Groups:   day [2]
#          day  hour     n
#       <dttm> <int> <dbl>
# 1 2017-01-01     0     0
# 2 2017-01-01     1     1
# 3 2017-01-01     2     1
# 4 2017-01-01     3     0
# 5 2017-01-01     4     0
# 6 2017-01-01     5     1
# 7 2017-01-01     6     0
# 8 2017-01-01     7     0
# 9 2017-01-01     8     0
#10 2017-01-01     9     0
## ... with 38 more rows
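
If you want the exact layout from the question (a date column plus an hours label), a variation like the following might do. It is a sketch, not a tested drop-in: it uses `count()` instead of `group_by()` plus `tally()` so the result is ungrouped before `complete()`, because as far as I know newer tidyr releases refuse to complete on a grouping column. The names `hourly` and `Count` are my own choices.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Data from the question, kept as posted (12:39:00 parses as hour 12)
dat <- tibble(
  Date = c("01-01-2017 12:39:00", "01-01-2017 01:39:00", "01-01-2017 02:39:00",
           "01-01-2017 05:39:00", "01-01-2017 17:39:00", "02-01-2017 02:30:34"),
  ID   = c("CDF", "WED", "QWE", "TYU", "ERT", "DEF")
)

hourly <- dat %>%
  mutate(
    Date = dmy_hms(Date),   # parsed as UTC by default
    day  = as_date(Date),
    hour = hour(Date)
  ) %>%
  # count() returns an ungrouped tibble with one row per observed day/hour
  count(day, hour, name = "Count") %>%
  # add the missing hours for every observed day, with a zero count
  complete(day, hour = 0:23, fill = list(Count = 0L)) %>%
  mutate(
    Date  = format(day, "%d-%m-%Y"),
    hours = sprintf("%02d:00 - %02d:00", hour, (hour + 1L) %% 24L)
  ) %>%
  select(Date, hours, Count)
```

Note that `complete()` only pads the days that appear in the data; a date with no observations at all would still be missing unless you pass an explicit sequence of days.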