Count by group per hour

Time: 2019-11-12 09:48:39

Tags: r dplyr time-series aggregate tidyverse

I want to get the count per hour for each type (version1, version2).

Sample data:

type <- c('version1','version1','version1','version2','version2')

startdate <- as.POSIXct(c('2017-11-1 02:11:02.000','2018-3-25 02:13:02.000','2019-3-14 03:45:02.000',
                          '2017-3-14 02:55:02.000','2018-3-14 03:45:02.000'))


df <- data.frame(type, startdate)

df

      type           startdate
1 version1 2017-11-01 02:11:02
2 version1 2018-03-25 02:13:02
3 version1 2019-03-14 03:45:02
4 version2 2017-03-14 02:55:02
5 version2 2018-03-14 03:45:02

In this df, version1 has two counts at 02h and one count at 03h.

version2 has one count at 02h and one count at 03h.

Desired output:

   hour version1 version2
1 00:00        0        0
2 01:00        0        0
3 02:00        2        1
4 03:00        1        1

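3 answers:

Answer 0 (score: 4)

We can first extract the hour from startdate, then count the number of rows for each hour and type. complete adds the missing hours and fills their counts with 0, and pivot_wider then reshapes the data into wide format.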

library(dplyr)
library(tidyr)

df %>%
  mutate(hr = lubridate::hour(startdate)) %>%
  count(hr, type) %>%
  complete(type, hr = seq(0, max(hr)), fill = list(n = 0)) %>%
  pivot_wider(names_from = type, values_from = n)

# A tibble: 4 x 3
#     hr version1 version2
#  <int>    <dbl>    <dbl>
#1     0        0        0
#2     1        0        0
#3     2        2        1
#4     3        1        1
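
The desired output in the question labels hours as "00:00", "01:00" and so on, whereas the pipeline above returns an integer hr column. A minimal sketch of one way to produce those labels follows. Two things here are assumptions and not part of the original answer: tz = "UTC" is set so the result does not depend on the local time zone, and the name out is only illustrative.

```r
library(dplyr)
library(tidyr)

type <- c('version1','version1','version1','version2','version2')

# tz = "UTC" is an assumption; it keeps the parsed hours independent of the local time zone
startdate <- as.POSIXct(c('2017-11-1 02:11:02.000','2018-3-25 02:13:02.000',
                          '2019-3-14 03:45:02.000','2017-3-14 02:55:02.000',
                          '2018-3-14 03:45:02.000'), tz = "UTC")

df <- data.frame(type, startdate)

out <- df %>%
  # hour of day as an integer, taken from the formatted timestamp
  mutate(hr = as.integer(format(startdate, "%H"))) %>%
  count(hr, type) %>%
  # add every hour from 0 up to the latest observed hour, with a count of 0
  complete(type, hr = 0:max(hr), fill = list(n = 0)) %>%
  pivot_wider(names_from = type, values_from = n) %>%
  # turn the integer hour into an "HH:00" label and drop the helper column
  mutate(hour = sprintf("%02d:00", as.integer(hr)), .before = 1) %>%
  select(-hr)

out
```

Only the last two steps differ from the answer's pipeline; the counting and zero-filling are unchanged.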

Answer 1 (score: 1)

Something was off with the startdate variable, so I set it up with the lubridate package:

library(dplyr)
library(tidyr)

type = c('version1','version1','version1','version2','version2')

startdate = lubridate::ymd_hms(c('2017-11-1T02:11:02.000','2018-3-25T02:13:02.000',
                                 '2019-3-14T03:45:02.000','2017-3-14T02:55:02.000',
                                 '2018-3-14T03:45:02.000'))

tibble(type = type, startdate = startdate) %>%
  count(type, hour = lubridate::hour(startdate)) %>%
  spread(type, n)

# A tibble: 2 x 3
   hour version1 version2
  <int>    <int>    <int>
1     2        2        1
2     3        1        1

Answer 2 (score: 1)

Base R solution:

# Extract the hour and store it as a character vector such as "02:00":

df$hour <- format(df$startdate, "%H:00")

# Count the number of observations of each type in each hour:

counts <- xtabs(~ hour + type, data = df)

# Reshape the contingency table into a wide data frame, one column per type:

res <- as.data.frame.matrix(counts)

# Store the row names as an "hour" column and then drop the row names:

res <- data.frame(hour = row.names(res), res, row.names = NULL)
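
A base R result only contains the hours that actually occur in the data, while the desired output also lists 00:00 and 01:00 with zero counts. One possible way to get those rows, sketched here under assumptions, is to turn the hour into a factor whose levels cover every hour from 0 to the latest observed one. The tz = "UTC" argument and the names hours, hour_f and wide are illustrative choices, not part of the original answer.

```r
type <- c('version1','version1','version1','version2','version2')

# tz = "UTC" is an assumption; it keeps the parsed hours independent of the local time zone
startdate <- as.POSIXct(c('2017-11-1 02:11:02.000','2018-3-25 02:13:02.000',
                          '2019-3-14 03:45:02.000','2017-3-14 02:55:02.000',
                          '2018-3-14 03:45:02.000'), tz = "UTC")

df <- data.frame(type, startdate)

# Hour of day as an integer
hours <- as.integer(format(df$startdate, "%H"))

# Factor with one level per hour from 0 to the latest observed hour,
# so that table() also reports hours with no observations
hour_f <- factor(hours,
                 levels = 0:max(hours),
                 labels = sprintf("%02d:00", 0:max(hours)))

# Cross-tabulate hour by type and convert to a wide data frame
counts <- table(hour = hour_f, type = df$type)
wide <- data.frame(hour = rownames(counts),
                   as.data.frame.matrix(counts),
                   row.names = NULL)

wide
```

Because table() counts every factor level, unobserved hours appear with a count of 0 without any separate fill step.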