如何每天从每小时大型数据集中按因子变量水平提取平均值并估算其他统计信息

时间:2019-07-10 14:00:34

标签: r data.table tidyverse lubridate

数据帧df1总结了人们在特定时间段内进入公厕的不同日期时间(例如,在“ 2017-06-01”和“ 2017-06-30”之间)。 Zone列指定了放置洗手间的区域,分为两个层次:A(聚会区域)或B(居住区域)。

我在下面显示了我所拥有的可复制示例。本示例仅包含两天时间以减小示例数据集的大小。为了创建df1,我必须先创建4个单独的数据框,然后将它们绑定以创建数据框df1(尝试一次创建df1时出现错误)。 df1有193行。

options(digits.secs=3)
day_1_A<- data.frame(Datetime= ymd_hms(c("2017-06-01 00:04:17.986","2017-06-01 00:17:43.456","2017-06-01 00:22:43.456","2017-06-01 00:34:43.456","2017-06-01 00:45:43.456","2017-06-01 01:15:23.275","2017-06-01 01:41:32.609","2017-06-01 02:04:17.986","2017-06-01 02:17:43.456","2017-06-01 03:15:23.275","2017-06-01 03:41:32.609","2017-06-01 04:04:17.986","2017-06-01 04:17:43.456","2017-06-01 05:15:23.275","2017-06-01 05:41:32.609","2017-06-01 06:04:17.986","2017-06-01 06:17:43.456","2017-06-01 07:15:23.275","2017-06-01 07:41:32.609","2017-06-01 08:04:17.986","2017-06-01 08:17:43.456","2017-06-01 09:15:23.275","2017-06-01 09:41:32.609","2017-06-01 10:04:17.986","2017-06-01 10:17:43.456","2017-06-01 11:15:23.275","2017-06-01 11:41:32.609","2017-06-01 12:04:17.986","2017-06-01 12:17:43.456","2017-06-01 13:15:23.275","2017-06-01 13:41:32.609","2017-06-01 14:04:17.986","2017-06-01 14:17:43.456","2017-06-01 15:17:23.275","2017-06-01 15:41:32.609","2017-06-01 16:04:17.986","2017-06-01 16:17:43.456","2017-06-01 17:15:23.275","2017-06-01 17:41:32.609","2017-06-01 18:04:17.986","2017-06-01 18:17:43.456","2017-06-01 19:15:23.275","2017-06-01 19:41:32.609","2017-06-01 20:04:17.986","2017-06-01 20:17:43.456","2017-06-01 21:15:23.275","2017-06-01 21:41:32.609","2017-06-01 22:04:17.986","2017-06-01 22:17:43.456","2017-06-01 23:15:23.275","2017-06-01 23:41:32.609")),
                 ToiletZone = c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A"))

day_1_B<- data.frame(Datetime= ymd_hms(c("2017-06-01 00:04:17.986","2017-06-01 00:17:43.456","2017-06-01 01:15:23.275","2017-06-01 01:41:32.609","2017-06-01 02:04:17.986","2017-06-01 02:17:43.456","2017-06-01 03:15:23.275","2017-06-01 03:41:32.609","2017-06-01 04:04:17.986","2017-06-01 04:17:43.456","2017-06-01 05:15:23.275","2017-06-01 05:41:32.609","2017-06-01 06:04:17.986","2017-06-01 06:17:43.456","2017-06-01 07:15:23.275","2017-06-01 07:41:32.609","2017-06-01 08:04:17.986","2017-06-01 08:17:43.456","2017-06-01 09:15:23.275","2017-06-01 09:41:32.609","2017-06-01 10:04:17.986","2017-06-01 10:17:43.456","2017-06-01 11:15:23.275","2017-06-01 11:41:32.609","2017-06-01 12:04:17.986","2017-06-01 12:17:43.456","2017-06-01 13:15:23.275","2017-06-01 13:41:32.609","2017-06-01 14:04:17.986","2017-06-01 14:17:43.456","2017-06-01 15:15:23.275","2017-06-01 15:41:32.609","2017-06-01 16:04:17.986","2017-06-01 16:17:43.456","2017-06-01 17:15:23.275","2017-06-01 17:41:32.609","2017-06-01 18:04:17.986","2017-06-01 18:17:43.456","2017-06-01 19:15:23.275","2017-06-01 19:41:32.609","2017-06-01 20:04:17.986","2017-06-01 20:17:43.456","2017-06-01 21:15:23.275","2017-06-01 21:41:32.609","2017-06-01 22:04:17.986","2017-06-01 22:17:43.456","2017-06-01 23:15:23.275","2017-06-01 23:41:32.609")),
                 ToiletZone = c("B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B"))

day_2_A<- data.frame(Datetime= ymd_hms(c("2017-06-02 00:17:43.456","2017-06-02 00:48:43.456","2017-06-02 01:15:23.275","2017-06-02 01:52:23.275","2017-06-02 02:04:17.986","2017-06-02 02:17:43.456","2017-06-02 03:15:23.275","2017-06-02 03:41:32.609","2017-06-02 04:04:17.986","2017-06-02 04:17:43.456","2017-06-02 05:15:23.275","2017-06-02 05:41:32.609","2017-06-02 06:04:17.986","2017-06-02 06:17:43.456","2017-06-02 07:15:23.275","2017-06-02 07:41:32.609","2017-06-02 08:04:17.986","2017-06-02 08:17:43.456","2017-06-02 09:15:23.275","2017-06-02 09:41:32.609","2017-06-02 10:04:17.986","2017-06-02 10:17:43.456","2017-06-02 11:15:23.275","2017-06-02 11:41:32.609","2017-06-02 12:04:17.986","2017-06-02 12:17:43.456","2017-06-02 13:15:23.275","2017-06-02 13:41:32.609","2017-06-02 14:04:17.986","2017-06-02 14:17:43.456","2017-06-02 15:15:23.275","2017-06-02 15:41:32.609","2017-06-02 16:04:17.986","2017-06-02 16:17:43.456","2017-06-02 17:15:23.275","2017-06-02 17:41:32.609","2017-06-02 18:04:17.986","2017-06-02 18:17:43.456","2017-06-02 19:15:23.275","2017-06-02 19:41:32.609","2017-06-02 20:04:17.986","2017-06-02 20:17:43.456","2017-06-02 21:15:23.275","2017-06-02 21:41:32.609","2017-06-02 22:04:17.986","2017-06-02 22:17:43.456","2017-06-02 23:15:23.275","2017-06-02 23:41:32.609")),
                 ToiletZone = c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A"))

day_2_B<- data.frame(Datetime= ymd_hms(c("2017-06-02 00:04:17.986","2017-06-02 01:15:23.275","2017-06-02 02:04:17.986","2017-06-02 02:17:43.456","2017-06-02 03:15:23.275","2017-06-02 03:41:32.609","2017-06-02 04:04:17.986","2017-06-02 04:17:43.456","2017-06-02 05:15:23.275","2017-06-02 05:41:32.609","2017-06-02 06:04:17.986","2017-06-02 06:17:43.456","2017-06-02 07:15:23.275","2017-06-02 07:41:32.609","2017-06-02 08:04:17.986","2017-06-02 08:17:43.456","2017-06-02 09:15:23.275","2017-06-02 09:41:32.609","2017-06-02 10:04:17.986","2017-06-02 10:17:43.456","2017-06-02 11:15:23.275","2017-06-02 11:41:32.609","2017-06-02 12:04:17.986","2017-06-02 12:17:43.456","2017-06-02 13:15:23.275","2017-06-02 13:41:32.609","2017-06-02 14:04:17.986","2017-06-02 14:17:43.456","2017-06-02 15:15:23.275","2017-06-02 15:41:32.609","2017-06-02 16:04:17.986","2017-06-02 16:17:43.456","2017-06-02 17:15:23.275","2017-06-02 17:41:32.609","2017-06-02 18:04:17.986","2017-06-02 18:17:43.456","2017-06-02 19:15:23.275","2017-06-02 19:41:32.609","2017-06-02 20:04:17.986","2017-06-02 20:17:43.456","2017-06-02 21:15:23.275","2017-06-02 21:41:32.609","2017-06-02 22:04:17.986","2017-06-02 22:17:43.456","2017-06-02 23:15:23.275","2017-06-02 23:41:32.609")),
                 ToiletZone = c("B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B"))


df1<- rbind(day_1_A,day_1_B,day_2_A,day_2_B)
df1

> df1
                   Datetime ToiletZone
1   2017-06-01 00:04:17.986          A
2   2017-06-01 00:17:43.455          A
3   2017-06-01 00:22:43.455          A
4   2017-06-01 00:34:43.455          A
5   2017-06-01 00:45:43.455          A
6   2017-06-01 01:15:23.275          A
.               .                    .
.               .                    .
.               .                    .
193 2017-06-02 23:41:32.608          B

由于某些原因,我在这里不做解释,我需要为每个天和每个区域计算一个称为θ的统计量,该统计量可以定义为“平均小时数除以通过“整个感兴趣期间的平均每小时访问次数”(Hourly_daily_μ)来访问“白天上厕所”(Overall_hourly_μ)。

我在图片中显示了对上一个示例的期望(合并了Hourly_daily_μ_AHourly_daily_μ_BOverall_hourly_μ_AOverall_hourly_μ_A列以澄清计算。我真正需要的列是θ_Aθ_B): enter image description here

为什么Hourly_daily_μ_A在2017-06-01是51/24?因为这一天有51个人上厕所。因此,如果我们将24岁之间的人数相除,便得出了今天去厕所的人的小时均值。

为什么Overall_hourly_μ_A在不同日期的每个区域都相同?因为这是每个区域的总体平均值。在这里,我们想知道每小时上厕所的人的平均水平是多少。在此示例中,我们知道在6月1日至6月2日之间,A区有99人上厕所。因此,我们将其除以总小时数(在本示例中为48小时),然后得出总体小时均值在A区中上厕所的人数。每个区的值都是唯一的。

为什么θ_A在2017-06-01上是(51 * 48)/(24 * 99)?因为是{{1}(51/24)除以Hourly_daily_μ_A(99/48)的结果。

有人知道怎么做吗?我的数据帧很大,因此我认为包Overall_hourly_μ_A可能是一个不错的选择。

2 个答案:

答案 0 :(得分:1)

一个选项是按频率计数分组,进行一些计算以获得期望的输出

library(dplyr)
library(tidyr)
library(lubridate)
df1 %>% 
  mutate(Date = floor_date(Datetime, "hour")) %>% 
  group_by(ToiletZone, Date) %>% 
  mutate(hourlyCount = n(), HourlyAvg = hourlyCount/24) %>% 
  group_by(ToiletZone) %>% 
  mutate(Total = sum(hourlyCount)/ n() * 24) %>% 
  group_by(Date = as.Date(Date), add = TRUE) %>% 
  summarise(Theta = hourlyCount[1]/Total[1]) %>%
  spread(ToiletZone, Theta)

答案 1 :(得分:1)

我认为您只需要将日期设为天单位,然后就可以将其用于分组。 使用data.table

setDT(df1)

df1[, Date := floor_date(Datetime, "day")]

daily <- df1[, .(DailyCount = .N, DailyAvg = .N / 24), by = .(ToiletZone, Date)]
overall <- daily[, .(Total = sum(DailyCount) / (.N * 24)), by = .(ToiletZone)]

overall[daily, .(ToiletZone, Date, Theta = DailyAvg / Total), on = "ToiletZone"]
   ToiletZone       Date     Theta
1:          A 2017-06-01 1.0303030
2:          B 2017-06-01 1.0212766
3:          A 2017-06-02 0.9696970
4:          B 2017-06-02 0.9787234

每小时的时间差不多, 只需更改floor_date并调整一些分母:

df1[, Date := floor_date(Datetime, "hour")]

hourly <- df1[, .(HourlyCount = .N), by = .(ToiletZone, Date)]
overall <- hourly[, .(Total = sum(HourlyCount) / .N), by = "ToiletZone"]

ans <- overall[hourly, .(ToiletZone, Date, Theta = HourlyCount / Total), on = "ToiletZone"]

顺便说一句,最后几行是联接, 您可以将它们分别视为左连接, dailyhourly作为左侧表格。