How do I aggregate POSIXlt by day to find daily peaks?

Time: 2017-08-09 07:47:18

Tags: r date select aggregate threshold

I have a variant of this question: Count values in a data set that exceed a threshold in R

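I take temperature measurements at more or less random intervals. I want to know the number of days (within some time range) on which a specific threshold was exceeded. Obviously, without aggregation I can get multiple hits for the same day (if the threshold is exceeded more than once), and I don't want that.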

A short example of the original data frame looks like this:

                 Time Temp Humidity Notes
1 2015-05-18 16:00:00 26.5       NA  <NA>
2 2015-06-01 15:00:00 26.5       NA  <NA>
3 2015-06-02 16:00:00 28.0       NA  <NA>
4 2015-06-03 16:00:00 28.0       NA  <NA>
5 2015-06-03 17:00:00 30.0       60  <NA>
6 2015-06-05 07:00:00 23.0       NA  <NA>

So I computed a Day variable (POSIXlt):

1 2015-05-18 16:00:00 26.5       NA  <NA> 2015-05-18
2 2015-06-01 15:00:00 26.5       NA  <NA> 2015-06-01
3 2015-06-02 16:00:00 28.0       NA  <NA> 2015-06-02
4 2015-06-03 16:00:00 28.0       NA  <NA> 2015-06-03
5 2015-06-03 17:00:00 30.0       60  <NA> 2015-06-03
6 2015-06-05 07:00:00 23.0       NA  <NA> 2015-06-05

I tried, almost desperately, to aggregate by day (I am not showing all the variants I tried):

with(t, aggregate(Temp ~ Day, data=t, FUN=max))
Error in model.frame.default(formula = Temp ~ Day, data = t) : 
  invalid type (list) for variable 'Day'

It only works when I explicitly convert the POSIXlt to POSIXct (why is an object of class POSIXlt treated as a list by aggregate?):

> with(t, aggregate(Temp ~ as.POSIXct(Day), data=t, FUN=max))
  as.POSIXct(Day) Temp
1      2015-05-18 26.5
2      2015-06-01 26.5
3      2015-06-02 28.0
4      2015-06-03 30.0
5      2015-06-05 23.0

Unfortunately, I lose the other columns during aggregation. How can I keep them?

I also don't understand this:

> tt <-with(t, aggregate(Temp ~ as.POSIXct(Day), data=t, FUN=max))
> tt
  as.POSIXct(Day) Temp
1      2015-05-18 26.5
2      2015-06-01 26.5
3      2015-06-02 28.0
4      2015-06-03 30.0
5      2015-06-05 23.0
> str(tt)
'data.frame':   5 obs. of  2 variables:
 $ as.POSIXct(Day): POSIXct, format: "2015-05-18" "2015-06-01" ...
 $ Temp           : num  26.5 26.5 28 30 23
> tt$Temp > 25
[1]  TRUE  TRUE  TRUE  TRUE FALSE
> tt[tt$Temp > 25]
Error in `[.data.frame`(tt, tt$Temp > 25) : undefined columns selected
> tt[tt$Temp > 25,]
  as.POSIXct(Day) Temp
1      2015-05-18 26.5
2      2015-06-01 26.5
3      2015-06-02 28.0
4      2015-06-03 30.0
> t$Temp > 25
[1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
> t[t$Temp > 25]
                 Time Temp Humidity Notes        Day
1 2015-05-18 16:00:00 26.5       NA  <NA> 2015-05-18
2 2015-06-01 15:00:00 26.5       NA  <NA> 2015-06-01
3 2015-06-02 16:00:00 28.0       NA  <NA> 2015-06-02
4 2015-06-03 16:00:00 28.0       NA  <NA> 2015-06-03
5 2015-06-03 17:00:00 30.0       60  <NA> 2015-06-03
6 2015-06-05 07:00:00 23.0       NA  <NA> 2015-06-05

Why does aggregate() change the structure of t? Can someone explain what I am missing?

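For reference, here is the sample data set in dput() format (with an additional difftime variable, Tim, that holds the offset of the measurement from the start of the day):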

> dput(t)
structure(list(Time = structure(list(sec = c(0, 0, 0, 0, 0, 0
), min = c(0L, 0L, 0L, 0L, 0L, 0L), hour = c(16L, 15L, 16L, 16L, 
17L, 7L), mday = c(18L, 1L, 2L, 3L, 3L, 5L), mon = c(4L, 5L, 
5L, 5L, 5L, 5L), year = c(115L, 115L, 115L, 115L, 115L, 115L), 
    wday = c(1L, 1L, 2L, 3L, 3L, 5L), yday = c(137L, 151L, 152L, 
    153L, 153L, 155L), isdst = c(1L, 1L, 1L, 1L, 1L, 1L), zone = c("CEST", 
    "CEST", "CEST", "CEST", "CEST", "CEST"), gmtoff = c(NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
    )), .Names = c("sec", "min", "hour", "mday", "mon", "year", 
"wday", "yday", "isdst", "zone", "gmtoff"), class = c("POSIXlt", 
"POSIXt")), Temp = c(26.5, 26.5, 28, 28, 30, 23), Humidity = c(NA, 
NA, NA, NA, 60, NA), Notes = c(NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_), 
    Day = structure(list(sec = c(0, 0, 0, 0, 0, 0), min = c(0L, 
    0L, 0L, 0L, 0L, 0L), hour = c(0L, 0L, 0L, 0L, 0L, 0L), mday = c(18L, 
    1L, 2L, 3L, 3L, 5L), mon = c(4L, 5L, 5L, 5L, 5L, 5L), year = c(115L, 
    115L, 115L, 115L, 115L, 115L), wday = c(1L, 1L, 2L, 3L, 3L, 
    5L), yday = c(137L, 151L, 152L, 153L, 153L, 155L), isdst = c(-1L, 
    -1L, -1L, -1L, -1L, -1L), zone = c("CEST", "CEST", "CEST", 
    "CEST", "CEST", "CEST"), gmtoff = c(NA_integer_, NA_integer_, 
    NA_integer_, NA_integer_, NA_integer_, NA_integer_)), .Names = c("sec", 
    "min", "hour", "mday", "mon", "year", "wday", "yday", "isdst", 
    "zone", "gmtoff"), class = c("POSIXlt", "POSIXt")), Tim = structure(c(16, 
    15, 16, 16, 17, 7), class = "difftime", units = "hours")), .Names = c("Time", 
"Temp", "Humidity", "Notes", "Day", "Tim"), row.names = c(NA, 
6L), class = "data.frame")

1 Answer:

Answer 0 (score: 1)

Here are three solutions: one using dplyr, one using data.table, and one using a split-lapply combination without any packages.

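In each case, first convert Time to POSIXct and Day to Date. (Here I call the data set from your example sample, and I choose 27 as the cutoff so that we have a day with two matching rows.)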

sample$Time <- as.POSIXct(sample$Time)
sample$Day  <- as.Date(sample$Day)
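
As for why the conversion is needed: a POSIXlt value is stored as a list of components (sec, min, hour, and so on), which is most likely why model.frame rejects it with "invalid type (list)", while POSIXct and Date are plain numeric vectors with a class attribute. A minimal sketch illustrating the difference, where the names lt and ct and the UTC time zone are assumptions made only for this example:

```r
# A single timestamp in both representations (UTC is assumed for illustration)
lt <- as.POSIXlt("2015-06-03 17:00:00", tz = "UTC")
ct <- as.POSIXct(lt)

is.list(lt)   # TRUE: POSIXlt is a list of date-time components
typeof(ct)    # "double": POSIXct is seconds since the epoch
lt$hour       # 17: individual components are directly accessible
as.Date(lt)   # the calendar day, stored as a Date
```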

With dplyr, the verbs speak for themselves, which is why this is my favorite solution:

require(dplyr)

result <- sample %>% 
          group_by(Day) %>% 
          summarise(greater27=max(Temp > 27))

result
# # A tibble: 5 x 2
#          Day greater27
#       <date>     <int>
# 1 2015-05-18         0
# 2 2015-06-01         0
# 3 2015-06-02         1
# 4 2015-06-03         1
# 5 2015-06-05         0
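
The question also asks how to keep the other columns. One possible approach, sketched here under assumptions rather than taken from the original answer, is to filter each day down to its peak row instead of summarising. The data frame demo is a hypothetical, reduced copy of the sample data (UTC assumed), and peaks is an illustrative name:

```r
library(dplyr)

# Hypothetical reduced copy of the sample data, already in POSIXct
demo <- data.frame(
  Time = as.POSIXct(c("2015-05-18 16:00:00", "2015-06-01 15:00:00",
                      "2015-06-02 16:00:00", "2015-06-03 16:00:00",
                      "2015-06-03 17:00:00", "2015-06-05 07:00:00"),
                    tz = "UTC"),
  Temp = c(26.5, 26.5, 28, 28, 30, 23),
  Humidity = c(NA, NA, NA, NA, 60, NA)
)
demo$Day <- as.Date(demo$Time)

# Keep the complete row(s) holding each day's maximum temperature
peaks <- demo %>%
  group_by(Day) %>%
  filter(Temp == max(Temp)) %>%
  ungroup()

peaks
```

Note that if two measurements on the same day tie for the maximum, this keeps both rows.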

With data.table, leave the first argument empty to select all rows, do the computation in the second argument, and pass by as a named argument:

require(data.table)

sample <- data.table(sample)
result <- sample[, .(greater27=max(Temp > 27)), by="Day"]

result
#           Day greater27
# 1: 2015-05-18         0
# 2: 2015-06-01         0
# 3: 2015-06-02         1
# 4: 2015-06-03         1
# 5: 2015-06-05         0

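With base R only, split the data set by Day, which gives you a list of data.frames, then apply an anonymous function to each, and finally rbind everything back together into a single data.frame: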

result <- do.call(rbind, 
                  lapply(split(sample, sample$Day),
                         function(x){
                           data.frame(
                             Day = x$Day[1],
                             greater27 = max(x$Temp > 27)
                           )
                         }
                    )
            )

result
#                   Day greater27
# 2015-05-18 2015-05-18         0
# 2015-06-01 2015-06-01         0
# 2015-06-02 2015-06-02         1
# 2015-06-03 2015-06-03         1
# 2015-06-05 2015-06-05         0
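
To get from the per-day result to the count the question asks for, and to address the subsetting puzzle, here is a rough base R sketch. It assumes a hypothetical reduced copy of the data named demo, with illustrative names daily_max and hot_days. A single index without a comma, as in tt[tt$Temp > 25], selects columns of a data frame; tt has only two columns, so a logical vector of length five fails, while t happens to have six columns, so t[t$Temp > 25] silently returns the first five columns. Adding the comma selects rows instead:

```r
# Hypothetical reduced copy of the sample data with Day stored as Date
demo <- data.frame(
  Day  = as.Date(c("2015-05-18", "2015-06-01", "2015-06-02",
                   "2015-06-03", "2015-06-03", "2015-06-05")),
  Temp = c(26.5, 26.5, 28, 28, 30, 23)
)

# Daily peak temperature; aggregate() accepts Date without further conversion
daily_max <- aggregate(Temp ~ Day, data = demo, FUN = max)

# The comma makes the logical vector select rows rather than columns
hot_days <- daily_max[daily_max$Temp > 27, ]

# Number of days on which the threshold was exceeded
nrow(hot_days)
```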