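I have a variant of this question: Count values in a data set that exceed a threshold in R.
I take temperature measurements at more or less random intervals. I want to know the number of days (within some time range) on which a given threshold was exceeded. Obviously, without aggregation I can get several hits for the same day (if the threshold was exceeded more than once), and I don't want that.
A short sample of the original data frame looks like this: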
Time Temp Humidity Notes
1 2015-05-18 16:00:00 26.5 NA <NA>
2 2015-06-01 15:00:00 26.5 NA <NA>
3 2015-06-02 16:00:00 28.0 NA <NA>
4 2015-06-03 16:00:00 28.0 NA <NA>
5 2015-06-03 17:00:00 30.0 60 <NA>
6 2015-06-05 07:00:00 23.0 NA <NA>
So I computed a Day variable (POSIXlt):
1 2015-05-18 16:00:00 26.5 NA <NA> 2015-05-18
2 2015-06-01 15:00:00 26.5 NA <NA> 2015-06-01
3 2015-06-02 16:00:00 28.0 NA <NA> 2015-06-02
4 2015-06-03 16:00:00 28.0 NA <NA> 2015-06-03
5 2015-06-03 17:00:00 30.0 60 <NA> 2015-06-03
6 2015-06-05 07:00:00 23.0 NA <NA> 2015-06-05
I tried, almost desperately, to aggregate by day (I am not showing every variant I tried):
with(t, aggregate(Temp ~ Day, data=t, FUN=max))
Error in model.frame.default(formula = Temp ~ Day, data = t) :
invalid type (list) for variable 'Day'
It only works when I explicitly convert the POSIXlt to POSIXct (why is an object of class POSIXlt treated as a list by aggregate?):
> with(t, aggregate(Temp ~ as.POSIXct(Day), data=t, FUN=max))
as.POSIXct(Day) Temp
1 2015-05-18 26.5
2 2015-06-01 26.5
3 2015-06-02 28.0
4 2015-06-03 30.0
5 2015-06-05 23.0
Unfortunately, I lose the other columns during aggregation. How can I keep them?
I also don't understand this:
> tt <-with(t, aggregate(Temp ~ as.POSIXct(Day), data=t, FUN=max))
> tt
as.POSIXct(Day) Temp
1 2015-05-18 26.5
2 2015-06-01 26.5
3 2015-06-02 28.0
4 2015-06-03 30.0
5 2015-06-05 23.0
> str(tt)
'data.frame': 5 obs. of 2 variables:
$ as.POSIXct(Day): POSIXct, format: "2015-05-18" "2015-06-01" ...
$ Temp : num 26.5 26.5 28 30 23
> tt$Temp > 25
[1] TRUE TRUE TRUE TRUE FALSE
> tt[tt$Temp > 25]
Error in `[.data.frame`(tt, tt$Temp > 25) : undefined columns selected
> tt[tt$Temp > 25,]
as.POSIXct(Day) Temp
1 2015-05-18 26.5
2 2015-06-01 26.5
3 2015-06-02 28.0
4 2015-06-03 30.0
> t$Temp > 25
[1] TRUE TRUE TRUE TRUE TRUE FALSE
> t[t$Temp > 25]
Time Temp Humidity Notes Day
1 2015-05-18 16:00:00 26.5 NA <NA> 2015-05-18
2 2015-06-01 15:00:00 26.5 NA <NA> 2015-06-01
3 2015-06-02 16:00:00 28.0 NA <NA> 2015-06-02
4 2015-06-03 16:00:00 28.0 NA <NA> 2015-06-03
5 2015-06-03 17:00:00 30.0 60 <NA> 2015-06-03
6 2015-06-05 07:00:00 23.0 NA <NA> 2015-06-05
Why does aggregate() change the structure of t? Can someone explain what I am missing?
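For reference, here is the sample data set in dput() format (with an additional variable Tim (difftime) that holds each measurement's offset from the start of the day):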
> dput(t)
structure(list(Time = structure(list(sec = c(0, 0, 0, 0, 0, 0
), min = c(0L, 0L, 0L, 0L, 0L, 0L), hour = c(16L, 15L, 16L, 16L,
17L, 7L), mday = c(18L, 1L, 2L, 3L, 3L, 5L), mon = c(4L, 5L,
5L, 5L, 5L, 5L), year = c(115L, 115L, 115L, 115L, 115L, 115L),
wday = c(1L, 1L, 2L, 3L, 3L, 5L), yday = c(137L, 151L, 152L,
153L, 153L, 155L), isdst = c(1L, 1L, 1L, 1L, 1L, 1L), zone = c("CEST",
"CEST", "CEST", "CEST", "CEST", "CEST"), gmtoff = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
)), .Names = c("sec", "min", "hour", "mday", "mon", "year",
"wday", "yday", "isdst", "zone", "gmtoff"), class = c("POSIXlt",
"POSIXt")), Temp = c(26.5, 26.5, 28, 28, 30, 23), Humidity = c(NA,
NA, NA, NA, 60, NA), Notes = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_),
Day = structure(list(sec = c(0, 0, 0, 0, 0, 0), min = c(0L,
0L, 0L, 0L, 0L, 0L), hour = c(0L, 0L, 0L, 0L, 0L, 0L), mday = c(18L,
1L, 2L, 3L, 3L, 5L), mon = c(4L, 5L, 5L, 5L, 5L, 5L), year = c(115L,
115L, 115L, 115L, 115L, 115L), wday = c(1L, 1L, 2L, 3L, 3L,
5L), yday = c(137L, 151L, 152L, 153L, 153L, 155L), isdst = c(-1L,
-1L, -1L, -1L, -1L, -1L), zone = c("CEST", "CEST", "CEST",
"CEST", "CEST", "CEST"), gmtoff = c(NA_integer_, NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_)), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst",
"zone", "gmtoff"), class = c("POSIXlt", "POSIXt")), Tim = structure(c(16,
15, 16, 16, 17, 7), class = "difftime", units = "hours")), .Names = c("Time",
"Temp", "Humidity", "Notes", "Day", "Tim"), row.names = c(NA,
6L), class = "data.frame")
Answer 0 (score: 1)
Here are three solutions: one using dplyr, one using data.table, and one using a split-lapply combination in base R without any packages.
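In each case, first convert Time to POSIXct and Day to Date. (I call the data set you provided sample here and choose 27 as the cutoff, so that there is one day with two matching rows.)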
sample$Time <- as.POSIXct(sample$Time)
sample$Day <- as.Date(sample$Day)
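The conversion matters because a POSIXlt value is stored as a list of components (sec, min, hour, and so on), which is what the "invalid type (list)" error in the question points at, whereas POSIXct and Date values are plain numeric vectors. A minimal sketch, using a made-up timestamp rather than the original data:

```r
# Hypothetical single timestamp, for illustration only
lt <- as.POSIXlt("2015-06-03 16:00:00", tz = "UTC")
ct <- as.POSIXct(lt)
d  <- as.Date(lt)

typeof(lt)                # "list": one element per component
typeof(ct)                # "double": seconds since 1970-01-01
typeof(d)                 # "double": days since 1970-01-01
names(unclass(lt))[1:3]   # "sec" "min" "hour"
```

Grouping functions that build a model frame expect atomic vectors, so either of the two numeric representations should work as a grouping variable.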
With dplyr, the verbs speak for themselves, which is why this is my favorite solution:
library(dplyr)
result <- sample %>%
group_by(Day) %>%
summarise(greater27=max(Temp > 27))
result
# # A tibble: 5 x 2
# Day greater27
# <date> <int>
# 1 2015-05-18 0
# 2 2015-06-01 0
# 3 2015-06-02 1
# 4 2015-06-03 1
# 5 2015-06-05 0
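Since the question asks for the number of days, the per-day flags can simply be summed. The following is a minimal self-contained sketch, assuming dplyr is installed. The data frame obs is a hypothetical stand-in that retypes the six sample rows (with UTC assumed as the time zone), and any() is swapped in for max() so that the flag is a logical value:

```r
library(dplyr)

# Hypothetical stand-in for the sample data (time zone assumed to be UTC)
obs <- data.frame(
  Time = as.POSIXct(c("2015-05-18 16:00:00", "2015-06-01 15:00:00",
                      "2015-06-02 16:00:00", "2015-06-03 16:00:00",
                      "2015-06-03 17:00:00", "2015-06-05 07:00:00"),
                    tz = "UTC"),
  Temp = c(26.5, 26.5, 28, 28, 30, 23)
)
obs$Day <- as.Date(obs$Time)

# One row per day, flagged TRUE if any measurement exceeds the cutoff
per_day <- obs %>%
  group_by(Day) %>%
  summarise(exceeds = any(Temp > 27))

# Number of distinct days above the cutoff
n_days <- sum(per_day$exceeds)
n_days  # 2 (2015-06-02 and 2015-06-03)
```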
With data.table, leave the first argument empty to select all rows, do the computation in the second argument, and specify by as a named argument:
library(data.table)
sample <- data.table(sample)
result <- sample[, .(greater27=max(Temp > 27)), by="Day"]
result
# Day greater27
# 1: 2015-05-18 0
# 2: 2015-06-01 0
# 3: 2015-06-02 1
# 4: 2015-06-03 1
# 5: 2015-06-05 0
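The question also asks how to keep the other columns. One possible approach, sketched here under the assumption that data.table is installed, is to keep the complete row holding each day's maximum instead of summarising. The table obs is a hypothetical stand-in for the sample data, with UTC assumed as the time zone:

```r
library(data.table)

# Hypothetical stand-in for the sample data (time zone assumed to be UTC)
obs <- data.table(
  Time = as.POSIXct(c("2015-05-18 16:00:00", "2015-06-01 15:00:00",
                      "2015-06-02 16:00:00", "2015-06-03 16:00:00",
                      "2015-06-03 17:00:00", "2015-06-05 07:00:00"),
                    tz = "UTC"),
  Temp = c(26.5, 26.5, 28, 28, 30, 23),
  Humidity = c(NA, NA, NA, NA, 60, NA)
)
obs[, Day := as.Date(Time)]

# For each day, keep the whole row with the highest temperature
daily_max <- obs[, .SD[which.max(Temp)], by = Day]

# Count the days whose maximum exceeds the cutoff
n_days <- daily_max[Temp > 27, .N]
n_days  # 2
```

Note that which.max() returns only the first maximum, so a day with tied maxima still contributes a single row.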
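With base R only, split the data set by Day, which gives you a list of data.frames, then apply an anonymous function to each, and finally rbind everything back together into a single data.frame: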
result <- do.call(rbind,
  lapply(split(sample, sample$Day),
    function(x) {
      data.frame(
        Day = x$Day[1],
        greater27 = max(x$Temp > 27)
      )
    }
  )
)
result
# Day greater27
# 2015-05-18 2015-05-18 0
# 2015-06-01 2015-06-01 0
# 2015-06-02 2015-06-02 1
# 2015-06-03 2015-06-03 1
# 2015-06-05 2015-06-05 0
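On the indexing puzzle in the question: in base R, indexing a data frame with a single index and no comma (x[i]) selects columns, while x[i, ] selects rows. That would explain both results. A five-element logical vector cannot index a two-column data frame, whereas a six-element one applied to the six columns of t simply drops the last column (Tim), so aggregate() is probably not changing any structure. A small sketch with a hypothetical two-column data frame:

```r
# Hypothetical stand-in for the aggregated result in the question
tt <- data.frame(
  Day  = as.Date(c("2015-05-18", "2015-06-01", "2015-06-02",
                   "2015-06-03", "2015-06-05")),
  Temp = c(26.5, 26.5, 28, 30, 23)
)
keep <- tt$Temp > 25         # TRUE TRUE TRUE TRUE FALSE

rows <- tt[keep, ]           # with a comma: the logical vector filters rows
nrow(rows)                   # 4

cols <- tt[c(TRUE, FALSE)]   # without a comma: the logical vector picks columns
names(cols)                  # "Day"

# A logical index longer than the number of columns raises an error
res <- try(tt[keep], silent = TRUE)
inherits(res, "try-error")   # TRUE
```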