Question

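I took temperature measurements on different urban tree species at a high temporal resolution of 10 minutes, and I want to compare how the species react. I am therefore looking specifically at periods of high temperature. The task I cannot manage on my dataset is selecting complete days based on a maximum value. For example, every day on which at least one measurement is above 30 °C should be extracted in full from my data frame. Below is a reproducible example that illustrates my problem: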

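In my Measurings data frame I calculated a column indicating whether each individual measurement is above or below 30 °C. I want to use that column to tell other functions whether or not a day should be picked when building a new data frame. Whenever the value is above 30 °C at any time of the day, I want to include that whole date, from 00:00 to 23:59, in the new data frame for further analysis.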

start <- as.POSIXct("2018-05-18 00:00", tz = "CET")
tseq <- seq(from = start, length.out = 1000, by = "hours")

Measurings <- data.frame(
  Time = tseq,
  Temp = sample(20:35,1000, replace = TRUE),
  Variable1 = sample(1:200,1000, replace = TRUE),
  Variable2 = sample(300:800,1000, replace = TRUE)
)

Measurings$heat30 <- ifelse(Measurings$Temp > 30,"heat", "normal")

Measurings$otheroption30 <- ifelse(Measurings$Temp > 30,"1", "0")

The example produces a data frame with a structure similar to my data:

head(Measurings)

                 Time Temp Variable1 Variable2 heat30 otheroption30
1 2018-05-18 00:00:00   28        56       377 normal             0
2 2018-05-18 01:00:00   23        65       408 normal             0
3 2018-05-18 02:00:00   29        78       324 normal             0
4 2018-05-18 03:00:00   24       157       432 normal             0
5 2018-05-18 04:00:00   32       129       794   heat             1
6 2018-05-18 05:00:00   25        27       574 normal             0

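So how can I subset a new data frame that takes whole days in which at least one entry is marked as "heat"?

I know that, for example, dplyr::filter could pick out the single entries (row 5 at the beginning of the example). But how can I tell it to take all of 2018-05-18?

I am rather new to analysing data with R, so I would appreciate any suggestions for a working solution to my problem. dplyr is what I have been using for many tasks, but I am open to anything that works.

Many thanks, Konrad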

Answer 1

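Here is one possible solution using the dataset provided in the question. Note that it is not a great example, because every day will probably include at least one observation marked as above 30 °C (that is, there are no days to filter out in this dataset), but the code should still do the job.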

# import packages
library(dplyr)
library(stringr)

# break the time stamp into Day and Hour
# (format() gives every row the same "date time" layout, midnight included)
time_parts <- str_split(format(Measurings$Time, "%Y-%m-%d %H:%M:%S"), " ", simplify = TRUE)

# name the columns
colnames(time_parts) <- c("Day", "Hour")
time_df <- as_tibble(time_parts)

# create a new measurement data frame with separate Day and Hour columns
new_measurings_df <- bind_cols(time_df, Measurings[-1])

# form the new data frame by filtering the days marked as heat
new_df <- new_measurings_df %>%
  filter(Day %in% new_measurings_df$Day[new_measurings_df$heat30 == "heat"])

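To be more precise, you are creating a random sample of 1000 hourly observations, covering about 42 days, with temperatures varying between 20 and 35. As a result, it is very likely that every single day in your example has at least one observation marked as above 30 °C. Also, it is always good practice to set a seed to ensure reproducibility.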
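A minimal sketch of the seed advice, reusing the sampling from the question. The seed value 42 and the helper names Day, n_days and n_heat_days are arbitrary choices made for this illustration. Each hourly value exceeds 30 with probability 5/16, so a full 24-hour day without any "heat" value has probability (11/16)^24, which is on the order of 10^-4. That is why almost no day can be filtered out of this sample.

```r
# Sketch only: the seed value and the helper names are illustrative assumptions.
set.seed(42)  # fixes the random number stream so sample() is repeatable

start <- as.POSIXct("2018-05-18 00:00", tz = "CET")
tseq <- seq(from = start, length.out = 1000, by = "hours")

Measurings <- data.frame(
  Time = tseq,
  Temp = sample(20:35, 1000, replace = TRUE)
)
Measurings$heat30 <- ifelse(Measurings$Temp > 30, "heat", "normal")

# count the calendar days in total and the days with at least one "heat" reading
Day <- format(Measurings$Time, "%Y-%m-%d")
n_days <- length(unique(Day))
n_heat_days <- length(unique(Day[Measurings$heat30 == "heat"]))
```

Comparing n_heat_days with n_days shows how many days would survive the filter. Rerunning the script with the same seed gives the same Temp column every time.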

Answer 2

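Create a variable that specifies the day (without hours, minutes, etc.). Loop over the unique dates and keep only those subsets that contain "heat" at least once in heat30: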

library(dplyr)

Measurings <- Measurings %>% mutate(Time2 = format(Time, "%Y-%m-%d"))

newdf <- lapply(unique(Measurings$Time2), function(x) {

  rr <- Measurings %>% filter(Time2 == x) # select all rows of date x

  # check if heat30 contains "heat" at least once; if so keep that subset,
  # otherwise return NULL, which bind_rows() drops
  if (any(rr$heat30 == "heat")) rr else NULL

}) %>% bind_rows()
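The same "keep the whole day if any value is hot" idea can probably also be written without an explicit loop, using a grouped filter in dplyr. The following is a sketch on a small made-up data frame. The names toy, Day and hot_days are not from the question, and the time zone "CET" is simply carried over from it.

```r
library(dplyr)

# Small made-up data set: two days, only the first contains a value above 30.
toy <- data.frame(
  Time = as.POSIXct(c("2018-05-18 00:00", "2018-05-18 12:00",
                      "2018-05-19 00:00", "2018-05-19 12:00"), tz = "CET"),
  Temp = c(25, 32, 24, 28)
)

hot_days <- toy %>%
  mutate(Day = as.Date(Time, tz = "CET")) %>%  # calendar day in the data's time zone
  group_by(Day) %>%
  filter(any(Temp > 30)) %>%                   # keep all rows of a day with any hot reading
  ungroup()
```

Passing tz to as.Date() matters here: without it the date is taken in UTC, so measurements shortly after local midnight could be assigned to the previous day.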

Filter/select data by POSIXct time and a condition in R
