Remove IDs with NA in a conditional group

Time: 2018-10-30 08:38:47

Tags: r dplyr data.table

Extending this question:

I prepared some data with the following code:

# # Data Preparation ----------------------
library(lubridate)
start_date <- "2018-10-30 00:00:00"
start_date <- as.POSIXct(start_date, origin="1970-01-01")
dates <- c(start_date)
for(i in 1:287) {
    dates <- c(dates, start_date + minutes(i * 10))
}
dates <- as.POSIXct(dates, origin="1970-01-01")
date_val <- format(dates, '%d-%m-%Y')

weather.forecast.data <- data.frame(dateTime = dates, date = date_val)
weather.forecast.data <- rbind(weather.forecast.data, weather.forecast.data, weather.forecast.data, weather.forecast.data)
weather.forecast.data$id <- c(rep('GH1', 288), rep('GH2', 288), rep('GH3', 288), rep('GH4', 288))
weather.forecast.data$radiation <- round(runif(nrow(weather.forecast.data)), 2)

weather.forecast.data$hour <- as.integer(format(weather.forecast.data$dateTime, '%H'))
weather.forecast.data$day_night <- ifelse(weather.forecast.data$hour < 6, 'night', ifelse(weather.forecast.data$hour < 19, 'day', 'night'))

# # GH2: Total Morning missing # #
weather.forecast.data$radiation[(weather.forecast.data$id == 'GH2') & (weather.forecast.data$date == '30-10-2018') & (weather.forecast.data$day_night == 'day')] = NA
weather.forecast.data$hour <- NULL
weather.forecast.data$day_night <- NULL

My task is to remove ids from weather.forecast.data where, for a given id and date, the radiation values of the morning half (06 hours to 18 hours) are NA, using dplyr.

I want to eliminate the rows of a given id and date for which the radiation values of the whole morning are missing. That is, if the morning radiation is missing for an id on some date, I remove all rows with that id and date.

We can see that GH2 is missing the whole morning's radiation on 30-10-2018. Hence, we remove all 144 records with id == 'GH2' and date = '30-10-2018'.

library(data.table)
setDT(weather.forecast.data)
weather.forecast.data[, sum(is.na(radiation)), .(id, date)]
    id       date V1
1: GH1 30-10-2018  0
2: GH1 31-10-2018  0
3: GH2 30-10-2018 78
4: GH2 31-10-2018  0
5: GH3 30-10-2018  0
6: GH3 31-10-2018  0
7: GH4 30-10-2018  0
8: GH4 31-10-2018  0

I have code that does this with data.table:

setDT(weather.forecast.data)
weather.forecast.data[, hour := hour(dateTime)]
weather.forecast.data[, day_night := c("night", "day")[(6 <= hour & hour < 19) + 1L]]
weather.forecast.data[, date_id := paste(date, id, sep = "__")]
weather.forecast.data[, all_is_na := all(is.na(radiation)), .(date_id, day_night)]
weather.forecast.data[!(date_id %in% unique(weather.forecast.data[(all_is_na == TRUE) & (day_night == 'day'), date_id]))]

I need code that does this with dplyr, and I tried the following. It removes more rows than required:

library(dplyr)
weather.forecast.data <- weather.forecast.data %>%
  mutate(hour = as.integer(format(dateTime, '%H'))) %>%
  mutate(day_night = ifelse(hour < 6, 'night', ifelse(hour < 19, 'day', 'night'))) %>%
  group_by(date, day_night, id) %>%
  filter((!all(is.na(radiation))) & (day_night == 'day')) %>%
  select (-c(hour, day_night)) %>%
  as.data.frame

Note: the output should return the data with the rows for id = 'GH2' and date = '30-10-2018' removed.

2 answers:

Answer 0: (score: 2)

I believe you are overcomplicating this a little. The following code does what you describe in the question.

library(lubridate)
library(dplyr)

weather.forecast.data %>%
  # Flag each row as day (06:00 to 18:59) or night
  mutate(hour = hour(dateTime),
         day_night = c("night", "day")[(6 <= hour & hour < 19) + 1L]) %>%
  group_by(date, id) %>%
  # keep is TRUE only when the (date, id) group has no missing daytime value
  mutate(keep = all(!(is.na(radiation) & day_night == "day"))) %>%
  ungroup() %>%
  filter(keep) %>%
  select(-hour, -day_night, -keep) %>%
  as.data.frame() -> df1

Check that this removes the expected 144 rows.

nrow(weather.forecast.data) - nrow(df1)
#[1] 144
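Note that the pipeline above drops a (date, id) group as soon as it contains any missing daytime value, which coincides with the requested result on this data set. A minimal sketch of a stricter variant, which drops a group only when every daytime value is missing, might look like the following. The toy data frame and its columns are hypothetical and exist only to make the example self-contained.

```r
library(dplyr)

# Hypothetical toy data: three ids, one date, one reading per hour.
toy <- data.frame(
  id = rep(c("A", "B", "C"), each = 24),
  date = "30-10-2018",
  hour = rep(0:23, times = 3),
  radiation = 1
)
is_day <- toy$hour >= 6 & toy$hour < 19
toy$radiation[toy$id == "B" & is_day] <- NA          # whole daytime missing
toy$radiation[toy$id == "C" & toy$hour == 12] <- NA  # a single daytime gap

kept <- toy %>%
  group_by(id, date) %>%
  # Drop the group only if all of its daytime readings are NA.
  # Caveat: a group with no daytime rows at all would also be dropped,
  # because all() of an empty vector is TRUE.
  filter(!all(is.na(radiation[hour >= 6 & hour < 19]))) %>%
  ungroup() %>%
  as.data.frame()

unique(kept$id)
```

In this sketch id B is removed entirely, while id C keeps all of its rows because only one daytime reading is missing.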

Data.

I am reposting the data generation code, simplified in two places and with a call to set.seed added.

set.seed(4192)

start_date <- "2018-10-30 00:00:00"
start_date <- as.POSIXct(start_date, origin="1970-01-01")
dates <- start_date + minutes(0:287 * 10)
dates <- as.POSIXct(dates, origin="1970-01-01")
date_val <- format(dates, '%d-%m-%Y')

weather.forecast.data <- data.frame(dateTime = dates, date = date_val)
weather.forecast.data <- rbind(weather.forecast.data, weather.forecast.data, weather.forecast.data, weather.forecast.data)
weather.forecast.data$id <- c(rep('GH1', 288), rep('GH2', 288), rep('GH3', 288), rep('GH4', 288))
weather.forecast.data$radiation <- round(runif(nrow(weather.forecast.data)), 2)

weather.forecast.data$hour <- hour(weather.forecast.data$dateTime)
weather.forecast.data$day_night <- ifelse(weather.forecast.data$hour < 6, 'night', ifelse(weather.forecast.data$hour < 19, 'day', 'night'))

# # GH2: Total Morning missing # #
weather.forecast.data$radiation[(weather.forecast.data$id == 'GH2') & (weather.forecast.data$date == '30-10-2018') & (weather.forecast.data$day_night == 'day')] = NA
weather.forecast.data$hour <- NULL
weather.forecast.data$day_night <- NULL
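As a possible further simplification, base R's seq() can build the same 10-minute grid without lubridate. This is only a sketch; tz = "UTC" is an assumption added here so the result does not depend on the local time zone.

```r
# 288 time stamps, 10 minutes apart, covering two full days
start_date <- as.POSIXct("2018-10-30 00:00:00", tz = "UTC")
dates <- seq(start_date, by = "10 min", length.out = 288)
date_val <- format(dates, "%d-%m-%Y")

# Each of the two dates should hold 144 time stamps
table(date_val)
```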

Answer 1: (score: 0)

You are filtering for rows that contain only 'day' in the day_night column. If I understand you correctly, you need the following:

    library(dplyr)
    weather.forecast.data <- weather.forecast.data %>%
      mutate(hour = as.integer(format(dateTime, '%H'))) %>%
      mutate(day_night = ifelse(hour < 6, 'night', ifelse(hour < 19, 'day', 
                                                         'night'))) %>%
      group_by(date, day_night, id) %>%
      # Negate the whole condition so that night rows are not filtered out
      filter(!(all(is.na(radiation)) & (day_night == 'day'))) %>%
      # Ungroup first; otherwise select() keeps the grouping columns
      ungroup() %>%
      select(-c(hour, day_night)) %>%
      as.data.frame()

This removes the daytime rows of every id whose radiation is entirely NA during the day.
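To remove every row of such an id and date, night rows included, as the question asks, one option is to mirror the data.table approach from the question: first collect the offending (id, date) pairs, then drop them with anti_join(). The following is a rough sketch on hypothetical toy data, not code from either answer.

```r
library(dplyr)

# Hypothetical toy data: two ids, one date, one reading per hour.
toy <- data.frame(
  id = rep(c("A", "B"), each = 24),
  date = "30-10-2018",
  hour = rep(0:23, times = 2),
  radiation = 1
)
toy$radiation[toy$id == "B" & toy$hour >= 6 & toy$hour < 19] <- NA

# (id, date) pairs whose daytime readings are all missing
bad <- toy %>%
  filter(hour >= 6, hour < 19) %>%
  group_by(id, date) %>%
  summarise(all_na = all(is.na(radiation))) %>%
  ungroup() %>%
  filter(all_na)

# Keep every row whose (id, date) pair is not in bad
kept <- anti_join(toy, bad, by = c("id", "date"))

nrow(toy) - nrow(kept)
```

Because the join is keyed on id and date only, the night rows of a flagged group are removed together with its daytime rows.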