使用dplyr缺少数据

时间:2017-01-22 02:48:00

标签: r dplyr

再一次继续我前两个问题,但问题略有不同。我一直在使用的数据中的另一个皱纹:

date <- c("2016-03-24","2016-03-24","2016-03-24","2016-03-24","2016-03-24",
          "2016-03-24","2016-03-24","2016-03-24","2016-03-24")
location <- c(1,1,2,2,3,3,4,"out","out")
sensor <- c(1,16,1,16,1,16,1,1,16)
Temp <- c(35,34,92,42,21,47,42,63,12)
df <- data.frame(date,location,sensor,Temp)

我的部分数据缺少值。 NA未指明它们。它们不在数据期间。

我想从位置“4”中减去位置“out”而忽略其他位置,我想通过日期和传感器来做。我已成功完成了包含所有数据的数据位置以及以下代码

df %>%
  filter(location %in% c(4, 'out')) %>% 
  group_by(date, sensor) %>% 
  summarize(Diff = Temp[location=="4"] - Temp[location=="out"],
            location = first(location)) %>%
  select(1, 2, 4, 3) 

但是对于缺少日期的数据,我收到以下错误Error: expecting a single value。我认为这是因为dplyr在到达缺失的数据点时不知道该怎么做。

进行一些研究,似乎do是可行的方法,但它会返回一个数据框,而不会相互减去任何值。

df %>%
  filter(location %in% c(4, 'out')) %>% 
  group_by(date, sensor) %>% 
  do(Diff = Temp[location=="4"] - Temp[location=="out"],
            location = first(location)) %>%
  select(1, 2, 4, 3) 

有没有办法覆盖dplyr并告诉它如果找不到要删除的条目之一就返回NA

2 个答案:

答案 0 :(得分:3)

library(tidyverse)

date <- c("2016-03-24", "2016-03-24", "2016-03-24", "2016-03-24", "2016-03-24",
          "2016-03-24", "2016-03-24", "2016-03-24", "2016-03-24")
location <- c(1, 1, 2, 2, 3, 3, 4, "out", "out")
sensor <- c(1, 16, 1, 16, 1, 16, 1, 1, 16)
Temp <- c(35, 34, 92, 42, 21, 47, 42, 63, 12)

df <- data_frame(date, location, sensor, Temp)

# edge case helper
`%||0%` <- function (x, y) { if (is.null(x) | length(x) == 0) y else x }

df %>%
  filter(location %in% c(4,  'out')) %>%
  mutate(location=factor(location, levels=c("4", "out"))) %>%             # make location a factor 
  arrange(sensor, location) %>%                                           # order it so we can use diff()
  group_by(date,  sensor) %>%
  summarize(Diff = diff(Temp) %||0% NA, location = first(location)) %>% # deal with the edge case
  select(1,  2,  4,  3)
## Source: local data frame [2 x 4]
## Groups: date [1]
## 
##         date sensor location  Diff
##        <chr>  <dbl>   <fctr> <dbl>
## 1 2016-03-24      1        4    21
## 2 2016-03-24     16      out    NA

答案 1 :(得分:1)

如果我们想要返回NA,可能的选项是

library(dplyr)
df %>%
    filter(location %in% c(4, 'out')) %>% 
    group_by(date, sensor) %>%
    arrange(sensor, location) %>%   
    summarise(Diff = if(n()==1) NA else diff(Temp), location = first(location))  %>%
    select(1, 2, 4, 3)
#        date sensor location  Diff
#      <fctr>  <dbl>   <fctr> <dbl>
#1 2016-03-24      1        4    21
#2 2016-03-24     16      out    NA

data.table中的等效选项是

library(data.table)
setDT(df)[location %in% c(4, 'out')][
     order(sensor, location), .(Diff = if(.N==1) NA_real_ else diff(Temp), 
      location = location[1]), .(date, sensor)][, c(1, 2, 4, 3), with = FALSE]
#          date sensor location Diff
#1: 2016-03-24      1        4   21
#2: 2016-03-24     16      out   NA