How do I correctly left join panel data to expose missing values?

Time: 2016-07-02 12:01:56

Tags: r

I want to left join panel data because some observations are missing. However, I can't manage to do this while keeping the panel structure:

Data:

# package I'm using
library(dplyr)

# named dates so that it matches the data.frame() call below
dates <- as.Date(c("2015-02-13",
                   "2015-02-14",
                   "2015-02-16",
                   "2015-02-17",
                   "2015-02-14",
                   "2015-02-16",
                   "2015-02-13",
                   "2015-02-14",
                   "2015-02-17"))

b <-c("John","John","John","John","Michael","Michael","Thomas","Thomas","Thomas")
c <- c(20,30,26,20,30,40,5,10,4)
d <- c(11,2233,12,2,22,13,23,23,100)
# put together
df <- data.frame(b, dates,c,d)

df
         b      dates  c    d
#1    John 2015-02-13 20   11
#2    John 2015-02-14 30 2233
#3    John 2015-02-16 26   12
#4    John 2015-02-17 20    2
#5 Michael 2015-02-14 30   22
#6 Michael 2015-02-16 40   13
#7  Thomas 2015-02-13  5   23
#8  Thomas 2015-02-14 10   23
#9  Thomas 2015-02-17  4  100

What I tried was to create a complete vector of dates and left join onto it:

# one row per calendar day, in a column named date
date <- data.frame(date = seq(as.Date("2015-02-13"), as.Date("2015-02-17"), by = "days"))

# and left join:

t <- left_join(date,df,by=c("date"="dates"))

t

        date       b  c    d
#1  2015-02-13    John 20   11
#2  2015-02-13  Thomas  5   23
#3  2015-02-14    John 30 2233
#4  2015-02-14 Michael 30   22
#5  2015-02-14  Thomas 10   23
#6  2015-02-15    <NA> NA   NA
#7  2015-02-16    John 26   12
#8  2015-02-16 Michael 40   13
#9  2015-02-17    John 20    2
#10 2015-02-17  Thomas  4  100

How can I get a result like this instead:

          b      dates  c    d
#1     John 2015-02-13 20   11
#2     John 2015-02-14 30 2233
#3     John 2015-02-15 NA   NA
#4     John 2015-02-16 26   12
#5     John 2015-02-17 20    2
#6  Michael 2015-02-13 NA   NA
#7  Michael 2015-02-14 30   22
#8  Michael 2015-02-15 NA   NA
#9  Michael 2015-02-16 40   13
#10 Michael 2015-02-17 NA   NA
#11  Thomas 2015-02-13  5   23
#12  Thomas 2015-02-14 10   23
#13  Thomas 2015-02-15 NA   NA
#14  Thomas 2015-02-16 NA   NA
#15  Thomas 2015-02-17  4  100

1 Answer:

Answer 0: (score: 5)

We can use expand.grid to build every combination of person and date, then left join the data onto that grid:

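 library(dplyr)
 # all person-by-day combinations, then attach the observed rows
 expand.grid(b = unique(df$b),
             dates = seq(min(df$dates), max(df$dates), by = "1 day"),
             stringsAsFactors = FALSE) %>%
     left_join(df, by = c("b", "dates")) %>%
     arrange(b, dates)
#         b      dates  c    d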
#1     John 2015-02-13 20   11
#2     John 2015-02-14 30 2233
#3     John 2015-02-15 NA   NA
#4     John 2015-02-16 26   12
#5     John 2015-02-17 20    2
#6  Michael 2015-02-13 NA   NA
#7  Michael 2015-02-14 30   22
#8  Michael 2015-02-15 NA   NA
#9  Michael 2015-02-16 40   13
#10 Michael 2015-02-17 NA   NA
#11  Thomas 2015-02-13  5   23
#12  Thomas 2015-02-14 10   23
#13  Thomas 2015-02-15 NA   NA
#14  Thomas 2015-02-16 NA   NA
#15  Thomas 2015-02-17  4  100

Or use complete from tidyr:

library(tidyr)
# the column in df is called dates, so that is the name to complete on
complete(df, b, dates = seq(min(dates), max(dates), by = "1 day"))
#        b      dates     c     d
#    <fctr>     <date> <dbl> <dbl>
#1     John 2015-02-13    20    11
#2     John 2015-02-14    30  2233
#3     John 2015-02-15    NA    NA
#4     John 2015-02-16    26    12
#5     John 2015-02-17    20     2
#6  Michael 2015-02-13    NA    NA
#7  Michael 2015-02-14    30    22
#8  Michael 2015-02-15    NA    NA
#9  Michael 2015-02-16    40    13
#10 Michael 2015-02-17    NA    NA
#11  Thomas 2015-02-13     5    23
#12  Thomas 2015-02-14    10    23
#13  Thomas 2015-02-15    NA    NA
#14  Thomas 2015-02-16    NA    NA
#15  Thomas 2015-02-17     4   100
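
For anyone who prefers not to load extra packages, the same idea can be sketched in base R with merge. This is only an illustration of the grid-then-left-join approach above, not part of the original answer. It assumes the date column is called dates, as in the question's df, and the names grid and panel are made up for the example.

```r
# Rebuild the question's data (base R only, no packages needed).
df <- data.frame(
  b = c("John", "John", "John", "John", "Michael", "Michael",
        "Thomas", "Thomas", "Thomas"),
  dates = as.Date(c("2015-02-13", "2015-02-14", "2015-02-16", "2015-02-17",
                    "2015-02-14", "2015-02-16",
                    "2015-02-13", "2015-02-14", "2015-02-17")),
  c = c(20, 30, 26, 20, 30, 40, 5, 10, 4),
  d = c(11, 2233, 12, 2, 22, 13, 23, 23, 100),
  stringsAsFactors = FALSE
)

# Every person crossed with every calendar day in the observed range.
grid <- expand.grid(
  b = unique(df$b),
  dates = seq(min(df$dates), max(df$dates), by = "1 day"),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every grid row, so person-days with no
# observation come back with NA in c and d.
panel <- merge(grid, df, by = c("b", "dates"), all.x = TRUE)

# Sort into panel order: by person, then by date.
panel <- panel[order(panel$b, panel$dates), ]
rownames(panel) <- NULL

panel
```

The key point is that the join has to be keyed on both b and dates. Joining on the date alone, as in the question, can only add one NA row per missing calendar day rather than one per person.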