Adding missing rows for past dates in R (e.g., the NYT coronavirus case dataset for US counties)

Time: 2020-04-23 00:47:57

Tags: r missing-data

I am currently working with the New York Times coronavirus dataset for US counties.

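The data is organized by date, and a county only appears on a given date if it has at least one case. So for the first date (1/21), when only one county had a case, there is just one row.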

For example:

     date        county     state       fips   cases  deaths
1    2020-01-21  Snohomish  Washington  53061  1      0       # Snohomish data starts 1/21
2    2020-01-22  Snohomish  Washington  53061  1      0
3    2020-01-23  Snohomish  Washington  53061  1      0
4    2020-01-24  Cook       Illinois    17031  1      0       # Cook data starts 1/24
8    2020-01-25  Snohomish  Washington  53061  1      0
7    2020-01-25  Cook       Illinois    17031  1      0
6    2020-01-25  Orange     California  6059   1      0       # Orange data starts 1/25

......

How can I fill in the missing dates for each county?

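For example, here I would like to add rows for Cook and Orange counties for the earlier days, with 0 for both cases and deaths, while keeping the state, fips, and other information. I would do it by hand, but there are now thousands of counties.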

1 Answer:

Answer 0: (score: 1)

You can use complete() from tidyr to add the missing dates, and fill() to carry state and the other columns into the new rows.

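library(dplyr)
library(tidyr)

df %>%
  # date is stored as a factor in the sample data; convert it to Date
  mutate(date = as.Date(date)) %>%
  # add a row for every county/date combination, with 0 for the counts
  complete(county, date, fill = list(cases = 0, deaths = 0)) %>%
  # fill state and fips within each county so values never leak across counties
  group_by(county) %>%
  fill(everything(), .direction = "updown") %>%
  ungroup()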


#  county    date       state       fips cases deaths
#   <fct>     <date>     <fct>      <int> <dbl>  <dbl>
# 1 Cook      2020-01-21 Illinois   17031     0      0
# 2 Cook      2020-01-22 Illinois   17031     0      0
# 3 Cook      2020-01-23 Illinois   17031     0      0
# 4 Cook      2020-01-24 Illinois   17031     1      0
# 5 Cook      2020-01-25 Illinois   17031     1      0
# 6 Orange    2020-01-21 California  6059     0      0
# 7 Orange    2020-01-22 California  6059     0      0
# 8 Orange    2020-01-23 California  6059     0      0
# 9 Orange    2020-01-24 California  6059     0      0
#10 Orange    2020-01-25 California  6059     1      0
#11 Snohomish 2020-01-21 Washington 53061     1      0
#12 Snohomish 2020-01-22 Washington 53061     1      0
#13 Snohomish 2020-01-23 Washington 53061     1      0
#14 Snohomish 2020-01-24 Washington 53061     0      0
#15 Snohomish 2020-01-25 Washington 53061     1      0
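
One caveat when scaling this up to the full dataset: county names repeat across states (there is an Orange County in both California and Florida, for instance), so completing on county alone would merge them. The following is a minimal sketch of a variant, not part of the original answer, that treats county, state, and fips as one unit via tidyr's nesting(), so no separate fill step is needed. The data frame is the question's sample rebuilt by hand, and result is just an illustrative name.

```r
library(dplyr)
library(tidyr)

# The sample rows from the question, rebuilt by hand
df <- data.frame(
  date   = as.Date(c("2020-01-21", "2020-01-22", "2020-01-23", "2020-01-24",
                     "2020-01-25", "2020-01-25", "2020-01-25")),
  county = c("Snohomish", "Snohomish", "Snohomish", "Cook",
             "Snohomish", "Cook", "Orange"),
  state  = c("Washington", "Washington", "Washington", "Illinois",
             "Washington", "Illinois", "California"),
  fips   = c(53061L, 53061L, 53061L, 17031L, 53061L, 17031L, 6059L),
  cases  = 1L,
  deaths = 0L
)

# nesting() keeps only the county/state/fips combinations present in the data;
# each of them is then crossed with the full daily date range
result <- df %>%
  complete(
    nesting(county, state, fips),
    date = seq(min(date), max(date), by = "day"),
    fill = list(cases = 0L, deaths = 0L)
  )

result
```

Building the date range with seq() also creates rows for days on which no county at all reported, which completing on the observed dates alone would not.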

Data

df <- structure(list(
  date = structure(c(1L, 2L, 3L, 4L, 5L, 5L, 5L),
    .Label = c("2020-01-21", "2020-01-22", "2020-01-23", "2020-01-24", "2020-01-25"),
    class = "factor"),
  county = structure(c(3L, 3L, 3L, 1L, 3L, 1L, 2L),
    .Label = c("Cook", "Orange", "Snohomish"), class = "factor"),
  state = structure(c(3L, 3L, 3L, 2L, 3L, 2L, 1L),
    .Label = c("California", "Illinois", "Washington"), class = "factor"),
  fips = c(53061L, 53061L, 53061L, 17031L, 53061L, 17031L, 6059L),
  cases = c(1L, 1L, 1L, 1L, 1L, 1L, 1L),
  deaths = c(0L, 0L, 0L, 0L, 0L, 0L, 0L)
), class = "data.frame", row.names = c(NA, -7L))
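
The sample data stores date as a factor, which is why the answer converts it with as.Date() first. The following small sketch shows what that conversion makes possible, using two example values. The names date_factor, date_parsed, and all_days are made up for illustration.

```r
# In the sample data, date is a factor of strings rather than a Date
date_factor <- factor(c("2020-01-21", "2020-01-25"))

# max(date_factor) would fail: range functions are not defined for factors
date_parsed <- as.Date(date_factor)

# Once converted, a gap-free daily sequence can be built
all_days <- seq(min(date_parsed), max(date_parsed), by = "day")
all_days
```

A Date column supports min(), max(), and seq(), so the full daily range can be generated even when some days are absent from the data.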