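I am currently working with the New York Times coronavirus dataset for US counties.
It is organized by date, and a county only gets a row on a given date once it has reported at least one case. So on the first date (1/21), only the first county with a case has a row.
For example: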
date county state fips cases deaths
1 2020-01-21 Snohomish Washington 53061 1 0 #Snohomish data starts 1/21
2 2020-01-22 Snohomish Washington 53061 1 0
3 2020-01-23 Snohomish Washington 53061 1 0
4 2020-01-24 Cook Illinois 17031 1 0 #Cook data starts 1/24
8 2020-01-25 Snohomish Washington 53061 1 0
7 2020-01-25 Cook Illinois 17031 1 0
6 2020-01-25 Orange California 6059 1 0 #Orange data starts 1/25
......
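How can I fill in the missing dates for each county?
For example, here I want to add rows for Cook and Orange counties on the earlier dates, with 0 0 for cases and deaths, while keeping the state, fips, and other information. I would do it by hand, but there are thousands of counties by now.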
Answer 0 (score: 1)
You can use complete to add the missing dates, then fill to carry state and the other columns into the new rows.
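library(dplyr)
library(tidyr)
df %>%
  mutate(date = as.Date(date)) %>%
  # add a row for every county/date combination, with 0 for the new counts
  complete(county, date, fill = list(cases = 0, deaths = 0)) %>%
  # fill state and fips within each county so values never leak across counties
  group_by(county) %>%
  fill(everything(), .direction = "updown") %>%
  ungroup()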
# county date state fips cases deaths
# <fct> <date> <fct> <int> <dbl> <dbl>
# 1 Cook 2020-01-21 Illinois 17031 0 0
# 2 Cook 2020-01-22 Illinois 17031 0 0
# 3 Cook 2020-01-23 Illinois 17031 0 0
# 4 Cook 2020-01-24 Illinois 17031 1 0
# 5 Cook 2020-01-25 Illinois 17031 1 0
# 6 Orange 2020-01-21 California 6059 0 0
# 7 Orange 2020-01-22 California 6059 0 0
# 8 Orange 2020-01-23 California 6059 0 0
# 9 Orange 2020-01-24 California 6059 0 0
#10 Orange 2020-01-25 California 6059 1 0
#11 Snohomish 2020-01-21 Washington 53061 1 0
#12 Snohomish 2020-01-22 Washington 53061 1 0
#13 Snohomish 2020-01-23 Washington 53061 1 0
#14 Snohomish 2020-01-24 Washington 53061 0 0
#15 Snohomish 2020-01-25 Washington 53061 1 0
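One caveat when scaling this up to thousands of counties: county names are not unique across states (there is an Orange County in both California and Florida, for instance), so keying on county alone can merge distinct counties. A possible variant, sketched below, passes nesting(county, state, fips) to complete so that only the combinations present in the data are expanded and each new row already carries its own state and fips. The names df2 and filled are illustrative, and the Florida row is made-up data added only to show the name collision.

```r
library(tidyr)

# Toy data in the shape of the question, plus a hypothetical second "Orange"
# county (Florida) to show why county name alone is not a safe key.
df2 <- data.frame(
  date   = as.Date(c("2020-01-21", "2020-01-22", "2020-01-22", "2020-01-23")),
  county = c("Snohomish", "Snohomish", "Orange", "Orange"),
  state  = c("Washington", "Washington", "California", "Florida"),
  fips   = c(53061L, 53061L, 6059L, 12095L),
  cases  = c(1L, 1L, 1L, 1L),
  deaths = c(0L, 0L, 0L, 0L),
  stringsAsFactors = FALSE
)

# nesting() limits the expansion to county/state/fips combinations that occur
# in the data, so no separate fill() step is needed for state and fips.
filled <- complete(
  df2,
  date,
  nesting(county, state, fips),
  fill = list(cases = 0L, deaths = 0L)
)

nrow(filled)  # 9: 3 counties x 3 dates
```

Because state and fips come from the nesting itself, the two Orange counties stay separate and no value is borrowed from a neighboring county.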
Data
df <- structure(list(date = structure(c(1L, 2L, 3L, 4L, 5L, 5L, 5L),
.Label = c("2020-01-21", "2020-01-22", "2020-01-23", "2020-01-24", "2020-01-25"),
class = "factor"),county = structure(c(3L, 3L, 3L, 1L, 3L, 1L, 2L),
.Label = c("Cook","Orange", "Snohomish"), class = "factor"),
state = structure(c(3L,3L, 3L, 2L, 3L, 2L, 1L),
.Label = c("California", "Illinois","Washington"), class = "factor"),
fips = c(53061L, 53061L, 53061L, 17031L, 53061L, 17031L, 6059L),
cases = c(1L, 1L, 1L, 1L, 1L, 1L, 1L), deaths = c(0L, 0L, 0L, 0L, 0L, 0L, 0L
)), class = "data.frame", row.names = c(NA, -7L))