我有一些国家的covid-19案例的累积数据,我正在尝试在名为Diff的新列中计算差异。我无法删除NA值,因为它不会显示未执行任何测试的日期。因此,我将其设置为如果存在NA值,则将Diff值设置为0以表示没有差异,因此当天没有进行任何测试。
我也在尝试声明,如果Diff也是NA,表示前一天没有进行测试,则将差异设置为当天确认的病例值。
从底部的结果可以看出,我快到了,但是我正在创建一个名为ifelse的新列。我试图解决此问题,但我认为我在某个地方犯了一个简单的错误。如果有人可以向我指出,我将非常感谢,谢谢。
编辑:我意识到在将滞后计算= NA时将日常案件设置为已确认案件的想法引起了逻辑上的错误,因为这会产生误导性的答案。
我在大型数据集上使用以下代码填充并在出现NA时重复前面的值。我按组筛选,以免在各个国家/地区简单地传播前瞻性价值。
然后我计算了时滞,然后使用Ronak Shah的代码来获取每日值。
data <- data %>%
group_by(CountryName) %>%
fill(ConfirmedCases, .direction = "down")
data <- data %>%
mutate(lag1 = ConfirmedCases - lag(ConfirmedCases))
data <- data %>% mutate(DailyCases = replace_na(coalesce(lag1, ConfirmedCases), 0))
library(tidyverse)
data <- data.frame(
stringsAsFactors = FALSE,
CountryName = c("Afghanistan","Afghanistan",
"Afghanistan","Afghanistan","Afghanistan",
"Afghanistan","Afghanistan",
"Afghanistan","Afghanistan","Afghanistan",
"Afghanistan","Afghanistan","Afghanistan",
"Afghanistan","Afghanistan",
"Afghanistan","Afghanistan","Afghanistan",
"Afghanistan","Afghanistan","Afghanistan",
"Afghanistan","Afghanistan",
"Afghanistan","Afghanistan","Afghanistan",
"Afghanistan","Afghanistan","Afghanistan",
"Afghanistan","Afghanistan"),
ConfirmedCases = c(NA,7L,NA,NA,NA,10L,16L,21L,
22L,22L,22L,24L,24L,34L,40L,42L,
75L,75L,91L,106L,114L,141L,166L,
192L,235L,235L,270L,299L,337L,367L,
423L),
Diff = c(NA,NA,NA,NA,NA,NA,6L,5L,1L,
0L,0L,2L,0L,10L,6L,2L,33L,0L,16L,
15L,8L,27L,25L,26L,43L,0L,35L,
29L,38L,30L,56L)
)
data2 <- data %>%
mutate(Diff = ifelse(is.na(ConfirmedCases) == TRUE, 0, ConfirmedCases - lag(ConfirmedCases)),
ifelse(is.na((ConfirmedCases - lag(ConfirmedCases))) == TRUE, ConfirmedCases, ConfirmedCases - lag(ConfirmedCases)))
head(data2, 10)
#> CountryName ConfirmedCases Diff ifelse(...)
#> 1 Afghanistan NA 0 NA
#> 2 Afghanistan 7 NA 7
#> 3 Afghanistan NA 0 NA
#> 4 Afghanistan NA 0 NA
#> 5 Afghanistan NA 0 NA
#> 6 Afghanistan 10 NA 10
#> 7 Afghanistan 16 6 6
#> 8 Afghanistan 21 5 5
#> 9 Afghanistan 22 1 1
#> 10 Afghanistan 22 0 0
由reprex package(v0.3.0)于2020-08-15创建
答案 0 :(得分:1)
也许这可以通过创建目标列的副本来提供帮助:
library(tidyverse)
data %>% mutate(D=ConfirmedCases,D=ifelse(is.na(D),0,D),
Diff2 = c(0,diff(D)),Diff2=ifelse(Diff2<0,0,Diff2)) %>% select(-D)
输出:
CountryName ConfirmedCases Diff Diff2
1 Afghanistan NA NA 0
2 Afghanistan 7 NA 7
3 Afghanistan NA NA 0
4 Afghanistan NA NA 0
5 Afghanistan NA NA 0
6 Afghanistan 10 NA 10
7 Afghanistan 16 6 6
8 Afghanistan 21 5 5
9 Afghanistan 22 1 1
10 Afghanistan 22 0 0
11 Afghanistan 22 0 0
12 Afghanistan 24 2 2
13 Afghanistan 24 0 0
14 Afghanistan 34 10 10
15 Afghanistan 40 6 6
16 Afghanistan 42 2 2
17 Afghanistan 75 33 33
18 Afghanistan 75 0 0
19 Afghanistan 91 16 16
20 Afghanistan 106 15 15
21 Afghanistan 114 8 8
22 Afghanistan 141 27 27
23 Afghanistan 166 25 25
24 Afghanistan 192 26 26
25 Afghanistan 235 43 43
26 Afghanistan 235 0 0
27 Afghanistan 270 35 35
28 Afghanistan 299 29 29
29 Afghanistan 337 38 38
30 Afghanistan 367 30 30
31 Afghanistan 423 56 56
答案 1 :(得分:1)
我认为您可以使用coalesce
从Diff
和ConfirmedCases
获取第一个非NA值,如果两个都是NA
,则将其替换为0。
library(dplyr)
data %>%
mutate(Diff2 = tidyr::replace_na(coalesce(Diff, ConfirmedCases), 0))
# CountryName ConfirmedCases Diff Diff2
#1 Afghanistan NA NA 0
#2 Afghanistan 7 NA 7
#3 Afghanistan NA NA 0
#4 Afghanistan NA NA 0
#5 Afghanistan NA NA 0
#6 Afghanistan 10 NA 10
#7 Afghanistan 16 6 6
#8 Afghanistan 21 5 5
#9 Afghanistan 22 1 1
#10 Afghanistan 22 0 0
#11 Afghanistan 22 0 0
#12 Afghanistan 24 2 2
#...
#...