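I am working with a dataset that records how governments have responded to the coronavirus through policy. For plotting purposes, I used pivot_longer to put all of the policies in one column and their corresponding values in another.
To check that this worked correctly, I filtered for one specific country (the United Kingdom) and one specific policy (school closing). There should be 253 rows (as of 9 September 2020), but for some reason there are 5 times as many. I believe the first 253 rows are correct, but I don't know where the extra ones come from. I have tried several ways to fix this without any luck. If anyone could explain what I am doing wrong and how to fix it, I would be very grateful. Thanks.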
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
response <- read.csv("https://raw.githubusercontent.com/OxCGRT/covid-policy-tracker/master/data/OxCGRT_latest.csv", fileEncoding = "UTF-8-BOM")
## setting correct date format with lubridate
response$Date <- ymd(response$Date)
##Removing some variables from the dataset
response <- response %>%
  select(-contains(c("Notes", "IsGeneral", "StringencyIndex", "Flag", "Stringency", "HealthIndex", "SupportIndex",
                     "ResponseIndex", "RegionName", "RegionCode", "CountryCode")))
## a small preview of the dataset
head(response[, c(1:4)])
#> CountryName Date C1_School.closing C2_Workplace.closing
#> 1 Aruba 2020-01-01 0 0
#> 2 Aruba 2020-01-02 0 0
#> 3 Aruba 2020-01-03 0 0
#> 4 Aruba 2020-01-04 0 0
#> 5 Aruba 2020-01-05 0 0
#> 6 Aruba 2020-01-06 0 0
pivot <- response %>%
  pivot_longer(
    cols = C1_School.closing:M1_Wildcard,
    names_to = "policy",
    values_to = "value"
  )
## each country should have 253 rows for each policy
pivot %>%
  filter(CountryName == "United Kingdom",
         policy == "C1_School.closing")
#> # A tibble: 1,265 x 6
#> CountryName Date ConfirmedCases ConfirmedDeaths policy value
#> <chr> <date> <int> <int> <chr> <dbl>
#> 1 United Kingdom 2020-01-01 0 0 C1_School.clo~ 0
#> 2 United Kingdom 2020-01-02 0 0 C1_School.clo~ 0
#> 3 United Kingdom 2020-01-03 0 0 C1_School.clo~ 0
#> 4 United Kingdom 2020-01-04 0 0 C1_School.clo~ 0
#> 5 United Kingdom 2020-01-05 0 0 C1_School.clo~ 0
#> 6 United Kingdom 2020-01-06 0 0 C1_School.clo~ 0
#> 7 United Kingdom 2020-01-07 0 0 C1_School.clo~ 0
#> 8 United Kingdom 2020-01-08 0 0 C1_School.clo~ 0
#> 9 United Kingdom 2020-01-09 0 0 C1_School.clo~ 0
#> 10 United Kingdom 2020-01-10 0 0 C1_School.clo~ 0
#> # ... with 1,255 more rows
## there are 5x as many rows as needed.
## there should only be 253 days of data for one policy and one country
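Created on 2020-09-09 by the reprex package (v0.3.0)
Answer 0 (score: 1)
After digging into your data, I found that it contains duplicate dates. This is a significant issue, and as a data analyst you need to know how to handle it. The code below flags duplicate dates with an id variable so that you can filter down to the rows you want. Here is the code: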
library(tidyverse)
library(lubridate)
#Load data
response <- read.csv("https://raw.githubusercontent.com/OxCGRT/covid-policy-tracker/master/data/OxCGRT_latest.csv", fileEncoding = "UTF-8-BOM")
## setting correct date format with lubridate
response$Date <- ymd(response$Date)
##Removing some variables from the dataset
response <- response %>%
  select(-contains(c("Notes", "IsGeneral", "StringencyIndex", "Flag", "Stringency", "HealthIndex", "SupportIndex",
                     "ResponseIndex", "RegionName", "RegionCode", "CountryCode")))
Next, we number the duplicated rows within each country and date, and save the result to response2:
#Mutate
response %>%
  arrange(CountryName, Date) %>%
  group_by(CountryName, Date) %>%
  mutate(id = 1:n()) -> response2
Now we reshape the data:
#Reshape
pivot <- response2 %>%
  pivot_longer(
    cols = -c(CountryName, Date, id),
    names_to = "policy",
    values_to = "value"
  )
Then you have to decide which of the duplicated rows to keep for each date. Here I keep the first one (id == 1):
#Code
example <- pivot %>%
  filter(CountryName == "United Kingdom",
         policy == "C1_School.closing", id == 1)
Output:
# A tibble: 253 x 5
# Groups: CountryName, Date [253]
CountryName Date id policy value
<fct> <date> <int> <chr> <dbl>
1 United Kingdom 2020-01-01 1 C1_School.closing 0
2 United Kingdom 2020-01-02 1 C1_School.closing 0
3 United Kingdom 2020-01-03 1 C1_School.closing 0
4 United Kingdom 2020-01-04 1 C1_School.closing 0
5 United Kingdom 2020-01-05 1 C1_School.closing 0
6 United Kingdom 2020-01-06 1 C1_School.closing 0
7 United Kingdom 2020-01-07 1 C1_School.closing 0
8 United Kingdom 2020-01-08 1 C1_School.closing 0
9 United Kingdom 2020-01-09 1 C1_School.closing 0
10 United Kingdom 2020-01-10 1 C1_School.closing 0
# ... with 243 more rows
This has the expected number of rows.
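One way to confirm the duplicate-date diagnosis before reshaping is to count rows per country and date. The sketch below is only illustrative: it uses a small made-up data frame (the hypothetical `toy`, with invented values) instead of the real OxCGRT file.

```r
library(dplyr)

# Toy stand-in for the real data (all values are made up):
# the United Kingdom has two rows per date, Aruba has one.
toy <- data.frame(
  CountryName = c("United Kingdom", "United Kingdom", "United Kingdom",
                  "United Kingdom", "Aruba", "Aruba"),
  Date = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02",
                   "2020-01-02", "2020-01-01", "2020-01-02")),
  C1_School.closing = c(0, 0, 0, 1, 0, 0)
)

# Any (CountryName, Date) pair that appears more than once is duplicated.
dupes <- toy %>%
  count(CountryName, Date, name = "n_rows") %>%
  filter(n_rows > 1)

dupes
```

If `dupes` comes back empty on the real data, each country has exactly one row per date and the pivot should not inflate the row count.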
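Answer 1 (score: 1)
As Duck's answer notes, your data contains duplicate dates. This is because some countries have several rows per date, one for each region within the country. Following the approach suggested in the data description in the GitHub repo, you can clean the data so that only the aggregated country-level rows remain.
To do this, modify your code to keep the RegionCode column, and keep only the entries with an empty region code: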
response <- response %>%
  select(-contains(c("Notes", "IsGeneral", "StringencyIndex", "Flag", "Stringency", "HealthIndex", "SupportIndex",
                     "ResponseIndex", "RegionName", "CountryCode"))) %>%
  filter(RegionCode == "")
Pivoting now produces the result you expect:
pivot <- response %>%
  pivot_longer(
    cols = C1_School.closing:M1_Wildcard,
    names_to = "policy",
    values_to = "value"
  )
pivot %>%
  filter(CountryName == "United Kingdom",
         policy == "C1_School.closing")
Result:
# A tibble: 253 x 7
CountryName RegionCode Date ConfirmedCases ConfirmedDeaths policy value
<chr> <chr> <date> <int> <int> <chr> <dbl>
1 United Kingdom "" 2020-01-01 0 0 C1_School.closing 0
2 United Kingdom "" 2020-01-02 0 0 C1_School.closing 0
3 United Kingdom "" 2020-01-03 0 0 C1_School.closing 0
4 United Kingdom "" 2020-01-04 0 0 C1_School.closing 0
5 United Kingdom "" 2020-01-05 0 0 C1_School.closing 0
6 United Kingdom "" 2020-01-06 0 0 C1_School.closing 0
7 United Kingdom "" 2020-01-07 0 0 C1_School.closing 0
8 United Kingdom "" 2020-01-08 0 0 C1_School.closing 0
9 United Kingdom "" 2020-01-09 0 0 C1_School.closing 0
10 United Kingdom "" 2020-01-10 0 0 C1_School.closing 0
# ... with 243 more rows
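The same filter-then-pivot idea can be tried on a small made-up data frame. This is a minimal sketch under assumptions: `toy`, its values, and the region code used here are illustrative only. It also adds an `is.na()` check, which is not needed with `read.csv()` as used above but may matter with readers such as `readr::read_csv()`, whose defaults read empty fields as NA.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the real data (all values are made up): each date has
# one national row (empty RegionCode) and one regional row.
toy <- data.frame(
  CountryName = rep("United Kingdom", 4),
  RegionCode = c("", "UK_ENG", "", "UK_ENG"),
  Date = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-02")),
  C1_School.closing = c(0, 0, 0, 1),
  C2_Workplace.closing = c(0, 0, 1, 1)
)

national_long <- toy %>%
  # Keep national rows only; is.na() covers readers that turn "" into NA.
  filter(is.na(RegionCode) | RegionCode == "") %>%
  pivot_longer(
    cols = C1_School.closing:C2_Workplace.closing,
    names_to = "policy",
    values_to = "value"
  )

# Each policy should now have exactly one row per date.
rows_per_policy <- national_long %>% count(policy)
rows_per_policy
```

Filtering before pivoting keeps the long table smaller and avoids having to pick among duplicated rows afterwards.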