pivot_longer is creating more rows than expected

Asked: 2020-09-09 13:39:12

Tags: r pivot tidyr

I am working with a dataset that tracks how governments have responded to the coronavirus through policy. For plotting purposes, I used pivot_longer to gather all the policies into one column and their corresponding values into another.

To check that this worked correctly, I filtered for one specific country (the United Kingdom) and one specific policy (school closing). There should be 253 values (as of September 9, 2020), but for some reason there are five times that many. I believe the first 253 values are correct, but I don't know what is creating the extra ones. I have tried several ways to fix this without any luck. If someone could explain what I am doing wrong and how to fix it, I would greatly appreciate it. Thank you.

library(tidyverse)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union

response <- read.csv("https://raw.githubusercontent.com/OxCGRT/covid-policy-tracker/master/data/OxCGRT_latest.csv", fileEncoding = "UTF-8-BOM")


## setting correct date format with lubridate 
response$Date <- ymd(response$Date)

##Removing some variables from the dataset 
response <- response %>%
  select(-contains(c("Notes", "IsGeneral", "StringencyIndex", "Flag", "Stringency", "HealthIndex", "SupportIndex",
                     "ResponseIndex", "RegionName", "RegionCode", "CountryCode")))

## a small preview of the dataset
head(response[, c(1:4)])
#>   CountryName       Date C1_School.closing C2_Workplace.closing
#> 1       Aruba 2020-01-01                 0                    0
#> 2       Aruba 2020-01-02                 0                    0
#> 3       Aruba 2020-01-03                 0                    0
#> 4       Aruba 2020-01-04                 0                    0
#> 5       Aruba 2020-01-05                 0                    0
#> 6       Aruba 2020-01-06                 0                    0


pivot <- response %>%
  pivot_longer(
    cols = C1_School.closing:M1_Wildcard,
    names_to = "policy",
    values_to = "value"
  )

## each country should have 253 rows for each policy 

pivot %>%
  filter(CountryName == "United Kingdom",
         policy == "C1_School.closing")
#> # A tibble: 1,265 x 6
#>    CountryName    Date       ConfirmedCases ConfirmedDeaths policy         value
#>    <chr>          <date>              <int>           <int> <chr>          <dbl>
#>  1 United Kingdom 2020-01-01              0               0 C1_School.clo~     0
#>  2 United Kingdom 2020-01-02              0               0 C1_School.clo~     0
#>  3 United Kingdom 2020-01-03              0               0 C1_School.clo~     0
#>  4 United Kingdom 2020-01-04              0               0 C1_School.clo~     0
#>  5 United Kingdom 2020-01-05              0               0 C1_School.clo~     0
#>  6 United Kingdom 2020-01-06              0               0 C1_School.clo~     0
#>  7 United Kingdom 2020-01-07              0               0 C1_School.clo~     0
#>  8 United Kingdom 2020-01-08              0               0 C1_School.clo~     0
#>  9 United Kingdom 2020-01-09              0               0 C1_School.clo~     0
#> 10 United Kingdom 2020-01-10              0               0 C1_School.clo~     0
#> # ... with 1,255 more rows

## there are 5x as many rows as needed. 
## there should only be 253 days of data for one policy and one country 
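Before pivoting, a quick diagnostic (this snippet is my addition, not part of the original question) can reveal whether any country has more than one row per date, which would multiply the row count after pivot_longer:

```r
## Any CountryName/Date combination with n > 1 indicates duplicated dates
response %>%
  count(CountryName, Date) %>%
  filter(n > 1) %>%
  head()
```

If this returns any rows, the extra rows in the pivoted output come from those duplicates rather than from pivot_longer itself.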

Created on 2020-09-09 by the reprex package (v0.3.0)

2 Answers:

Answer 0 (score: 1)

After digging into your data, I found that there are duplicated dates. This is the main issue, and as a data analyst you have to decide how to handle it. The code below adds an id variable that identifies the duplicated dates so that you can filter for the ones you want. Here is the code:

library(tidyverse)
library(lubridate)
#Load data
response <- read.csv("https://raw.githubusercontent.com/OxCGRT/covid-policy-tracker/master/data/OxCGRT_latest.csv", fileEncoding = "UTF-8-BOM")

## setting correct date format with lubridate 
response$Date <- ymd(response$Date)

##Removing some variables from the dataset 
response <- response %>%
  select(-contains(c("Notes", "IsGeneral", "StringencyIndex", "Flag", "Stringency", "HealthIndex", "SupportIndex",
                     "ResponseIndex", "RegionName", "RegionCode", "CountryCode")))

Next, we flag the duplicated rows within each country and date and save the result to response2:

#Mutate
response %>% 
  arrange(CountryName,Date) %>%
  group_by(CountryName,Date ) %>%
  mutate(id=1:n()) -> response2

Now we reshape the data:

#Reshape
pivot <- response2 %>%
  pivot_longer(
    cols = -c(CountryName,Date,id),
    names_to = "policy",
    values_to = "value"
  )

Then you have to decide which date to keep. Here I will keep the first one (id == 1):

#Code
example <- pivot %>%
  filter(CountryName == "United Kingdom",
         policy == "C1_School.closing",id==1)

Output:

# A tibble: 253 x 5
# Groups:   CountryName, Date [253]
   CountryName    Date          id policy            value
   <fct>          <date>     <int> <chr>             <dbl>
 1 United Kingdom 2020-01-01     1 C1_School.closing     0
 2 United Kingdom 2020-01-02     1 C1_School.closing     0
 3 United Kingdom 2020-01-03     1 C1_School.closing     0
 4 United Kingdom 2020-01-04     1 C1_School.closing     0
 5 United Kingdom 2020-01-05     1 C1_School.closing     0
 6 United Kingdom 2020-01-06     1 C1_School.closing     0
 7 United Kingdom 2020-01-07     1 C1_School.closing     0
 8 United Kingdom 2020-01-08     1 C1_School.closing     0
 9 United Kingdom 2020-01-09     1 C1_School.closing     0
10 United Kingdom 2020-01-10     1 C1_School.closing     0
# ... with 243 more rows

which has the expected number of rows.
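To confirm this holds for every country and policy, not just the filtered example, a sanity check like the following could be run (my addition, assuming the `pivot` and `id` objects defined above; `ungroup` is needed because `response2` is still grouped):

```r
## After keeping only id == 1, every country/policy pair should have
## the same number of rows (one per date in the dataset)
pivot %>%
  ungroup() %>%
  filter(id == 1) %>%
  count(CountryName, policy) %>%
  summarise(min_rows = min(n), max_rows = max(n))
```

If min_rows and max_rows both equal 253 (as of 2020-09-09), the deduplication worked across the whole dataset.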

Answer 1 (score: 1)

As noted in Duck's answer, there are duplicated dates in your data. This is because some countries have multiple rows per date, reflecting different regions within the same country. Following the data description in the GitHub repo, you can clean the data so that only the aggregate country-level rows remain.

To do this, modify your code to keep the RegionCode column and filter for entries with an empty region code:

response <- response %>%
  select(-contains(c("Notes", "IsGeneral", "StringencyIndex", "Flag", "Stringency", "HealthIndex", "SupportIndex",
                     "ResponseIndex", "RegionName", "CountryCode"))) %>% 
  filter(RegionCode == "")

Now the pivot produces the result you expect:

pivot <- response %>%
  pivot_longer(
    cols = C1_School.closing:M1_Wildcard,
    names_to = "policy",
    values_to = "value"
  )
pivot %>%
  filter(CountryName == "United Kingdom",
         policy == "C1_School.closing")

Result:

# A tibble: 253 x 7
   CountryName    RegionCode Date       ConfirmedCases ConfirmedDeaths policy            value
   <chr>          <chr>      <date>              <int>           <int> <chr>             <dbl>
 1 United Kingdom ""         2020-01-01              0               0 C1_School.closing     0
 2 United Kingdom ""         2020-01-02              0               0 C1_School.closing     0
 3 United Kingdom ""         2020-01-03              0               0 C1_School.closing     0
 4 United Kingdom ""         2020-01-04              0               0 C1_School.closing     0
 5 United Kingdom ""         2020-01-05              0               0 C1_School.closing     0
 6 United Kingdom ""         2020-01-06              0               0 C1_School.closing     0
 7 United Kingdom ""         2020-01-07              0               0 C1_School.closing     0
 8 United Kingdom ""         2020-01-08              0               0 C1_School.closing     0
 9 United Kingdom ""         2020-01-09              0               0 C1_School.closing     0
10 United Kingdom ""         2020-01-10              0               0 C1_School.closing     0
# ... with 243 more rows
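As a final check (my addition, not part of the original answer), you can verify that the fix holds for every country/policy pair, not just the UK school-closing example:

```r
## With regional rows filtered out, every country/policy pair should
## appear the same number of times (one row per date in the dataset)
pivot %>%
  count(CountryName, policy) %>%
  distinct(n)
```

A single distinct value of n (253 as of 2020-09-09) confirms there are no remaining duplicates anywhere in the data.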