Web scraping with R: gz / csv files

Time: 2020-07-18 20:35:34

Tags: r web-scraping

I am trying to read the file at this link: COVID CSV

I am using read.csv, but it does not seem to work:

read.table(file = "https://data.brasil.io/dataset/covid19/caso.csv.gz")
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
line 1 did not have 3 elements

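I am trying to write code that pulls the COVID data directly from this site, so that I don't have to download the file by hand every time I want to use it.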
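
A likely reason for the error above: when base R opens a URL, it does not decompress the gzip stream, so read.table sees compressed bytes and, with its default whitespace separator, cannot find consistent columns. A minimal base-R sketch of one workaround is to download the file to a temporary path first and let read.csv decompress it from disk. The helper name `read_remote_csv_gz` and the tiny demo file are assumptions made for illustration; they are not part of the original question.

```r
# Hypothetical helper (not from the original post): download a gzip-compressed
# CSV to a temporary file, then read it with read.csv(), which detects the
# gzip compression of a local file on its own.
read_remote_csv_gz <- function(src, ...) {
  tmp <- tempfile(fileext = ".csv.gz")
  on.exit(unlink(tmp), add = TRUE)
  # mode = "wb" keeps the compressed bytes intact (matters on Windows)
  download.file(src, tmp, mode = "wb", quiet = TRUE)
  read.csv(tmp, stringsAsFactors = FALSE, ...)
}

# Real use (needs network access, and assumes the URL is still served):
# covid <- read_remote_csv_gz("https://data.brasil.io/dataset/covid19/caso.csv.gz")

# Offline illustration with a small made-up file that mimics a few columns:
demo_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(demo_path, "wt")
writeLines(c("date,state,confirmed",
             "2020-07-17,AP,33436",
             "2020-07-16,AP,33004"), con)
close(con)

# read.csv() reads the compressed local file directly
demo <- read.csv(demo_path, stringsAsFactors = FALSE)
print(demo)
```

Because the helper downloads a fresh copy on every call, the data stays current without any manual download step.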

2 answers:

Answer 0: (score: 0)

We can use fread from the data.table package:

library(data.table)
fread("https://data.brasil.io/dataset/covid19/caso.csv.gz")
#             date state city place_type confirmed deaths order_for_place is_last estimated_population_2019
#     1: 2020-07-17    AP           state     33436    499             119    TRUE                    845731
#     2: 2020-07-16    AP           state     33004    493             118   FALSE                    845731
#     3: 2020-07-15    AP           state     32408    488             117   FALSE                    845731
#     4: 2020-07-14    AP           state     31885    483             116   FALSE                    845731
#     5: 2020-07-13    AP           state     31552    478             115   FALSE                    845731
#    ---                                                                                                    
#372166: 2020-06-23    SP Óleo       city         1      0               5   FALSE                      2496
#372167: 2020-06-22    SP Óleo       city         1      0               4   FALSE                      2496
#372168: 2020-06-21    SP Óleo       city         1      0               3   FALSE                      2496
#372169: 2020-06-20    SP Óleo       city         1      0               2   FALSE                      2496
#372170: 2020-06-19    SP Óleo       city         1      0               1   FALSE                      2496
#        city_ibge_code confirmed_per_100k_inhabitants death_rate
#     1:             16                      3953.5030     0.0149
#     2:             16                      3902.4229     0.0149
#     3:             16                      3831.9513     0.0151
#     4:             16                      3770.1113     0.0151
#     5:             16                      3730.7371     0.0151
#    ---                                                         
#372166:        3533809                        40.0641     0.0000
#372167:        3533809                        40.0641     0.0000
#372168:        3533809                        40.0641     0.0000
#372169:        3533809                        40.0641     0.0000
#372170:        3533809                        40.0641     0.0000

Answer 1: (score: 0)

It seems to work with readr::read_csv:

readr::read_csv("https://data.brasil.io/dataset/covid19/caso.csv.gz")

# A tibble: 376,064 x 12
#   date       state city  place_type confirmed deaths order_for_place is_last
#   <date>     <chr> <chr> <chr>          <dbl>  <dbl>           <dbl> <lgl>  
# 1 2020-07-18 AC    NA    state          17202    457             124 TRUE   
# 2 2020-07-17 AC    NA    state          16965    452             123 FALSE  
# 3 2020-07-16 AC    NA    state          16865    447             122 FALSE  
# 4 2020-07-15 AC    NA    state          16672    446             121 FALSE  
# 5 2020-07-14 AC    NA    state          16479    436             120 FALSE  
# 6 2020-07-13 AC    NA    state          16260    430             119 FALSE  
# 7 2020-07-12 AC    NA    state          16190    426             118 FALSE  
# 8 2020-07-11 AC    NA    state          16080    419             117 FALSE  
# 9 2020-07-10 AC    NA    state          15768    417             116 FALSE  
#10 2020-07-09 AC    NA    state          15465    411             115 FALSE  
# … with 376,054 more rows, and 4 more variables:
#   estimated_population_2019 <dbl>, city_ibge_code <dbl>,
#   confirmed_per_100k_inhabitants <dbl>, death_rate <dbl>
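
The tibble above shows the counts parsed as `<dbl>`. If integer counts and explicit types are preferred, column types can be declared up front. This is only a sketch: the type choices are my assumption about the data, and the demo file is a small made-up sample using two rows from the output above rather than the full download.

```r
library(readr)

# Assumed column types for a subset of the dataset's columns
covid_types <- cols(
  date       = col_date(),
  state      = col_character(),
  city       = col_character(),
  place_type = col_character(),
  confirmed  = col_integer(),
  deaths     = col_integer(),
  is_last    = col_logical()
)

# Real use (needs network access); .default lets readr guess the rest:
# covid <- read_csv("https://data.brasil.io/dataset/covid19/caso.csv.gz",
#                   col_types = cols(.default = col_guess(),
#                                    confirmed = col_integer(),
#                                    deaths = col_integer()))

# Offline illustration: write a tiny gzip-compressed CSV and read it back.
demo_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(demo_path, "wt")
writeLines(c("date,state,city,place_type,confirmed,deaths,is_last",
             "2020-07-18,AC,,state,17202,457,TRUE",
             "2020-07-17,AC,,state,16965,452,FALSE"), con)
close(con)

demo <- read_csv(demo_path, col_types = covid_types)
print(demo)
```

Declaring the types also silences the column specification message that read_csv prints when it has to guess.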