Reading a zip file directly with read_csv from readr produces strange results

Time: 2018-05-18 15:47:40

Tags: r readr

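I'm trying to read a zip file containing a pipe-delimited text file directly from a URL. If I download the file and then read it from disk with read_csv, I have no trouble. But if I use read_csv to read the URL directly, I get garbage in the resulting df. I can work around this by coding the download first and then the read, but it seems like it should work directly. Any clues about what's going on here?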

library(readr)
url <- "https://www.rma.usda.gov/data/sob/sccc/sobcov_2018.zip"
df <- read_delim(url, delim='|',
                 col_names = c('year','stFips','stAbbr','coFips','coName',
                               'cropCd','cropName','planCd','planAbbr','coverCat',
                               'deliveryType','covLevel','policyCount','policyPremCount','policyIndemCount',
                               'unitsReportingPrem', 'indemCount','quantType', 'quantNet', 'companionAcres',
                               'liab','prem','subsidy','indem', 'lossRatio'))
#> Parsed with column specification:
#> cols(
#>   .default = col_character()
#> )
#> See spec(...) for full column specifications.
#> Warning in rbind(names(probs), probs_f): number of columns of result is not
#> a multiple of vector length (arg 1)
#> Warning: 7908 parsing failures.
#> row # A tibble: 5 x 5 col     row col   expected   actual        file                                expected   <int> <chr> <chr>      <chr>         <chr>                               actual 1     1 year  ""         embedded null 'https://www.rma.usda.gov/data/sob… file 2     1 <NA>  25 columns 1 columns     'https://www.rma.usda.gov/data/sob… row 3     2 <NA>  25 columns 4 columns     'https://www.rma.usda.gov/data/sob… col 4     3 <NA>  25 columns 2 columns     'https://www.rma.usda.gov/data/sob… expected 5     4 year  ""         embedded null 'https://www.rma.usda.gov/data/sob…
#> ... ................. ... .......................................................................... ........ .......................................................................... ...... .......................................................................... .... .......................................................................... ... .......................................................................... ... .......................................................................... ........ ..........................................................................
#> See problems(...) for more details.
head(df)
#> # A tibble: 6 x 25
#>   year     stFips   stAbbr  coFips  coName cropCd cropName planCd planAbbr
#>   <chr>    <chr>    <chr>   <chr>   <chr>  <chr>  <chr>    <chr>  <chr>   
#> 1 "PK\u00… <NA>     <NA>    <NA>    <NA>   <NA>   <NA>     <NA>   <NA>    
#> 2 "K\xe6\… "\xf5\x… "\xc5\… "\xfa\… <NA>   <NA>   <NA>     <NA>   <NA>    
#> 3 "\xb0\x… "\xfd\x… <NA>    <NA>    <NA>   <NA>   <NA>     <NA>   <NA>    
#> 4 "j`/Q\x… "\x96\x… <NA>    <NA>    <NA>   <NA>   <NA>     <NA>   <NA>    
#> 5 "\xc0\x… <NA>     <NA>    <NA>    <NA>   <NA>   <NA>     <NA>   <NA>    
#> 6 "z\xe4\… "~y\xf5… <NA>    <NA>    <NA>   <NA>   <NA>     <NA>   <NA>    
#> # ... with 16 more variables: coverCat <chr>, deliveryType <chr>,
#> #   covLevel <chr>, policyCount <chr>, policyPremCount <chr>,
#> #   policyIndemCount <chr>, unitsReportingPrem <chr>, indemCount <chr>,
#> #   quantType <chr>, quantNet <chr>, companionAcres <chr>, liab <chr>,
#> #   prem <chr>, subsidy <chr>, indem <chr>, lossRatio <chr>

If I download the file first, I get the following output:

> url <- './data/sobcov_2018.zip'
> df <- read_delim(url, delim='|',
+                  col_names = c('year','stFips','stAbbr','coFips','coName',
+                                'cropCd','cropName','planCd','planAbbr','coverCat',
+                                'deliveryType','covLevel','policyCount','policyPremCount','policyIndemCount',
+                                'unitsReportingPrem', 'indemCount','quantType', 'quantNet', 'companionAcres',
+                                'liab','prem','subsidy','indem', 'lossRatio'))
Parsed with column specification:
cols(
  .default = col_integer(),
  stFips = col_character(),
  stAbbr = col_character(),
  coFips = col_character(),
  coName = col_character(),
  cropCd = col_character(),
  cropName = col_character(),
  planCd = col_character(),
  planAbbr = col_character(),
  coverCat = col_character(),
  deliveryType = col_character(),
  covLevel = col_double(),
  quantType = col_character(),
  lossRatio = col_double()
)
See spec(...) for full column specifications.
> head(df)
# A tibble: 6 x 25
   year stFips stAbbr coFips coName       cropCd cropName      planCd planAbbr coverCat deliveryType covLevel
  <int> <chr>  <chr>  <chr>  <chr>        <chr>  <chr>         <chr>  <chr>    <chr>    <chr>           <dbl>
1  2018 02     AK     999    "All Other … 9999   "All Other C… 01     "YP    … "A    "  RBUP            0.500
2  2018 02     AK     240    "Southeast … 9999   "All Other C… 90     "APH   … "A    "  RBUP            0.500
3  2018 02     AK     240    "Southeast … 9999   "All Other C… 90     "APH   … "A    "  RBUP            0.750
4  2018 02     AK     240    "Southeast … 9999   "All Other C… 90     "APH   … "C    "  RCAT            0.500
5  2018 02     AK     240    "Southeast … 9999   "All Other C… 02     "RP    … "A    "  RBUP            0.600
6  2018 02     AK     240    "Southeast … 9999   "All Other C… 02     "RP    … "A    "  RBUP            0.750
# ... with 13 more variables: policyCount <int>, policyPremCount <int>, policyIndemCount <int>,
#   unitsReportingPrem <int>, indemCount <int>, quantType <chr>, quantNet <int>, companionAcres <int>,
#   liab <int>, prem <int>, subsidy <int>, indem <int>, lossRatio <dbl>
> 

1 Answer:

Answer 0: (score: 3)

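readr can only handle gz-compressed files as remote sources, because the other compression algorithms have no analog of base::gzcon(). That is why the zip bytes (note the "PK" signature in the first row) get parsed as if they were text. See this GitHub issue for the discussion and the improved documentation (also in ?readr::datasource).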
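
The workaround mentioned in the question (download first, then read from disk) can be wrapped in a small helper. This is only a minimal sketch: `read_remote_compressed` is a hypothetical name, not part of readr, and it assumes readr recognises the compression of a local file from its extension, as the local read in the question suggests. Behavior may differ across readr versions.

```r
library(readr)

# Hypothetical helper (not a readr function): download a remote compressed
# file to a temporary file, then let readr read it from disk.
read_remote_compressed <- function(url, ...) {
  # Keep the original extension (e.g. "zip") so readr can detect the
  # compression of the local copy.
  ext <- tools::file_ext(url)
  tmp <- tempfile(fileext = paste0(".", ext))
  on.exit(unlink(tmp), add = TRUE)

  # mode = "wb" avoids corrupting binary files on Windows.
  download.file(url, tmp, mode = "wb", quiet = TRUE)

  # Remaining arguments (delim, col_names, ...) are passed to read_delim().
  read_delim(tmp, ...)
}

# Usage with the URL from the question (requires network access):
# df <- read_remote_compressed(
#   "https://www.rma.usda.gov/data/sob/sccc/sobcov_2018.zip",
#   delim = "|", col_names = FALSE
# )
```

The temporary file is deleted when the function exits, so nothing is left behind in the working directory. Pass the same `col_names` vector as in the question to get the named columns.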