我试图直接从URL读取以获取包含管道分隔文本文件的zip文件。如果我下载文件,然后使用read_csv
从磁盘读取它,我没有问题。但是,如果我尝试使用read_csv
直接读取URL,我会在生成的df中得到垃圾。我可以通过下载编码然后阅读来解决这个问题。但它似乎应该直接起作用。关于这里发生了什么的任何线索?
library(readr)
url <- "https://www.rma.usda.gov/data/sob/sccc/sobcov_2018.zip"
df <- read_delim(url, delim='|',
col_names = c('year','stFips','stAbbr','coFips','coName',
'cropCd','cropName','planCd','planAbbr','coverCat',
'deliveryType','covLevel','policyCount','policyPremCount','policyIndemCount',
'unitsReportingPrem', 'indemCount','quantType', 'quantNet', 'companionAcres',
'liab','prem','subsidy','indem', 'lossRatio'))
#> Parsed with column specification:
#> cols(
#> .default = col_character()
#> )
#> See spec(...) for full column specifications.
#> Warning in rbind(names(probs), probs_f): number of columns of result is not
#> a multiple of vector length (arg 1)
#> Warning: 7908 parsing failures.
#> row # A tibble: 5 x 5 col row col expected actual file expected <int> <chr> <chr> <chr> <chr> actual 1 1 year "" embedded null 'https://www.rma.usda.gov/data/sob… file 2 1 <NA> 25 columns 1 columns 'https://www.rma.usda.gov/data/sob… row 3 2 <NA> 25 columns 4 columns 'https://www.rma.usda.gov/data/sob… col 4 3 <NA> 25 columns 2 columns 'https://www.rma.usda.gov/data/sob… expected 5 4 year "" embedded null 'https://www.rma.usda.gov/data/sob…
#> ... ................. ... .......................................................................... ........ .......................................................................... ...... .......................................................................... .... .......................................................................... ... .......................................................................... ... .......................................................................... ........ ..........................................................................
#> See problems(...) for more details.
head(df)
#> # A tibble: 6 x 25
#> year stFips stAbbr coFips coName cropCd cropName planCd planAbbr
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 "PK\u00… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 2 "K\xe6\… "\xf5\x… "\xc5\… "\xfa\… <NA> <NA> <NA> <NA> <NA>
#> 3 "\xb0\x… "\xfd\x… <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 4 "j`/Q\x… "\x96\x… <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 5 "\xc0\x… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 6 "z\xe4\… "~y\xf5… <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> # ... with 16 more variables: coverCat <chr>, deliveryType <chr>,
#> # covLevel <chr>, policyCount <chr>, policyPremCount <chr>,
#> # policyIndemCount <chr>, unitsReportingPrem <chr>, indemCount <chr>,
#> # quantType <chr>, quantNet <chr>, companionAcres <chr>, liab <chr>,
#> # prem <chr>, subsidy <chr>, indem <chr>, lossRatio <chr>
如果我先下载,我会得到以下输出:
> url <- './data/sobcov_2018.zip'
> df <- read_delim(url, delim='|',
+ col_names = c('year','stFips','stAbbr','coFips','coName',
+ 'cropCd','cropName','planCd','planAbbr','coverCat',
+ 'deliveryType','covLevel','policyCount','policyPremCount','policyIndemCount',
+ 'unitsReportingPrem', 'indemCount','quantType', 'quantNet', 'companionAcres',
+ 'liab','prem','subsidy','indem', 'lossRatio'))
Parsed with column specification:
cols(
.default = col_integer(),
stFips = col_character(),
stAbbr = col_character(),
coFips = col_character(),
coName = col_character(),
cropCd = col_character(),
cropName = col_character(),
planCd = col_character(),
planAbbr = col_character(),
coverCat = col_character(),
deliveryType = col_character(),
covLevel = col_double(),
quantType = col_character(),
lossRatio = col_double()
)
See spec(...) for full column specifications.
> head(df)
# A tibble: 6 x 25
year stFips stAbbr coFips coName cropCd cropName planCd planAbbr coverCat deliveryType covLevel
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 2018 02 AK 999 "All Other … 9999 "All Other C… 01 "YP … "A " RBUP 0.500
2 2018 02 AK 240 "Southeast … 9999 "All Other C… 90 "APH … "A " RBUP 0.500
3 2018 02 AK 240 "Southeast … 9999 "All Other C… 90 "APH … "A " RBUP 0.750
4 2018 02 AK 240 "Southeast … 9999 "All Other C… 90 "APH … "C " RCAT 0.500
5 2018 02 AK 240 "Southeast … 9999 "All Other C… 02 "RP … "A " RBUP 0.600
6 2018 02 AK 240 "Southeast … 9999 "All Other C… 02 "RP … "A " RBUP 0.750
# ... with 13 more variables: policyCount <int>, policyPremCount <int>, policyIndemCount <int>,
# unitsReportingPrem <int>, indemCount <int>, quantType <chr>, quantNet <int>, companionAcres <int>,
# liab <int>, prem <int>, subsidy <int>, indem <int>, lossRatio <dbl>
>
答案 0 :(得分:3)
readr
只能处理gz
个压缩文件作为远程源,因为其他压缩算法没有base::gzcon()
的类似物。有关讨论和this github issue(也在?readr::datasource
中),请参阅improved documentation。