curl issue: brackets in URLs

Date: 2013-12-13 18:00:07

Tags: r curl

I want to download a vector of URLs from R, using curl on Mac OSX:

## URLs
grab = c("http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1079_33994-C81-I620_5-ANI-L056-00001[006154]ready//DA_2011-06-03_STINGA SIMONA_30381371.pdf", 
         "http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1486_67011-C27-I620_6-ANI-L141-00001[045849]ready//DA_2012-05-28_SORIN VASILE_1308151.pdf", 
         "http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1151_34934-C93-I620_6-ANI-L058-00001[005631]ready//DI_2011-05-25_CONSTANTIN CATALIN IONITA_50364334.pdf", 
         "http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1486_66964-C65-I620_5-ANI-L141-00001[045952]ready//DA_2012-05-24_DORINA ORZAC_1312037.pdf", 
         "http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1486_67290-C65-I620_5-ANI-L141-00001[045768]ready//DI_2012-06-01_JIPA CAMELIA_1304833.pdf", 
         "http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1151_34936-C74-I620_7-ANI-L058-00001[005633]ready//DA_2011-06-09_NICOLE MOT_50364493.pdf", 
         "http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1151_34937-C74-I620_7-ANI-L058-00001[005634]ready//DA_2011-06-14_PETRE ECATERINA_50364543.pdf", 
         "http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1566_67978-C85-I780_2-ANI-L144-00001[046398]ready//DA_2012-05-25_RAMONA GHIERGA_1332323.pdf", 
         "http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1151_34936-C74-I620_7-ANI-L058-00001[005633]ready//DA_2011-06-05_LOVIN G. ADINA_50364475.pdf", 
         "http://declaratii.integritate.eu/UserFiles/PDFfiles/RP2135_40131-C90-I780_3-ANI-L069-00001[009742]ready//DI_2011-05-25_VARTOLOMEI PAUL-CONSTANTIN_467652.pdf", 
         "http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1086_34373-C11-I620_3-ANI-L057-00001[005657]ready//DI_2011-05-16_CAZACU LILIANA_40437536.pdf", 
         "http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1151_34935-C93-I620_6-ANI-L058-00001[005632]ready//DI_2011-06-07_ROSCA EUGEN-CONSTANTIN_50364400.pdf", 
         "http://declaratii.integritate.eu/UserFiles/PDFfiles/RP181_27399-C11-I780_2-ANI-L051-00001[005421]ready//DI_2010-11-03_DIAMANDI SAVA-CONSTANTIN_40429564.pdf", 
         "http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1151_34936-C74-I620_7-ANI-L058-00001[005633]ready//DI_2011-06-07_ZAMFIRESCU I. IULIA_50364498.pdf", 
         "http://declaratii.integritate.eu/UserFiles/PDFfiles/RP1563_67587-C71-I780_3-ANI-L143-00001[046079]ready//DI_2012-05-21_MAZURU C. EMILIA_1317509.pdf"
)

My first attempt returned HTTP error 400:

## fails on Mac OSX 10.9 (HTTP 400)
## for(x in grab) download.file(x, destfile = gsub("(.*)//D", "D", x))

I learned that this was due to the URLs containing brackets, so I applied the globoff fix like this:

## also fails despite fixing HTTP Err 400 (files are zero-sized)
for(x in grab) download.file(x, destfile = gsub("(.*)//D", "D", x), method = "curl", extra = "--globoff")

...and now the files download, but they are all empty (zero-sized).

What am I doing wrong?

P.S. I am willing to switch to Python or shell to get the files, but would prefer to keep the code 100% R.

2 answers:

Answer 0: (score: 1)

Have you tried URL-encoding the brackets?

%5B = [

%5D = ]
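As a minimal sketch of that substitution in R: the helper below replaces the brackets with their percent-encoded forms using gsub with fixed = TRUE. The name encode_brackets is hypothetical, and encoding the spaces as %20 as well is an assumption on my part (the URLs in the question contain spaces, which are not valid in a URL either). Whether this alone cures the zero-sized files has not been verified here.

```r
# Hypothetical helper: percent-encode brackets (and spaces) in a URL.
# fixed = TRUE treats "[" and "]" as literal characters, not regex syntax.
encode_brackets <- function(url) {
  url <- gsub("[", "%5B", url, fixed = TRUE)
  url <- gsub("]", "%5D", url, fixed = TRUE)
  gsub(" ", "%20", url, fixed = TRUE)  # assumption: spaces need encoding too
}

x <- "http://example.com/[xyz]//file with spaces.pdf"
encode_brackets(x)
# [1] "http://example.com/%5Bxyz%5D//file%20with%20spaces.pdf"
```

Because gsub is vectorised, encode_brackets(grab) would encode the whole vector from the question in one call.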

Answer 1: (score: 0)

A bit late, but URLencode is what you would use to make sure a URL is well formed.

> x <- "http://example.com/[xyz]//file with spaces.pdf"
> URLencode(x)
[1] "http://example.com/%5bxyz%5d//file%20with%20spaces.pdf"
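Applied to the loop from the question, this might look roughly like the sketch below. It assumes the input is a character vector like grab, keeps the question's rule for deriving the file name from the raw URL, and passes only the encoded URL to download.file. The function name fetch_all is hypothetical. Depending on the R version, URLencode may leave reserved characters such as [ and ] untouched by default, so it is worth inspecting the encoded URLs before downloading; --globoff is kept for that case.

```r
# Sketch: encode each URL before handing it to curl.
# `fetch_all` is a hypothetical name; `urls` is a character vector like `grab`.
fetch_all <- function(urls) {
  for (x in urls) {
    dest <- gsub("(.*)//D", "D", x)  # same file-name rule as in the question
    download.file(URLencode(x), destfile = dest,
                  method = "curl", extra = "--globoff")
  }
}

# Inspect the encoded form first (no network access needed for this step).
vapply("http://example.com/[xyz]//DA file.pdf", URLencode,
       character(1), USE.NAMES = FALSE)
```

Calling fetch_all(grab) would then attempt the downloads; that call is left out here because it needs network access.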