How to encode a URL in R

Asked: 2014-05-28 20:31:13

Tags: r web-services rest

2 answers:

Answer 0 (score: 3)

It seems that you want to drop everything except the first GET parameter of the URL and then encode the data associated with it.

url <- "..."
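library(stringi)
# Base address: drop everything from the first "?" onwards
(addr <- stri_replace_all_regex(url, "\\?.*", ""))
## [1] "http://www.chemspider.com/inchi.asmx/InChIToSMILES"
# First GET parameter: args[,2] is its name, args[,3] is its value
args <- stri_match_first_regex(url, "[?&](.*?)=([^&]+)")
# Hex-escape every character outside A-Za-z0-9()- as &#xNN;, then rewrite &#xNN; as %NN
(data <- stri_replace_all_regex(
    stri_trans_general(args[,3], "[^a-zA-Z0-9\\-()]Any-Hex/XML"),
    "&#x([0-9a-fA-F]{2});", "%$1"))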
## [1] "InchI%3D1S%2FC21H30O9%2Fc1-11(5-6-21(28)12(2)8-13(23)9-20(21%2C3)4)7-15(24)30-19-18(27)17(26)16(25)14(10-22)29-19%2Fh5-8%2C14%2C16-19%2C22%2C25-28H%2C9-10H2%2C1-4H3%2Fb6-5%2B%2C11-7-%2Ft14-%2C16-%2C17%2B%2C18-%2C19%2B%2C21-%2Fm1%2Fs1"
(addr <- stri_c(addr, "?", args[,2], "=", data))
## [1] "http://www.chemspider.com/inchi.asmx/InChIToSMILES?inchi=InchI%3D1S%2FC21H30O9%2Fc1-11(5-6-21(28)12(2)8-13(23)9-20(21%2C3)4)7-15(24)30-19-18(27)17(26)16(25)14(10-22)29-19%2Fh5-8%2C14%2C16-19%2C22%2C25-28H%2C9-10H2%2C1-4H3%2Fb6-5%2B%2C11-7-%2Ft14-%2C16-%2C17%2B%2C18-%2C19%2B%2C21-%2Fm1%2Fs1"

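Here I used ICU's transliterator (via stri_trans_general). Every character other than A..Z, a..z, 0..9, "(", ")" and "-" is converted to a hexadecimal representation of the form &#xNN; (URLencode does not seem to handle all of these characters, even with reserved=TRUE). Each &#xNN; is then rewritten as %NN by stri_replace_all_regex.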
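For readers without stringi, the same idea (escape every character outside a small safe set as %NN) can be sketched in base R. This is only an illustration: the helper name encode_value, the safe-character class, and the example.com URL are assumptions made for the example and are not part of the answer above.

```r
# Percent-encode every character of x that does not match the `safe` class.
# encode_value is a hypothetical helper, not a function from any package.
encode_value <- function(x, safe = "[A-Za-z0-9()-]") {
  chars <- strsplit(x, "")[[1]]
  needs_escape <- !grepl(safe, chars)
  chars[needs_escape] <- vapply(
    chars[needs_escape],
    # charToRaw yields one or more bytes; each byte becomes %NN
    function(ch) paste0("%", toupper(as.character(charToRaw(ch))), collapse = ""),
    character(1),
    USE.NAMES = FALSE
  )
  paste(chars, collapse = "")
}

# Hypothetical URL with two GET parameters; only the first one is kept.
url <- "http://example.com/convert?inchi=InChI=1S/CH4/h1H4&token=abc"

# parts: full match, base address, parameter name, parameter value
parts <- regmatches(url, regexec("^([^?]*)\\?([^=&]+)=([^&]*)", url))[[1]]
encoded <- paste0(parts[2], "?", parts[3], "=", encode_value(parts[4]))
print(encoded)
```

Under these assumptions the value InChI=1S/CH4/h1H4 becomes InChI%3D1S%2FCH4%2Fh1H4, while the base address and the parameter name are left untouched.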

Answer 1 (score: 2)

Here are two approaches:

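1) gsubfn / URLencode: If u is an R character string containing the URL, try this. It passes everything after the ? to URLencode and replaces that input with the function's output. Note that "\\K" discards everything matched up to that point, so the ? itself is not encoded: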

library(gsubfn)
gsubfn("\\?\\K(.*)", ~ URLencode(x, TRUE), u, perl = TRUE)

This gives the following (which differs from the output in the question but may be sufficient):

http://www.chemspider.com/inchi.asmx/InChIToSMILES?inchi%3dInchI%3d1S%2fC21H30O9%2fc1-11(5-6-21(28)12(2)8-13(23)9-20(21,3)4)7-15(24)30-19-18(27)17(26)16(25)14(10-22)29-19%2fh5-8,14,16-19,22,25-28H,9-10H2,1-4H3%2fb6-5+,11-7-%2ft14-,16-,17+,18-,19+,21-%2fm1%2fs1%26token%3de4a6d6fb-ae07-4cf6-bae8-c0e6115bc681
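If gsubfn is not available, a similar result can be sketched in base R by splitting the string at the first ? and encoding only the tail. The helper name encode_query and the example.com URL below are assumptions for illustration. As with the gsubfn call, the = and & separators in the query are encoded too, and the case of the hex digits may depend on the R version.

```r
# Encode everything after the first "?" with URLencode, leaving the base
# address and the "?" itself untouched. encode_query is a hypothetical helper.
encode_query <- function(u) {
  pos <- regexpr("?", u, fixed = TRUE)
  if (pos < 0) return(u)                       # no query string: nothing to do
  base  <- substr(u, 1, pos)                   # up to and including "?"
  query <- substr(u, pos + 1, nchar(u))        # everything after "?"
  paste0(base, URLencode(query, reserved = TRUE))
}

u <- "http://example.com/convert?inchi=InChI=1S/CH4/h1H4"  # hypothetical URL
out <- encode_query(u)
print(out)

# Decoding the encoded tail recovers the original query string.
print(URLdecode(sub("^[^?]*\\?", "", out)))
```

Splitting explicitly avoids the need for "\\K", at the cost of a few more lines than the one-line gsubfn call.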

2) gsubfn / curlEscape: For slightly different output, keep using gsubfn but try curlEscape from the RCurl package instead:

library(RCurl)
gsubfn("\\?\\K(.*)", curlEscape, u, perl = TRUE)

which gives:

http://www.chemspider.com/inchi.asmx/InChIToSMILES?inchi%3DInchI%3D1S%2FC21H30O9%2Fc1%2D11%285%2D6%2D21%2828%2912%282%298%2D13%2823%299%2D20%2821%2C3%294%297%2D15%2824%2930%2D19%2D18%2827%2917%2826%2916%2825%2914%2810%2D22%2929%2D19%2Fh5%2D8%2C14%2C16%2D19%2C22%2C25%2D28H%2C9%2D10H2%2C1%2D4H3%2Fb6%2D5%2B%2C11%2D7%2D%2Ft14%2D%2C16%2D%2C17%2B%2C18%2D%2C19%2B%2C21%2D%2Fm1%2Fs1%26token%3De4a6d6fb%2Dae07%2D4cf6%2Dbae8%2Dc0e6115bc681
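The two outputs differ in the case of the hex digits and in how many characters are escaped (curlEscape also escapes "-", "(", ")" and ","), but both should decode to the same text. A small base R check on shortened fragments of the two outputs, using utils::URLdecode, might look like this. The variable names are illustrative only.

```r
# Shortened fragments in the style of the two outputs above.
lower <- "inchi%3dInchI%3d1S%2fC21H30O9"   # lowercase hex, as in approach 1
upper <- "inchi%3DInchI%3D1S%2FC21H30O9"   # uppercase hex, as in approach 2
over  <- "h5%2D8%2C14"                     # "-" and "," escaped as well

# Hex digits in percent-escapes are case-insensitive when decoding.
print(URLdecode(lower) == URLdecode(upper))

# Escaping "-" and "," is harmless: decoding restores them.
print(URLdecode(over))
```

Whether a given server accepts a query whose = and & separators are themselves encoded is a separate question that this check does not answer.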

Added: the curlEscape approach.