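I have a small script written with RCurl that connects to a Polish-language corpus and asks for the frequency of a target word. However, this solution only works for words made of standard characters. If I query a word containing Polish letters (e.g. "ę", "ą"), it returns no match. The verbose log suggests that the script does not transmit these characters correctly in the URL.
My script: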
library(RCurl)

# slowo = word
wordCorpusChecker <- function(slowo, korpus = 2) {
  # this line helps me bypass the redirection page after querying a specific word
  curl <- getCurlHandle(cookiefile = "", verbose = TRUE,
                        followlocation = TRUE, encoding = "utf-8")
  # standard call for submitting an HTML form
  getForm("http://korpus.pl/poliqarp/poliqarp.php",
          query = slowo, corpus = as.character(korpus), showMatch = "1",
          showContext = "3", leftContext = "5", rightContext = "5",
          wideContext = "50", hitsPerPage = "10",
          .opts = curlOptions(
            verbose = TRUE,
            followlocation = TRUE,
            encoding = "utf-8"
          ),
          curl = curl)
  # test2 holds the HTML of the page with the information I'm interested in
  test1 <- getURL("http://korpus.pl/poliqarp/poliqarp.php", curl = curl)
  test2 <- getURL("http://korpus.pl/poliqarp/poliqarp.php", curl = curl)
  # scrape the frequency from the HTML: it sits between "Found <em>" and "</em> results"
  m <- regexpr("Found <em>", test2)
  a <- m[1] + attr(m, "match.length")
  b <- regexpr("</em> results<br />\n", test2)[1] - 1
  value <- substring(test2, a, b)
  return(value)
}
# if you try this, you get a sensible result: the frequency of "pies" (dog) in the Polish corpus
wordCorpusChecker("pies")
# if you try this, you get no match because of the special characters
wordCorpusChecker("kałuża")
# the log from `verbose`:
GET /poliqarp/poliqarp.php?query=ka%B3u%BFa&corpus=2&showMatch=1&showContext=3&leftContext=5&rightContext=5&wideContext=50&hitsPerPage=10
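I tried specifying the encoding option, but the manual says it refers to the result of the query. I also experimented with curlUnescape, without any success. Any advice would be appreciated.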
Answer 0 (score: 0)
One solution is to write the word with UTF escapes, for example:
> "ka\u0142u\u017Ca"
[1] "kałuża"
> wordCorpusChecker("ka\u0142u\u017Ca")
[1] "55"
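A likely reason the escapes help: R stores a string written with \u escapes as UTF-8, regardless of the session locale. The %B3 and %BF in the verbose log are the single-byte codes of "ł" and "ż" in ISO-8859-2 (and in Windows-1250), which suggests the word typed directly was sent in the native encoding instead. The sketch below uses base R only to compare the two byte sequences. percent_encode is a hypothetical helper written for this illustration, not an RCurl function, and the iconv call assumes the platform's iconv supports ISO-8859-2.

```r
# Build the word from Unicode escapes so the example does not depend on
# the encoding of this source file or of the console.
word <- "ka\u0142u\u017Ca"

# Hypothetical helper (not part of RCurl): percent-encode every byte that is
# not an unreserved ASCII character, working on the raw bytes of the string.
percent_encode <- function(x) {
  codes <- as.integer(charToRaw(x))
  unreserved <- as.integer(charToRaw(
    paste0(c(LETTERS, letters, 0:9, "-", ".", "_", "~"), collapse = "")
  ))
  out <- vapply(codes, function(b) {
    if (b %in% unreserved) rawToChar(as.raw(b)) else sprintf("%%%02X", b)
  }, character(1))
  paste0(out, collapse = "")
}

# The same word in two encodings.
utf8_word   <- enc2utf8(word)
latin2_word <- iconv(utf8_word, from = "UTF-8", to = "ISO-8859-2")

print(percent_encode(utf8_word))    # the word as UTF-8 bytes
print(percent_encode(latin2_word))  # the bytes seen in the verbose log
```

If this is indeed the cause, converting the argument first, e.g. wordCorpusChecker(enc2utf8("kałuża")), or calling enc2utf8(slowo) inside the function, may avoid writing the escapes by hand. I have not tested this against the corpus server.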