Question

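I have a small script written with RCurl that connects to a corpus of the Polish language and asks for the frequency of a target word. However, this solution works only with standard characters. If I ask about a word containing Polish letters (e.g. "ę", "ą"), it returns no match. The output log suggests that the script does not transfer the characters correctly in the URL.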

My script:

    library(RCurl)

    # slowo = word
    wordCorpusChecker <- function(slowo, korpus = 2) {
      # This handle helps bypass the redirection page shown after querying a specific word
      curl <- getCurlHandle(cookiefile = "", verbose = TRUE,
                            followlocation = TRUE, encoding = "utf-8")
      # Standard call for submitting an HTML form
      getForm("http://korpus.pl/poliqarp/poliqarp.php",
              query = slowo, corpus = as.character(korpus), showMatch = "1",
              showContext = "3", leftContext = "5", rightContext = "5",
              wideContext = "50", hitsPerPage = "10",
              .opts = curlOptions(
                verbose = TRUE,
                followlocation = TRUE,
                encoding = "utf-8"
              ),
              curl = curl)
      # test2 holds the HTML of the page that contains the information I'm interested in
      test1 <- getURL("http://korpus.pl/poliqarp/poliqarp.php", curl = curl)
      test2 <- getURL("http://korpus.pl/poliqarp/poliqarp.php", curl = curl)
      # Scrape the frequency from the HTML: the text between "Found <em>" and "</em> results"
      a <- regexpr("Found <em>", test2)[1] +
        attr(regexpr("Found <em>", test2), "match.length")
      b <- regexpr("</em> results<br />\n", test2)[1] - 1
      value <- substring(test2, a, b)
      return(value)
    }

    # If you try this, you get a nice result for the frequency of "pies" (dog) in the Polish corpus
    wordCorpusChecker("pies")

    # If you try this, you get no match because of the special characters
    wordCorpusChecker("kałuża")

The log from `verbose`:

    GET /poliqarp/poliqarp.php?query=ka%B3u%BFa&corpus=2&showMatch=1&showContext=3&leftContext=5&rightContext=5&wideContext=50&hitsPerPage=10

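I tried specifying the `encoding` option, but according to the manual it refers to the result of the query. I also experimented with `curlUnescape`, but without any positive result. Please advise.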

Answer 1

One solution is to specify the word in UTF encoding, for example with Unicode escapes:

    > "ka\u0142u\u017Ca"
    [1] "kałuża"
    > wordCorpusChecker("ka\u0142u\u017Ca")
    [1] "55"
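
A possible explanation for why this works, sketched with base R only: the `%B3` and `%BF` in the verbose log are the single-byte ISO-8859-2 codes for "ł" and "ż", whereas a string written with `\u` escapes is stored as UTF-8 and is escaped as two bytes per letter. The helper `percent_encode` below is hypothetical (it is not part of RCurl) and only imitates URL escaping so that the two byte sequences can be compared. The sketch also assumes the local `iconv` supports ISO-8859-2.

```r
# The word from the question, written with Unicode escapes so that the
# encoding of the source file does not matter.
word <- "ka\u0142u\u017Ca"

# Hypothetical helper (not an RCurl function): percent-encode a raw vector,
# leaving the RFC 3986 unreserved characters untouched.
percent_encode <- function(bytes) {
  unreserved <- as.integer(charToRaw(paste0(
    "abcdefghijklmnopqrstuvwxyz",
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
    "0123456789-._~"
  )))
  parts <- vapply(seq_along(bytes), function(i) {
    b <- bytes[i]
    if (as.integer(b) %in% unreserved) {
      rawToChar(b)
    } else {
      sprintf("%%%02X", as.integer(b))
    }
  }, character(1))
  paste(parts, collapse = "")
}

# Bytes of the word as UTF-8 and as ISO-8859-2 (Latin-2).
utf8_bytes   <- charToRaw(enc2utf8(word))
latin2_bytes <- iconv(word, from = "UTF-8", to = "ISO-8859-2", toRaw = TRUE)[[1]]

utf8_query   <- percent_encode(utf8_bytes)
latin2_query <- percent_encode(latin2_bytes)

print(utf8_query)    # "ka%C5%82u%C5%BCa"
print(latin2_query)  # "ka%B3u%BFa", the form seen in the verbose log
```

If that reading is right, converting the argument with `slowo <- enc2utf8(slowo)` at the top of `wordCorpusChecker` might allow calling it with a plain "kałuża" as well. That is an untested assumption about how the string reaches `getForm`, not something confirmed in this thread.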

getForm - how to send special characters?
