Question

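This is a follow-up to a question I asked here a year ago: How can I extract info from xml page with R

The suggested solution worked for a long time. Unfortunately, once it was running smoothly I never gave it another thought. Now R throws an error at me and I have no idea how to proceed.

This is what I am trying to do: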

require(XML)
require(RCurl)

url <- "http://ws.parlament.ch/votes/councillors?affairNumberFilter=20130051&format=xml"
affairs_det <- getURL(url, .opts=c(user_agent("Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)"), 
                            verbose()), asNames=TRUE)  
#This worked, but not anymore
Error in function (type, msg, asError = TRUE)  : No URL set!
In addition: Warning message:
In mapCurlOptNames(names(.els), asNames = TRUE) :
Unrecognized CURL options: output, auth_token, options, fields, headers, method, url

affairs_det_parsed <- xmlTreeParse(substr(affairs_det,4,nchar(affairs_det)), encoding = "UTF-8")

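The problem is twofold. First, how should I download a file that appears to be XML, but turns out to be HTML when I download it with download.file(url, destfile="test.xml")? I believed that setting user_agent took care of that...?

Second, I do not understand the error.

Edit

I would like to access the information, for example the id, through code. This also worked fine before the mysterious error appeared.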

infofile <- xmlRoot(affairs_det_parsed)

#gets councillor ids
id <- getNodeSet(infofile, paste0("//councillors/councillor/id"))
id <- lapply(id, function(x) xmlSApply(x, xmlValue))
id <- sapply(id, "[[", 1)

Thanks!

Answer 1

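The original answer mixed RCurl and httr syntax, which is odd. The snippet above leaves out that httr is in use: user_agent() and verbose() are httr functions, not RCurl ones. httr has probably changed since then; it still works with its own functions, but it is not meant to be combined with RCurl.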

library(httr)
x = GET(url)

This retrieves the file.

stop_for_status(x)

This checks that no error occurred.

xml = content(x)

This gets the XML content. Alternatively, download the file to disk and parse it with the XML package:

library(XML)
tmp <- tempfile()            # named tmp to avoid masking base::t()
GET(url, write_disk(tmp))    # save the response body to disk
xml = xmlParse(tmp)
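
Once the document is parsed, the ids asked about in the question can be pulled out with a single XPath call. The following is only a sketch: the sample XML is made up, and only the councillors/councillor/id nesting is taken from the XPath in the question, so the real response may well look different.

```r
library(XML)

# Made-up sample standing in for the real response; only the
# councillors/councillor/id nesting comes from the question.
sample_xml <- "<councillors>
  <councillor><id>101</id></councillor>
  <councillor><id>202</id></councillor>
</councillors>"

doc <- xmlParse(sample_xml, asText = TRUE)

# Apply xmlValue to every node matched by the XPath expression.
ids <- xpathSApply(doc, "//councillors/councillor/id", xmlValue)
print(ids)
```

This replaces the getNodeSet / lapply / sapply chain from the question with one call; it assumes the object comes from xmlParse in the XML package.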

Answer 2

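Well, I almost managed to get the XML into R instead of the HTML. I think this will help.

Parsing the XML rather than the HTML will be more reliable (also bear in mind that your source is producing errors with the HTML), and the XML file is simple, so writing the XPath will be easier.

I started with command-line curl because I am more familiar with it. This command line fetches the response as XML: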

curl -H "Accept: application/xml" \
     -H "Content-Type: application/xml" \
     -X GET "http://ws.parlament.ch/votes/councillors?affairNumberFilter=20130051&format=xml"

I translated this into RCurl code that tests whether the URI exists and then loads the response into doc:

if(url.exists("http://ws.parlament.ch/votes/councillors?affairNumberFilter=20130051&format=xml")) 
{
    curl = getCurlHandle()
    curlSetOpt( .opts = list(httpheader = c(Accept ="application/xml", "Content-Type"="application/xml"), verbose = TRUE),curl = curl)
    doc = getURL("http://ws.parlament.ch/votes/councillors?affairNumberFilter=20130051&format=xml", curl = curl)
}

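However, xmlParse fails with Error: XML content does not seem to be XML. A visual inspection of the downloaded file shows leading junk characters, specifically "ï»¿. I think this needs to be resolved before any further processing.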

This is interesting, because command-line curl does not show those stray leading characters.
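
Those three characters are what a UTF-8 byte order mark (bytes EF BB BF) looks like when displayed as Latin-1, which would also explain the substr(affairs_det, 4, ...) in the question. One possible way to drop it before parsing, as a minimal sketch in base R: strip_bom is a hypothetical helper name, the sample string is made up, and the sketch assumes the downloaded string still holds the original bytes.

```r
# Hypothetical helper: remove a leading UTF-8 byte order mark (EF BB BF), if present.
strip_bom <- function(txt) {
  bytes <- charToRaw(txt)
  bom <- as.raw(c(0xEF, 0xBB, 0xBF))
  if (length(bytes) >= 3 && identical(bytes[1:3], bom)) {
    bytes <- bytes[-(1:3)]
  }
  out <- rawToChar(bytes)
  Encoding(out) <- "UTF-8"   # mark the remaining text as UTF-8
  out
}

# Made-up stand-in for the downloaded text: a BOM followed by a tiny XML document.
sample_txt <- rawToChar(c(as.raw(c(0xEF, 0xBB, 0xBF)),
                          charToRaw("<a><id>1</id></a>")))
clean <- strip_bom(sample_txt)
print(clean)
```

The cleaned string could then be handed to xmlParse(clean, asText = TRUE). Unlike a fixed substr offset, the check leaves input without a BOM unchanged.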

Maybe someone with more experience can take this further.

Follow-up: how to download XML when it somehow arrives as HTML
