Question

I am trying to extract some information from a website:

library(rvest)
library(XML)
url <- "http://wiadomosci.onet.pl/wybory-prezydenckie/xcnpc"
html <- html(url)

nodes <- html_nodes(html, ".listItemSolr") 
nodes

I get a "list" of 30 chunks of HTML code. From each element of that list I want to extract the last href attribute, so for the 30th element it would be

<a href="http://wiadomosci.onet.pl/kraj/w-sobote-prezentacja-hasla-i-programu-wyborczego-komorowskiego/tvgcq" title="W sobotę prezentacja hasła i programu wyborczego Komorowskiego">

So I want to get the string

"http://wiadomosci.onet.pl/kraj/w-sobote-prezentacja-hasla-i-programu-wyborczego-komorowskiego/tvgcq"

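The problem is that html_attr(nodes, "href") does not work (I get a vector of NAs). So I thought about regular expressions, but the problem is that nodes is not a list of characters.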

class(nodes)
[1] "XMLNodeSet"

I tried

xmlToList(nodes)

but that does not work either.

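So my question is: how can I extract this URL using a function designed for HTML? Or, if that is not possible, how can I convert an XMLNodeSet to a list of characters?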

Answer 1

Try searching inside the nodes' children:

nodes <- html_nodes(html, ".listItemSolr") 

sapply(html_children(nodes), function(x){
  html_attr( x$a, "href")
})
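
For what it's worth, here is a small self-contained sketch of why html_attr(nodes, "href") presumably returns NA: the elements matched by ".listItemSolr" are containers, and the href sits on an <a> child. The markup and the example.com URLs below are made up for illustration (an assumption, not the real Onet page), and the sketch uses read_html(), which replaced html() in newer rvest versions.

```r
library(rvest)

# Made-up markup standing in for the real page (assumption).
page <- read_html('
  <div class="listItemSolr">
    <a href="http://example.com/one" title="One">One</a>
  </div>
  <div class="listItemSolr">
    <a href="http://example.com/two" title="Two">Two</a>
  </div>')

nodes <- html_nodes(page, ".listItemSolr")

# The matched <div> elements have a class but no href, so every value is NA.
on_divs <- html_attr(nodes, "href")

# The href lives on the <a> child of each <div>.
on_links <- html_attr(html_nodes(nodes, "a"), "href")

print(on_divs)
print(on_links)
```

With current rvest the nodes are xml2 objects rather than an XMLNodeSet, so the x$a access shown above may not carry over; selecting the <a> children with html_nodes() is the equivalent step.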

Update

Hadley suggested using an elegant pipe instead:

html %>%  
  html_nodes(".listItemSolr") %>% 
  html_nodes(xpath = "./a") %>% 
  html_attr("href")
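
The question asks for the last href in each element, while the pipe above returns every direct <a> child. A possible variant, sketched here on made-up markup (the snippet and URLs are assumptions, not the real page), restricts the XPath step to the last <a> child with a[last()]:

```r
library(rvest)

# Made-up markup: the first item has two links, the second has one (assumption).
page <- read_html('
  <div class="listItemSolr">
    <a href="http://example.com/one-image"><img src="one.png"></a>
    <a href="http://example.com/one" title="One">One</a>
  </div>
  <div class="listItemSolr">
    <a href="http://example.com/two" title="Two">Two</a>
  </div>')

last_links <- page %>%
  html_nodes(".listItemSolr") %>%
  html_nodes(xpath = "./a[last()]") %>%  # last <a> child of each item
  html_attr("href")

print(last_links)
```

This only looks at direct children of each item; if the links on the real page are nested deeper, the XPath would need adjusting.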

Answer 2

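The XML package function getHTMLLinks() can do almost all of the work for us; we only need to write the XPath query. Here we query all node attributes for ones containing "listItemSolr", then step up to the parent node and query its href.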

getHTMLLinks(url, xpQuery = "//@*[contains(., 'listItemSolr')]/../a/@href")

In xpQuery we are doing the following:

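//@*[contains(., 'listItemSolr')] selects every attribute, on any node, whose value contains listItemSolr
/.. selects the parent node, that is, the element carrying that attribute
/a/@href gets the href of each <a> child of that element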
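
The three steps can be tried one at a time on a small inline document. This is only a sketch: the markup and URLs are made up (an assumption, not the real page), and it parses a string with htmlParse(asText = TRUE) instead of fetching the URL.

```r
library(XML)

# Made-up markup standing in for the real page (assumption).
html_text <- '
<html><body>
  <div class="listItemSolr">
    <a href="http://example.com/one" title="One">One</a>
  </div>
  <div class="listItemSolr">
    <a href="http://example.com/two" title="Two">Two</a>
  </div>
</body></html>'

doc <- htmlParse(html_text, asText = TRUE)

# Step 1: attributes whose value contains "listItemSolr".
attrs <- getNodeSet(doc, "//@*[contains(., 'listItemSolr')]")

# Step 2: the elements that own those attributes.
parents <- getNodeSet(doc, "//@*[contains(., 'listItemSolr')]/..")
parent_names <- sapply(parents, xmlName)

# Step 3: the href of each <a> child, via getHTMLLinks().
links <- getHTMLLinks(doc, xpQuery = "//@*[contains(., 'listItemSolr')]/../a/@href")

print(length(attrs))
print(parent_names)
print(links)
```

Note that contains() matches any attribute value, not only class, so a stricter query such as //*[contains(@class, 'listItemSolr')]/a/@href may be safer if other attributes on the page could contain the same text.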

Extract href attribute or convert nodes to a list of characters
