How do I use rvest on a page accessed with RSelenium?

Time: 2017-09-03 00:44:09

Tags: r web-scraping html-parsing rvest rselenium

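I am trying to scrape a web page that is built with angular.js. My understanding is that the only option in R is to load the page with RSelenium first and then parse the content. However, I find rvest more intuitive than RSelenium for parsing content, so I would like to use RSelenium as little as possible and switch to rvest as soon as I can.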

So far I have worked out that I probably need RSelenium at least to connect, and htmlTreeParse to download the HTML. Suppose this is part of my output:

structure(list(name = "div", attributes = structure(c("im_dialog_date", 
"dialogMessage.dateText"), .Names = c("class", "ng-bind")), children = structure(list(
    text = structure(list(name = "text", attributes = NULL, children = NULL, 
        namespace = NULL, namespaceDefinitions = NULL, value = "6:52 PM"), .Names = c("name", 
    "attributes", "children", "namespace", "namespaceDefinitions", 
    "value"), class = c("XMLTextNode", "XMLNode", "RXMLAbstractNode", 
    "XMLAbstractNode", "oldClass"))), .Names = "text"), namespace = NULL, 
    namespaceDefinitions = NULL), .Names = c("name", "attributes", 
"children", "namespace", "namespaceDefinitions"), class = c("XMLNode", 
"RXMLAbstractNode", "XMLAbstractNode", "oldClass"))

How can I pass this to rvest::read_html()?

1 Answer:

Answer 0: (score: 2)

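If you look at the class of the object, it is XMLNode, a class defined by the XML package. That package defines a toString method for the class (but, oddly, not an as.character method), which lets you convert the node to a plain string, which in turn can be read by xml2::read_html: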

library(rvest)
#> Loading required package: xml2

node <- structure(list(name = "div", attributes = structure(c("im_dialog_date", 
"dialogMessage.dateText"), .Names = c("class", "ng-bind")), children = structure(list(
    text = structure(list(name = "text", attributes = NULL, children = NULL, 
        namespace = NULL, namespaceDefinitions = NULL, value = "6:52 PM"), .Names = c("name", 
    "attributes", "children", "namespace", "namespaceDefinitions", 
    "value"), class = c("XMLTextNode", "XMLNode", "RXMLAbstractNode", 
    "XMLAbstractNode", "oldClass"))), .Names = "text"), namespace = NULL, 
    namespaceDefinitions = NULL), .Names = c("name", "attributes", 
"children", "namespace", "namespaceDefinitions"), class = c("XMLNode", 
"RXMLAbstractNode", "XMLAbstractNode", "oldClass"))

node %>% XML::toString.XMLNode() %>% read_html()
#> {xml_document}
#> <html>
#> [1] <body><div class="im_dialog_date" ng-bind="dialogMessage.dateText">6 ...
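Once the node is an xml2 document, the usual rvest selectors apply. The following is a minimal sketch, not part of the original answer: it rebuilds a similar node with XML::xmlNode() instead of the dput output (an assumption made only to keep the example short) and wraps the conversion in a hypothetical helper named xmlnode_to_xml2.

```r
library(rvest)  # provides read_html(), html_node(), html_text() and the %>% pipe

# Hypothetical helper (not part of XML or rvest): serialise an XMLNode
# to a string, then re-parse that string with xml2 so rvest can use it.
xmlnode_to_xml2 <- function(node) {
  node %>% XML::toString.XMLNode() %>% read_html()
}

# Stand-in for the node from the question, built with XML::xmlNode().
node <- XML::xmlNode(
  "div", "6:52 PM",
  attrs = c("class" = "im_dialog_date", "ng-bind" = "dialogMessage.dateText")
)

# Select the div by its CSS class in the re-parsed document.
div <- node %>% xmlnode_to_xml2() %>% html_node("div.im_dialog_date")

date_text <- html_text(div, trim = TRUE)  # text content of the div
binding   <- html_attr(div, "ng-bind")    # value of the ng-bind attribute
```

Going through a string costs one extra serialise-and-parse step, but it avoids having to mix the XML and xml2 object models in later code.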

That said, I usually just use the {{1}} method of RSelenium::remoteDriver to grab all the HTML, which is then easily parsed with rvest.
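The method name is elided in the sentence above. The sketch below assumes it is getPageSource, which is a method of RSelenium's remoteDriver class that returns the page HTML as a one-element list. The helper names fetch_page_source and parse_dialog_dates, and the CSS selector, are hypothetical and chosen only for illustration. The driver is passed in as an argument, so the parsing step can be tried without a running Selenium server.

```r
library(rvest)  # provides read_html(), html_nodes(), html_text() and the %>% pipe

# Hypothetical helper: navigate an already opened RSelenium remoteDriver
# to `url` and return the rendered HTML as a single string.
# Assumption: getPageSource() is the method the answer refers to.
fetch_page_source <- function(remDr, url) {
  remDr$navigate(url)
  remDr$getPageSource()[[1]]  # getPageSource() returns a list; take the string
}

# Hypothetical helper: parse a page-source string with rvest and pull
# the text of every div carrying the im_dialog_date class.
parse_dialog_dates <- function(page_source) {
  page_source %>%
    read_html() %>%
    html_nodes("div.im_dialog_date") %>%
    html_text()
}

# Offline check against a static string shaped like the question's node.
sample_source <- paste0(
  "<html><body>",
  "<div class=\"im_dialog_date\" ng-bind=\"dialogMessage.dateText\">6:52 PM</div>",
  "</body></html>"
)
dates <- parse_dialog_dates(sample_source)
```

With a live session, the two helpers would be chained as parse_dialog_dates(fetch_page_source(remDr, url)), where remDr is a remoteDriver that has already been opened against a running Selenium server.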