rvest does not detect nodes with html_nodes

Asked: 2017-12-02 19:26:31

Tags: css r rvest rselenium

I don't quite understand why selectors fail on certain websites when I use rvest.

Example:

library(rvest)

# Note: despite its name, `url` holds the parsed page, not the URL string
url <- read_html("http://www.cbc.ca/news/politics")

headlines <- url %>%
  html_nodes(".headline") %>%
  html_text()

Another example, this time with RSelenium:

library(RSelenium)

rD <- rsDriver(verbose = FALSE)
rD
remDr <- rD$client

url <- "http://www.cbc.ca/news/politics"
remDr$navigate(url)

remDr$getTitle()
remDr$getCurrentUrl()

webElem <- remDr$findElement(using = "class", value = 'headline')

webElem$getElementAttribute("class")

remDr$close()
rD$server$stop()

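This should be simple. When I inspect the page structure, the headlines all sit under the class headline. Beyond that, there are the classes card-content and card-content-top, but no combination of CSS selectors and XPath seems to work.

1 Answer:

Answer 0: (score: 1)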

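CSS selectors may fail in rvest because the selectr package, which rvest relies on for CSS selectors, has some known issues (at least on Debian). For details, see: https://github.com/sjp/selectr/issues/7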
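One way to check whether the CSS selector translation is at fault is to run a CSS selector and a hand-written XPath against a small inline document and compare the results. This is only a sketch: the HTML fragment and the names css_hits and xpath_hits are made up for illustration and are not taken from the CBC page.

```r
library(rvest)

# Hypothetical HTML fragment, invented for this check
doc <- read_html('<div><h3 class="headline big">First</h3><h3 class="headline">Second</h3><p class="other">Not a headline</p></div>')

# CSS selector: rvest hands this to selectr, which converts it to XPath
css_hits <- html_text(html_nodes(doc, ".headline"))

# Hand-written XPath for the same class: no CSS translation involved
xpath_hits <- html_text(
  html_nodes(doc, xpath = '//*[contains(concat(" ", @class, " "), " headline ")]')
)

print(css_hits)
print(xpath_hits)
```

If the CSS call errors or returns something different from the XPath call, the selector translation is the likely culprit. If both work here but return nothing on the real page, the class is probably absent from the HTML that read_html downloads.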

Using SelectorGadget and the Chrome Developer Tools, I found the following XPath, which identifies the headlines on the page. For more on how to find the right XPath, see: https://medium.com/@peterjgensler/functions-with-r-and-rvest-a-laymens-guide-acda42325a77

library('rvest')
library('magrittr')
url <- read_html("http://www.cbc.ca/news/politics")

headlines <- url %>%
  html_nodes(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "pinnableHeadline", " " ))]') %>%
  html_text()

headlines[1]
"On Trudeau's 2nd trip to China, time may be ripe to advance free trade"
headlines[2]
"Liberals want to be global leader on open government, but face complaints at home"
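The concat(" ", @class, " ") padding in the XPath above is there so that only whole class names match. A minimal sketch of the difference, using an invented HTML fragment (the class pinnableHeadlineExtra and the names loose and strict are hypothetical):

```r
library(rvest)

# Hypothetical fragment: the second item only shares a prefix with the target class
doc <- read_html('<ul><li class="card pinnableHeadline">A</li><li class="pinnableHeadlineExtra">B</li></ul>')

# Plain contains() does substring matching, so it also picks up the second item
loose <- html_text(html_nodes(doc, xpath = '//*[contains(@class, "pinnableHeadline")]'))

# Padding both sides with spaces restricts the match to whole class tokens
strict <- html_text(
  html_nodes(doc, xpath = '//*[contains(concat(" ", @class, " "), " pinnableHeadline ")]')
)

print(loose)
print(strict)
```

The padded form is what SelectorGadget generates, and it behaves like the CSS selector .pinnableHeadline without going through the CSS-to-XPath translation.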