Getting {xml_nodeset(0)} when using html_nodes from the rvest package in R

Time: 2018-08-11 15:48:18

Tags: r web-scraping rvest

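I'm trying to scrape the headlines of some news websites using the html_nodes function and SelectorGadget, but for some sites it doesn't work and the result is "{xml_nodeset(0)}". For example, the following code gives that result: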

library(rvest)

url_cnn = 'https://edition.cnn.com/'
webpage_cnn = read_html(url_cnn)
headlines_html_cnn = html_nodes(webpage_cnn,'.cd__headline-text')
headlines_html_cnn

I got ".cd__headline-text" using SelectorGadget.

Other websites work, for example:

url_cnbc = 'https://www.cnbc.com/world/?region=world'
webpage_cnbc = read_html(url_cnbc)
headlines_html_cnbc = html_nodes(webpage_cnbc,'.headline')
headlines_html_cnbc

gives the full set of headlines. Why do some websites return the "{xml_nodeset(0)}" result?

Any help would be much appreciated.

1 answer:

Answer 0 (score: 2)

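Please, please stop using SelectorGadget. I know Hadley swears by it, but he's 100% wrong. What you see with SelectorGadget is what has been created in the DOM after JavaScript has executed and other resources have been loaded asynchronously. Please use "View Source" instead. That is what you get when you use read_html().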
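To make the difference concrete, here is a minimal offline sketch. The HTML string is a made-up stand-in (not CNN's actual markup): one headline is in the raw source, the other would only be created by a script in a browser. read_html() parses the source but never runs the script, so the second selector should come back empty.

```r
library(rvest)

# Hypothetical page source: the ".cd__headline-text" element does not exist
# in the markup; a browser would create it by running the <script>.
static_source <- '
<html><body>
  <div id="app"></div>
  <h3 class="headline">Server-rendered headline</h3>
  <script>
    var el = document.createElement("span");
    el.className = "cd__headline-text";
    el.textContent = "Script-rendered headline";
    document.getElementById("app").appendChild(el);
  </script>
</body></html>'

pg <- read_html(static_source)

# Present in the raw markup, so it is found
static_nodes <- html_nodes(pg, ".headline")

# Only exists after the script runs in a browser, so nothing is found
dynamic_nodes <- html_nodes(pg, ".cd__headline-text")

length(static_nodes)   # 1
length(dynamic_nodes)  # 0
```

An empty node set for a selector that SelectorGadget happily highlights is usually a sign that the element is built client-side.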

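Having said that, I'm impressed that CNN is as generous as it is (you are allowed to scrape this page), and the content is definitely on that page, just not rendered (which is probably even better).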

Now, this is JavaScript, not JSON, so we need a little help from the V8 package:

library(rvest)
library(V8)

ctx <- v8()

# get the page source
pg <- read_html("https://edition.cnn.com/")

# find the node with the data in a <script> tag
html_node(pg, xpath=".//script[contains(., 'var CNN = CNN || {};CNN.isWebview')]") %>% 
  html_text() %>%  # get the plaintext
  ctx$eval() # send it to V8 to execute it

cnn <- ctx$get("CNN") # get the data ^^ just created
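If you want to try the eval-then-get round trip without hitting the network, something like the following should do. The script text here is a small invented stand-in for what sits in CNN's script tag (the field names mirror the ones used in this answer, the values are made up). It is not valid JSON because of the var statement and the unquoted keys, which is why a JavaScript engine is used rather than a JSON parser.

```r
library(V8)

ctx <- v8()

# Hypothetical stand-in for the contents of the <script> tag
js_source <- '
var CNN = CNN || {};
CNN.contentModel = {
  siblings: {
    articleList: [
      { uri: "/a/index.html", headline: "First headline" },
      { uri: "/b/index.html", headline: "Second headline" }
    ]
  }
};'

ctx$eval(js_source)    # run the script inside the V8 context
toy <- ctx$get("CNN")  # copy the JS object back into R (converted via JSON)

# An array of objects with the same keys comes back as a data frame
articles <- toy[["contentModel"]][["siblings"]][["articleList"]]
articles$headline
```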

After poking around the cnn object:

str(cnn[["contentModel"]][["siblings"]][["articleList"]], 1)
## 'data.frame': 55 obs. of  7 variables:
##  $ uri        : chr  "/2018/11/16/politics/cia-assessment-khashoggi-assassination-saudi-arabia/index.html" "/2018/11/16/politics/hunt-crown-prince-saudi-un-resolution/index.html" "/2018/11/15/politics/us-khashoggi-sanctions/index.html" "/2018/11/15/middleeast/jamal-khashoggi-saudi-prosecutor-death-penalty-intl/index.html" ...
##  $ headline   : chr  "<strong>CIA determines Saudi Crown Prince personally ordered journalist's death, senior US official says</strong>" "Saudi crown prince's 'fit' over UN resolution" "US issues sanctions on 17 Saudis over Khashoggi murder" "Saudi prosecutor seeks death penalty for Khashoggi killers" ...
##  $ thumbnail  : chr  "//cdn.cnn.com/cnnnext/dam/assets/181025083025-prince-mohammed-bin-salman-small-11.jpg" "//cdn.cnn.com/cnnnext/dam/assets/181025083025-prince-mohammed-bin-salman-small-11.jpg" "//cdn.cnn.com/cnnnext/dam/assets/181025171830-jamal-khashoggi-small-11.jpg" "//cdn.cnn.com/cnnnext/dam/assets/181025171830-jamal-khashoggi-small-11.jpg" ...
##  $ duration   : chr  "" "" "" "" ...
##  $ description: chr  "The CIA has determined that Saudi Crown Prince Mohammed bin Salman personally ordered the killing of journalist"| __truncated__ "Multiple sources tell CNN that a much-anticipated United Nations Security Council resolution calling for a cess"| __truncated__ "The Trump administration on Thursday imposed penalties on 17 individuals over their alleged roles in the <a hre"| __truncated__ "Saudi prosecutors said Thursday they would seek the death penalty for five people allegedly involved in the mur"| __truncated__ ...
##  $ layout     : chr  "" "" "" "" ...
##  $ iconType   : chr  NA NA NA NA ...
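
Note that some headline values carry inline markup (the first one is wrapped in <strong>) and the uri values are site-relative. A rough sketch of tidying that up follows; it uses a two-row made-up data frame in place of the real articleList, and strip_html is a helper name invented here, not part of rvest.

```r
library(rvest)

# Hypothetical stand-in for cnn[["contentModel"]][["siblings"]][["articleList"]]
articles <- data.frame(
  uri = c("/example/a/index.html", "/example/b/index.html"),
  headline = c("<strong>Example headline one</strong>", "Example headline two"),
  stringsAsFactors = FALSE
)

# Parse each fragment as HTML and keep only its text content
strip_html <- function(x) {
  vapply(
    x,
    function(fragment) html_text(read_html(paste0("<body>", fragment, "</body>"))),
    character(1),
    USE.NAMES = FALSE
  )
}

clean <- data.frame(
  url = paste0("https://edition.cnn.com", articles$uri),
  headline = strip_html(articles$headline),
  stringsAsFactors = FALSE
)

clean
```

Parsing the fragment rather than stripping tags with a regular expression also takes care of HTML entities, at the cost of one parse per headline.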