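I am trying to scrape the headlines of some news sites using the html_nodes() function and SelectorGadget, but for some sites the result is wrong and comes back as "{xml_nodeset(0)}". For example, the following code gives that result: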
url_cnn = 'https://edition.cnn.com/'
webpage_cnn = read_html(url_cnn)
headlines_html_cnn = html_nodes(webpage_cnn,'.cd__headline-text')
headlines_html_cnn
I obtained the ".cd__headline-text" selector using SelectorGadget.
Other sites work, for example:
url_cnbc = 'https://www.cnbc.com/world/?region=world'
webpage_cnbc = read_html(url_cnbc)
headlines_html_cnbc = html_nodes(webpage_cnbc,'.headline')
headlines_html_cnbc
gives the full set of headlines. Why do some sites return "{xml_nodeset(0)}"?
Any help is greatly appreciated.
Answer 0 (score: 2)
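Please, please, please stop using SelectorGadget. I know Hadley swears by it, but he is 100% wrong. What you see with SelectorGadget is what exists in the DOM after JavaScript has executed and other resources have been loaded asynchronously. Please use "View Source" instead. That is what you get when you use read_html().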
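To see the difference between page source and rendered DOM without touching a real site, here is a minimal sketch. The inline HTML, the class names "headline-js" and "headline-static", and the variable names are all made up for illustration; only read_html(), html_nodes() and html_text() come from rvest.

```r
library(rvest)

# Hypothetical stand-in for a news page: one headline is in the source,
# the other would only be created by JavaScript in a browser.
page_source <- '<html><body>
<h3 class="headline-static">Static headline</h3>
<script>
  var h = document.createElement("h3");
  h.className = "headline-js";
  h.textContent = "Added by JavaScript";
  document.body.appendChild(h);
</script>
</body></html>'

# read_html() parses the source only; the script is never executed
pg_demo <- read_html(page_source)

js_nodes     <- html_nodes(pg_demo, ".headline-js")      # created at runtime only
static_nodes <- html_nodes(pg_demo, ".headline-static")  # present in the source

length(js_nodes)
html_text(static_nodes)
```

A browser (and therefore SelectorGadget) would show both headlines, while the parsed source contains only the static one, which is the same situation as the empty node set in the question.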
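That said, I am impressed that CNN is as generous as it is (you are allowed to scrape this page), and the content is definitely on the page, just not rendered, which is probably even better.
Now, the data sits in a <script> tag as JavaScript, not JSON, so we need some help from the V8 package: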
library(rvest)
library(V8)
ctx <- v8()
# get the page source
pg <- read_html("https://edition.cnn.com/")
# find the node with the data in a <script> tag
html_node(pg, xpath=".//script[contains(., 'var CNN = CNN || {};CNN.isWebview')]") %>%
  html_text() %>% # get the plaintext
  ctx$eval() # send it to V8 to execute it
cnn <- ctx$get("CNN") # get the data ^^ just created
After exploring the cnn object:
str(cnn[["contentModel"]][["siblings"]][["articleList"]], 1)
## 'data.frame': 55 obs. of 7 variables:
## $ uri : chr "/2018/11/16/politics/cia-assessment-khashoggi-assassination-saudi-arabia/index.html" "/2018/11/16/politics/hunt-crown-prince-saudi-un-resolution/index.html" "/2018/11/15/politics/us-khashoggi-sanctions/index.html" "/2018/11/15/middleeast/jamal-khashoggi-saudi-prosecutor-death-penalty-intl/index.html" ...
## $ headline : chr "<strong>CIA determines Saudi Crown Prince personally ordered journalist's death, senior US official says</strong>" "Saudi crown prince's 'fit' over UN resolution" "US issues sanctions on 17 Saudis over Khashoggi murder" "Saudi prosecutor seeks death penalty for Khashoggi killers" ...
## $ thumbnail : chr "//cdn.cnn.com/cnnnext/dam/assets/181025083025-prince-mohammed-bin-salman-small-11.jpg" "//cdn.cnn.com/cnnnext/dam/assets/181025083025-prince-mohammed-bin-salman-small-11.jpg" "//cdn.cnn.com/cnnnext/dam/assets/181025171830-jamal-khashoggi-small-11.jpg" "//cdn.cnn.com/cnnnext/dam/assets/181025171830-jamal-khashoggi-small-11.jpg" ...
## $ duration : chr "" "" "" "" ...
## $ description: chr "The CIA has determined that Saudi Crown Prince Mohammed bin Salman personally ordered the killing of journalist"| __truncated__ "Multiple sources tell CNN that a much-anticipated United Nations Security Council resolution calling for a cess"| __truncated__ "The Trump administration on Thursday imposed penalties on 17 individuals over their alleged roles in the <a hre"| __truncated__ "Saudi prosecutors said Thursday they would seek the death penalty for five people allegedly involved in the mur"| __truncated__ ...
## $ layout : chr "" "" "" "" ...
## $ iconType : chr NA NA NA NA ...
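The str() output shows that some headlines still carry markup such as <strong> and that uri holds site-relative paths. A minimal follow-up sketch in base R, using two rows copied from the output above in place of the live cnn object; the helper strip_tags(), the column names headline_text and url, and the assumption that the paths resolve against https://edition.cnn.com are illustrative and not part of the original answer.

```r
# Two sample rows copied from the str() output; the real data frame has 55 rows
articles <- data.frame(
  uri = c(
    "/2018/11/16/politics/cia-assessment-khashoggi-assassination-saudi-arabia/index.html",
    "/2018/11/15/politics/us-khashoggi-sanctions/index.html"
  ),
  headline = c(
    "<strong>CIA determines Saudi Crown Prince personally ordered journalist's death, senior US official says</strong>",
    "US issues sanctions on 17 Saudis over Khashoggi murder"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop anything that looks like an HTML tag
strip_tags <- function(x) gsub("<[^>]+>", "", x)

articles$headline_text <- strip_tags(articles$headline)

# Assumption: uri values are relative to the site root
articles$url <- paste0("https://edition.cnn.com", articles$uri)

articles[, c("headline_text", "url")]
```

With the real object, the same steps would start from cnn[["contentModel"]][["siblings"]][["articleList"]] instead of the hand-built articles data frame. A regular expression is a rough way to remove tags; for messier markup, parsing each headline with read_html() and html_text() may be more robust.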