我正在尝试复制this tutorial on rvest here。但是,一开始我已经遇到了问题。这是我正在使用的代码
library(rvest)
#Specifying the url for desired website to be scrapped
url <- 'https://www.nytimes.com/section/politics'
#Reading the HTML code from the website - headlines
webpage <- read_html(url)
headline_data <- html_nodes(webpage,'.story-link a, .story-body a')
当我看着headline_data
返回时的结果
{xml_nodeset (0)}
但是在本教程中,它返回长度为48的列表
{xml_nodeset (48)}
有任何差异的原因吗?
答案 0 :(得分:1)
如注释中所述,没有要搜索的具有指定类的元素。
首先,根据当前标签,您可以使用以下标题:
library(rvest)
library(dplyr)
url <- 'https://www.nytimes.com/section/politics'
url %>%
read_html() %>%
html_nodes("h2.css-l2vidh a") %>%
html_text()
#[1] "Trump’s Secrecy Fight Escalates as Judge Rules for Congress in Early Test"
#[2] "A Would-Be Trump Aide’s Demands: A Jet on Call, a Future Cabinet Post and More"
#[3] "He’s One of the Biggest Backers of Trump’s Push to Protect American Steel. And He’s Canadian."
#[4] "Accountants Must Turn Over Trump’s Financial Records, Lower-Court Judge Rules"
并获取您可以做的那些标题的单独URL
url %>%
read_html() %>%
html_nodes("h2.css-l2vidh a") %>%
html_attr("href") %>%
paste0("https://www.nytimes.com", .)
#[1] "https://www.nytimes.com/2019/05/20/us/politics/mcgahn-trump-congress.html"
#[2] "https://www.nytimes.com/2019/05/20/us/politics/kris-kobach-trump.html"
#[3] "https://www.nytimes.com/2019/05/20/us/politics/hes-one-of-the-biggest-backers-of-trumps-push-to-protect-american-steel-and-hes-canadian.html"
#[4] "https://www.nytimes.com/2019/05/20/us/politics/trump-financial-records.html"