R: How do I open a list of links to scrape the front page of a news website?

Asked: 2020-03-13 11:57:00

Tags: r web web-scraping rvest

I'm trying to build a web scraper in R for articles published on the news site www.20min.ch. Their API is publicly accessible, so with rvest I can create a data frame containing title, URL, description, and timestamp. The next step would be to visit each link, build a list of the article texts, and combine that with my data frame. But I don't know how to visit the articles automatically. Ideally, I'd like to read_html link 1, extract the text with html nodes, then move on to link 2, and so on.

Here is what I have written so far:

library(rvest)
library(xml2)

site20min <- read_xml("https://api.20min.ch/rss/view/1")

site20min

url_list <- site20min %>% html_nodes('link') %>% html_text()

df20min <- data.frame(Title = character(),
                      Zeit = character(),
                      Lead = character(),
                      Text = character()
                      )

for (i in seq_along(url_list)) {
  myLink <- url_list[i]
  site20min <- read_html(myLink)

  titel20min <- site20min %>% html_nodes('h1 span') %>% html_text()
  zeit20min <- site20min %>% html_nodes('#story_content .clearfix span') %>% html_text()
  lead20min <- site20min %>% html_nodes('#story_content h3') %>% html_text()
  text20min <- site20min %>% html_nodes('.story_text') %>% html_text()

  df20min_a <- data.frame(Title = titel20min)
  df20min_b <- data.frame(Zeit = zeit20min)
  df20min_c <- data.frame(Lead = lead20min)
  df20min_d <- data.frame(Text = text20min)
}

What I need is for R to open each link and extract some information:

site20min_1 <- read_html("https://www.20min.ch/schweiz/news/story/-Es-liegen-auch-Junge-auf-der-Intensivstation--14630453")

titel20min_1 <- site20min_1 %>% html_nodes('h1 span') %>% html_text()
zeit20min_1 <- site20min_1 %>% html_nodes('#story_content .clearfix span') %>% html_text()
lead20min_1 <- site20min_1 %>% html_nodes('#story_content h3') %>% html_text()
text20min_1 <- site20min_1 %>% html_nodes('.story_text') %>% html_text()

Binding this into a data frame shouldn't be too big a problem. But at the moment some of my results come back empty.
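One likely reason some fields come back empty: when a CSS selector matches nothing on a page, html_nodes() returns an empty node set and html_text() yields character(0), which then produces a zero-row data frame. A small helper can normalize that to NA and collapse multi-node matches into one string. This is a sketch; the name scrape_field is made up, and the selectors are the ones from the question:

```r
library(rvest)

# Return the matched text as one string, or NA if the selector matched nothing
scrape_field <- function(page, selector) {
  out <- page %>% html_nodes(selector) %>% html_text()
  if (length(out) == 0) NA_character_ else paste(out, collapse = " ")
}

# Works the same on a live page or on inline HTML
page <- read_html("<div id='story_content'><h3>Lead text</h3></div>")
scrape_field(page, "#story_content h3")  # "Lead text"
scrape_field(page, ".story_text")        # NA
```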

Thanks for your help!

1 Answer:

Answer 0 (score: 0)

You're on the right track with building up a data frame. You can loop through each link and rbind it onto your existing data frame structure.

First, you can set up the vector of URLs to loop over. Based on your edit, here is that vector:

url_list <- c("http://www.20min.ch/ausland/news/story/14618481",
              "http://www.20min.ch/schweiz/news/story/18901454",
              "http://www.20min.ch/finance/news/story/21796077",
              "http://www.20min.ch/schweiz/news/story/25363072",
              "http://www.20min.ch/schweiz/news/story/19113494",
              "http://www.20min.ch/community/social_promo/story/20407354",
              "https://cp.20min.ch/de/stories/635-stressfrei-durch-den-verkehr-so-sieht-der-alltag-von-busfahrer-claudio-aus")

Next, you can set up a data frame structure that includes everything you want to capture.

# Set up the dataframe first
df20min <- data.frame(Title = character(),
                      Link = character(),
                      Lead = character(),
                      Zeit = character())

Finally, you can loop through each URL in the list and append the relevant information to the data frame.

# Go through a loop
for (i in seq_along(url_list)) {
  myLink <- url_list[i]
  site20min <- read_xml(myLink)

  # Extract the info
  titel20min <- site20min %>% html_nodes('title') %>% html_text()
  link20min <- site20min %>% html_nodes('link') %>% html_text()
  zeit20min <- site20min %>% html_nodes('pubDate') %>% html_text()
  lead20min <- site20min %>% html_nodes('description') %>% html_text()

  # Structure into dataframe
  df20min_a <- data.frame(Title = titel20min, Link = link20min, Lead = lead20min)
  df20min_b <- df20min_a[-(1:2), ]
  df20min_c <- data.frame(Zeit = zeit20min)

  # Insert into final dataframe
  df20min <- rbind(df20min, cbind(df20min_b, df20min_c))
}
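One refinement to the loop above: rbind-ing onto the growing data frame copies the whole frame on every pass, which gets slow as the feed grows. An alternative is to collect one small data frame per URL in a list and bind once at the end. A minimal sketch of the pattern, where parse_feed() is a placeholder standing in for the read_xml()/html_nodes() extraction above:

```r
# Placeholder for the per-URL extraction step shown above
parse_feed <- function(url) {
  data.frame(Link = url,
             Title = paste("title for", url),
             stringsAsFactors = FALSE)
}

url_list <- c("http://example.com/a", "http://example.com/b")

rows <- lapply(url_list, parse_feed)  # one data frame per URL
df20min <- do.call(rbind, rows)       # single bind at the end
nrow(df20min)  # 2
```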