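I'm trying to build a web scraper in R to collect articles published on the news site www.20min.ch. Their API is publicly accessible, so I can use rvest to create a data frame containing the title, URL, description, and timestamp. The next step is to visit each link, build a list of article texts, and combine it with my data frame. However, I don't know how to visit the articles automatically. Ideally, I would read_html link 1, copy the text using html_nodes, then move on to link 2, and so on.
This is what I have written so far: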
library(rvest)
library(xml2)
site20min <- read_xml("https://api.20min.ch/rss/view/1")
site20min
url_list <- site20min %>% html_nodes('link') %>% html_text()
df20min <- data.frame(Title = character(),
                      Zeit = character(),
                      Lead = character(),
                      Text = character()
)
for(i in 1:length(url_list)){
  myLink <- url_list[i]
  site20min <- read_html(myLink)
  titel20min <- site20min %>% html_nodes('h1 span') %>% html_text()
  zeit20min <- site20min %>% html_nodes('#story_content .clearfix span') %>% html_text()
  lead20min <- site20min %>% html_nodes('#story_content h3') %>% html_text()
  text20min <- site20min %>% html_nodes('.story_text') %>% html_text()
  df20min_a <- data.frame(Title = titel20min)
  df20min_b <- data.frame(Zeit = zeit20min)
  df20min_c <- data.frame(Lead = lead20min)
  df20min_d <- data.frame(Text = text20min)
}
What I need is for R to open each link and extract some information:
site20min_1 <- read_html("https://www.20min.ch/schweiz/news/story/-Es-liegen-auch-Junge-auf-der-Intensivstation--14630453")
titel20min_1 <- site20min_1 %>% html_nodes('h1 span') %>% html_text()
zeit20min_1 <- site20min_1 %>% html_nodes('#story_content .clearfix span') %>% html_text()
lead20min_1 <- site20min_1 %>% html_nodes('#story_content h3') %>% html_text()
text20min_1 <- site20min_1 %>% html_nodes('.story_text') %>% html_text()
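Binding this to the data frame should not be much of a problem. At the moment, though, some of my results come back empty.
Thank you for your help!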
Answer 0 (score: 0)
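You are on the right track with setting up a data frame. You can loop over each link and rbind its results to the existing data frame structure.
First, set up a vector of URLs to loop over. Based on your edit, here is such a vector: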
url_list <- c("http://www.20min.ch/ausland/news/story/14618481",
              "http://www.20min.ch/schweiz/news/story/18901454",
              "http://www.20min.ch/finance/news/story/21796077",
              "http://www.20min.ch/schweiz/news/story/25363072",
              "http://www.20min.ch/schweiz/news/story/19113494",
              "http://www.20min.ch/community/social_promo/story/20407354",
              "https://cp.20min.ch/de/stories/635-stressfrei-durch-den-verkehr-so-sieht-der-alltag-von-busfahrer-claudio-aus")
Next, set up a data frame structure that includes everything you want to collect.
# Set up the dataframe first
df20min <- data.frame(Title = character(),
                      Link = character(),
                      Lead = character(),
                      Zeit = character())
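To illustrate how rows get appended to such an empty structure, here is a minimal sketch with one made-up row. The values in `new_row` are hypothetical and only serve the example; the point is that the column names of the new row have to match those of the empty data frame, and that a missing field can be stored as `NA`.

```r
# Empty data frame with the same four character columns as above
df <- data.frame(Title = character(), Link = character(),
                 Lead = character(), Zeit = character(),
                 stringsAsFactors = FALSE)

# Hypothetical single article; NA marks a field that could not be extracted
new_row <- data.frame(Title = "Example title", Link = "https://example.org/a",
                      Lead = NA_character_, Zeit = "2020-03-20",
                      stringsAsFactors = FALSE)

# rbind() matches columns by name and appends the row
df <- rbind(df, new_row)
str(df)
```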
Finally, loop over each URL in the list and add the relevant information to the data frame.
# Go through a loop
for (i in seq_along(url_list)) {
  myLink <- url_list[i]
  # read_xml() expects XML such as an RSS feed; use read_html() for HTML pages
  site20min <- read_xml(myLink)
  # Extract the info (title, link, pubDate and description are RSS element names)
  titel20min <- site20min %>% html_nodes('title') %>% html_text()
  link20min <- site20min %>% html_nodes('link') %>% html_text()
  zeit20min <- site20min %>% html_nodes('pubDate') %>% html_text()
  lead20min <- site20min %>% html_nodes('description') %>% html_text()
  # Structure into dataframe
  df20min_a <- data.frame(Title = titel20min, Link = link20min, Lead = lead20min)
  # Drop the first two rows before adding the timestamps
  df20min_b <- df20min_a[-(1:2), ]
  df20min_c <- data.frame(Zeit = zeit20min)
  # Insert into final dataframe
  df20min <- rbind(df20min, cbind(df20min_b, df20min_c))
}
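Regarding the empty results mentioned in the question: when a selector matches nothing, `html_text()` returns a zero-length vector, and `data.frame()` cannot build a row from that. One possible way around it, sketched below, is a small helper that returns `NA` in that case. The helper names `extract_text` and `scrape_article` are made up for this example, the CSS selectors are the ones from the question (I have not verified them against the live site), and two inline HTML snippets stand in for `read_html(url)` so the example runs without network access.

```r
library(rvest)

# Return the text of all nodes matching `css`, collapsed into one string,
# or NA when nothing matches, so every article yields exactly one value.
extract_text <- function(page, css) {
  txt <- page %>% html_nodes(css) %>% html_text(trim = TRUE)
  if (length(txt) == 0) NA_character_ else paste(txt, collapse = " ")
}

# Build a one-row data frame for a single parsed article page.
# Selectors are taken from the question and are assumptions about the site.
scrape_article <- function(page) {
  data.frame(
    Title = extract_text(page, "h1 span"),
    Lead  = extract_text(page, "#story_content h3"),
    Text  = extract_text(page, ".story_text"),
    stringsAsFactors = FALSE
  )
}

# Two inline pages stand in for read_html(url); the second has no lead.
pages <- list(
  read_html('<html><body><h1><span>First title</span></h1><div id="story_content"><h3>First lead</h3><div class="story_text">Body one.</div></div></body></html>'),
  read_html('<html><body><h1><span>Second title</span></h1><div id="story_content"><div class="story_text">Body two.</div></div></body></html>')
)

# One row per page, bound together in a single step
articles <- do.call(rbind, lapply(pages, scrape_article))
print(articles)
```

For real use, `pages` would be replaced by something like `lapply(url_list, read_html)`. Collecting the rows in a list and binding once also avoids growing the data frame inside the loop.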