R: Scraping multiple URLs in rvest with a piped chain of commands

Date: 2016-03-28 10:46:04

Tags: r web-scraping rcurl rvest httr

I have a character vector containing multiple URLs, and I want to download the content of each one.

To avoid writing out hundreds of commands, I would like to automate this with an lapply loop.

However, my command returns an error. Is it possible to scrape from multiple URLs this way?

Current approach

Long approach: works, but I would like to automate it

urls <- c("https://en.wikipedia.org/wiki/Belarus",
          "https://en.wikipedia.org/wiki/Russia",
          "https://en.wikipedia.org/wiki/England")

library(rvest)
library(httr)  # required for user_agent()

uastring <- "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
session  <- html_session("https://en.wikipedia.org/wiki/Main_Page", user_agent(uastring))
session2 <- jump_to(session, "https://en.wikipedia.org/wiki/Belarus")
session3 <- jump_to(session, "https://en.wikipedia.org/wiki/Russia")
writeBin(session2$response$content, "test1.txt")
writeBin(session3$response$content, "test2.txt")
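
The writeBin() calls save the raw response bytes to disk, so a saved page can be parsed again later if needed. A small sketch of that, assuming the test1.txt file written above and that rvest is loaded:

page <- read_html("test1.txt")       # parse the saved bytes as HTML
html_text(html_node(page, "title"))  # e.g. pull out the page title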

Automated/looped approach: does not work.

urls <- c("https://en.wikipedia.org/wiki/Belarus",
          "https://en.wikipedia.org/wiki/Russia",
          "https://en.wikipedia.org/wiki/England")

library(rvest)
library(httr)  # required for user_agent()

uastring <- "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
session <- html_session("https://en.wikipedia.org/wiki/Main_Page", user_agent(uastring))
lapply(urls, . %>% jump_to(session))
Error: is.session(x) is not TRUE
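
The error seems to come from the functional sequence . %>% jump_to(session): it pipes each URL in as the first argument of jump_to(), which expects the session there instead. A minimal sketch of one possible fix, assuming the session and urls objects defined above, is to wrap the call in an anonymous function:

sessions <- lapply(urls, function(u) jump_to(session, u))  # keep the session as the first argument

Each element of sessions could then be written out with writeBin() as in the long approach.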

Summary

I would like to automate the two steps jump_to() and writeBin() shown in the code below:

session2 <- jump_to(session, "https://en.wikipedia.org/wiki/Belarus")
session3 <- jump_to(session, "https://en.wikipedia.org/wiki/Russia")
writeBin(session2$response$content, "test1.txt")
writeBin(session3$response$content, "test2.txt")

1 answer:

Answer 0 (score: 0):

You can do it like this:

urls <- c("https://en.wikipedia.org/wiki/Belarus",
          "https://en.wikipedia.org/wiki/Russia",
          "https://en.wikipedia.org/wiki/England")
require(httr)
require(rvest)
uastring <- "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
session <- html_session("https://en.wikipedia.org/wiki/Main_Page", user_agent(uastring))

# one output file name per URL, taken from the last path component, e.g. "Belarus.html"
outfile <- sprintf("%s.html", sub(".*/", "", urls))

# follow a link within the session and write the raw response bytes to a file
jump_and_write <- function(x, url, out_file){
  tmp <- jump_to(x, url)
  writeBin(tmp$response$content, out_file)
}

for(i in seq_along(urls)){
  jump_and_write(session, urls[i], outfile[i])
}
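
The explicit for loop could also be expressed as a single call. A sketch using Map(), assuming the session, urls, and outfile objects defined above:

invisible(Map(jump_and_write, list(session), urls, outfile))

Map() recycles the single session across the URL and file name vectors, calling jump_and_write() once per pair.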