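I am scraping https://www.oddsportal.com with RSelenium. I am trying to do this in parallel because I have 50,000 URLs and want to reduce the run time. The problem is that I get the same result for each cluster/node (3): each URL title is returned 3 times. I am using the code from the post "Run yaml file for parallel selenium test from R or python".
This is my attempt at parallel processing, but it does not work the way I want: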
library(RSelenium)
library(rvest)
library(magrittr)
library(foreach)
library(doParallel)

URLsPar <- c("http://www.bbc.com/", "http://www.cnn.com", "http://www.google.com",
             "http://www.yahoo.com", "http://www.twitter.com", "http://www.oddsportal.com")
appHTML <- c()

# Start one worker per core (minus one) and register the cluster with foreach
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)

# Open a PhantomJS session on every worker
clusterEvalQ(cl, {
  library(RSelenium)
  eCap <- list(
    phantomjs.page.settings.userAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20120101 Firefox/29.0",
    phantomjs.page.settings.loadImages = FALSE,
    phantomjs.phantom.cookiesEnabled = FALSE,
    phantomjs.phantom.javascriptEnabled = TRUE
  )
  remDr <- remoteDriver(browserName = "phantomjs", port = 8910, extraCapabilities = eCap)
  remDr$open()
})

myTitles <- c()
strt <- Sys.time()
# Navigate to each URL on a worker and return the page title
ws <- foreach(x = 1:6, .packages = c("rvest", "magrittr", "RSelenium")) %dopar% {
  remDr$navigate(URLsPar[x])
  remDr$getTitle()[[1]]
}
end <- Sys.time() - strt  # elapsed time (the original subtraction was reversed)

# Close the browser sessions, then stop the explicitly created cluster
clusterEvalQ(cl, {
  remDr$close()
})
stopCluster(cl)
Output:
[[1]]
[1] "Google"
[[2]]
[1] "Google"
[[3]]
[1] "Google"
[[4]]
[1] "Odds Portal: Odds Comparison, Sports Betting Odds"
[[5]]
[1] "Odds Portal: Odds Comparison, Sports Betting Odds"
[[6]]
[1] "Odds Portal: Odds Comparison, Sports Betting Odds"
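I also tried adding remDr$navigate(url) to the cluster setup, because I need to log in to the page before I can continue and fetch the data later inside the foreach loop. If I add remDr$navigate(url) to clusterEvalQ, all I get back is the following: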
[[1]]
NULL
[[2]]
NULL
[[3]]
NULL
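As a side note on the output above, clusterEvalQ returns a list holding the value of the last expression evaluated on each node, so a list of NULLs does not by itself show that the setup failed. The sketch below illustrates this with the base parallel package only. The three-worker cluster and the variable session_ready are hypothetical, and the bare NULL stands in for a call such as remDr$navigate(), on the assumption (not verified here) that this call returns NULL.

```r
library(parallel)

# Hypothetical three-worker cluster, only for illustration
cl <- makeCluster(3)

# The block runs on every worker; clusterEvalQ collects the value of
# the LAST expression from each worker into a list.
setup_result <- clusterEvalQ(cl, {
  session_ready <- TRUE  # side effect that stays on the worker
  NULL                   # stand-in for a call that returns NULL
})
print(setup_result)      # one NULL per worker

# The side effect is still there, even though the result was NULL
probe <- clusterEvalQ(cl, session_ready)
print(probe)

stopCluster(cl)
```

If that assumption holds, ending the clusterEvalQ block with an expression such as remDr$getCurrentUrl() would be one way to check where each worker actually ended up.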
Can anyone help me scrape in parallel so that I do not get the same data for every node and am also able to log in?
The code with navigation:
clusterEvalQ(cl, {
  library(RSelenium)
  eCap <- list(
    phantomjs.page.settings.userAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20120101 Firefox/29.0",
    phantomjs.page.settings.loadImages = FALSE,
    phantomjs.phantom.cookiesEnabled = FALSE,
    phantomjs.phantom.javascriptEnabled = TRUE
  )
  remDr <- remoteDriver(browserName = "phantomjs", port = 8910, extraCapabilities = eCap)
  remDr$open()
  remDr$navigate("https://www.oddsportal.com/login/")
})
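One direction that might be worth trying, sketched here under assumptions rather than as a confirmed fix, is to give each worker its own driver port and its own fixed slice of the URL list, so that no two workers drive the same browser. The port numbers (8910 and 8911), the variable worker_port, and the two-worker cluster are all hypothetical. The RSelenium call is left commented out because it would need a PhantomJS instance listening on each of those ports.

```r
library(parallel)

urls <- c("http://www.bbc.com/", "http://www.cnn.com", "http://www.google.com",
          "http://www.yahoo.com", "http://www.twitter.com", "http://www.oddsportal.com")

n_workers <- 2L                      # hypothetical worker count
cl <- makeCluster(n_workers)

# One (hypothetical) port per worker: 8910, 8911
ports <- 8910L + seq_len(n_workers) - 1L

# clusterApply sends ports[[i]] to worker i when there is one element
# per worker; each worker stores its port in its own global environment.
clusterApply(cl, ports, function(p) {
  assign("worker_port", p, envir = .GlobalEnv)
  NULL
})

# With a PhantomJS instance listening on each port, every worker could
# then open its own session (assumption, not run here):
# clusterEvalQ(cl, {
#   remDr <- RSelenium::remoteDriver(browserName = "phantomjs", port = worker_port)
#   remDr$open()
# })

# Static split: worker 1 gets URLs 1, 3, 5 and worker 2 gets URLs 2, 4, 6
chunks <- split(urls, rep_len(seq_len(n_workers), length(urls)))

# Placeholder for the scraping step: report which port handles which URLs
res <- clusterApply(cl, chunks, function(u) {
  list(port = worker_port, urls = u)
})

stopCluster(cl)
```

In a real run, the placeholder function would loop over its chunk with remDr$navigate() and remDr$getTitle(), and any login step would be performed once per worker after remDr$open().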