R - How to scrape links from multiple pages

Date: 2019-06-18 22:11:32

Tags: r rvest rselenium

I want to scrape all the links to flat and house listings from this olx.pl page: https://www.olx.pl/nieruchomosci/mieszkania/sprzedaz/?search%5Bfilter_float_price%3Afrom%5D=200000&search%5Bphotos%5D=1, and then continue with the following pages.

I know that rvest scrapes page content, but can it also be used to build a link_vector of the listing URLs?

If you want to use RSelenium, you also need to download the Chrome driver.
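
For reference, the Selenium standalone server is usually started from the command line before running the R code. A sketch, assuming the jar and chromedriver are in the current directory (file names and version are placeholders for whatever you downloaded):

```shell
# start the Selenium server with the chromedriver path registered,
# listening on port 4444 (the port used in the R code below)
java -Dwebdriver.chrome.driver=./chromedriver \
     -jar selenium-server-standalone-3.141.59.jar -port 4444
```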

Code using RSelenium and chromedriver:

# install required packages
install.packages("RSelenium")
install.packages("rvest")

# load libraries
library(RSelenium)
library(rvest)

## ---need to download chromedriver and the Selenium standalone jar
## ---selenium.jar needs to be running in the background

# connect to the remote driver
remDr <- remoteDriver(remoteServerAddr = "localhost",
                      port = 4444,
                      browserName = "chrome")

newUrl <- "https://www.olx.pl/nieruchomosci/mieszkania/sprzedaz/?search%5Bfilter_float_price%3Afrom%5D=200000&search%5Bphotos%5D=1"

remDr$open()
remDr$getStatus()
remDr$navigate(newUrl)

# updated link with &page= appended, so a page number can be pasted on
newUrl <- "https://www.olx.pl/nieruchomosci/mieszkania/sprzedaz/?search%5Bfilter_float_price%3Afrom%5D=200000&search%5Bphotos%5D=1&page="

link_vector <- c()

# let's take links from the first 20 pages
for (j in 1:20){
  pagenumberURL <- paste0(newUrl, j)
  remDr$navigate(pagenumberURL)
  # listing anchors carry the "detailsLink" class
  elems <- remDr$findElements(using = 'class name', "detailsLink")
  link_vectorTemp <- unlist(lapply(elems, function(x){ x$getElementAttribute("href") }))
  link_vector <- c(link_vector, link_vectorTemp)
}

# drop duplicate links
link_vector <- link_vector %>% unique()

remDr$close()

Question: is it possible to achieve this with rvest alone, without running RSelenium and the Selenium standalone server in the background?
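
An rvest-only version may work if the listing links are present in the server-rendered HTML (i.e. not injected by JavaScript). A minimal sketch, assuming the anchors still carry the same `detailsLink` class used in the RSelenium code above:

```r
library(rvest)

# paginated search URL (same as in the RSelenium version)
newUrl <- "https://www.olx.pl/nieruchomosci/mieszkania/sprzedaz/?search%5Bfilter_float_price%3Afrom%5D=200000&search%5Bphotos%5D=1&page="

link_vector <- c()

# take links from the first 20 pages
for (j in 1:20){
  page <- read_html(paste0(newUrl, j))
  links <- page %>%
    html_nodes("a.detailsLink") %>%   # CSS selector for the listing anchors
    html_attr("href")
  link_vector <- c(link_vector, links)
}

link_vector <- unique(link_vector)
```

If this returns no links, the page is likely rendered client-side and RSelenium (or another headless-browser approach) is needed after all.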

0 Answers:

There are no answers