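I want to scrape all the links to flat and house ads from this olx.pl page: https://www.olx.pl/nieruchomosci/mieszkania/sprzedaz/?search%5Bfilter_float_price%3Afrom%5D=200000&search%5Bphotos%5D=1, and then continue with the following pages.
I know rvest is meant for scraping page content, but can it also be used to build a link_vector of the pages' links?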
If you want to use RSelenium, you also need to download the Chrome driver.
Code using RSelenium and the Chrome driver:
# Install the required packages (only needed once)
install.packages("RSelenium")
install.packages("rvest")
# Load the libraries
library(RSelenium)
library(rvest)
# Prerequisites: download chromedriver and the Selenium standalone server;
# the Selenium .jar must be running in the background
# Connect to the remote driver
remDr <- remoteDriver(remoteServerAddr = "localhost",
                      port = 4444L,
                      browserName = "chrome")
newUrl<-"https://www.olx.pl/nieruchomosci/mieszkania/sprzedaz/?search%5Bfilter_float_price%3Afrom%5D=200000&search%5Bphotos%5D=1"
remDr$open()
remDr$getStatus()
remDr$navigate(newUrl)
# Same URL with "&page=" appended so the page number can be added
newUrl <- "https://www.olx.pl/nieruchomosci/mieszkania/sprzedaz/?search%5Bfilter_float_price%3Afrom%5D=200000&search%5Bphotos%5D=1&page="
link_vector <- c()
# Collect the links from the first 20 pages
for (j in 1:20) {
  pagenumberURL <- paste0(newUrl, j)
  remDr$navigate(pagenumberURL)
  # Listing links are located by the class name "detailsLink"
  elems <- remDr$findElements(using = "class name", "detailsLink")
  link_vectorTemp <- unlist(lapply(elems, function(x) x$getElementAttribute("href")))
  link_vector <- c(link_vector, link_vectorTemp)
}
# Drop duplicate links and close the browser session
link_vector <- unique(link_vector)
remDr$close()
Question: is it possible to achieve the same result with rvest alone, without running RSelenium and the Selenium standalone server in the background?
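For reference, here is a rough sketch of what an rvest-only version might look like. It assumes the listing links are already present in the static HTML the server returns and still carry the class detailsLink used in the Selenium code above; if the page builds its links with JavaScript, this approach would return nothing. The helper names extract_links and scrape_pages are made up for illustration, the demo markup is hand-written rather than real OLX markup, and html_elements() and minimal_html() assume rvest 1.0.0 or newer.

```r
library(rvest)

# Hypothetical helper: return the unique href values of all elements
# matching the CSS selector (default: <a class="detailsLink">).
extract_links <- function(page, css = "a.detailsLink") {
  page %>%
    html_elements(css) %>%
    html_attr("href") %>%
    unique()
}

# Hypothetical helper: read each results page and collect its links.
# base_url is expected to end in "&page=" as in the code above.
scrape_pages <- function(base_url, pages = 1:20) {
  links <- lapply(pages, function(j) {
    extract_links(read_html(paste0(base_url, j)))
  })
  unique(unlist(links))
}

# Offline demonstration on a tiny hand-written document (not real OLX markup).
demo <- minimal_html('
  <a class="detailsLink" href="https://example.com/ad/1">Ad 1</a>
  <a class="detailsLink" href="https://example.com/ad/1">Ad 1 again</a>
  <a class="other" href="https://example.com/help">Help</a>
')
demo_links <- extract_links(demo)
```

Against the live site, the call would be something like link_vector <- scrape_pages(newUrl), with newUrl being the "&page=" URL from the code above. Whether that returns any links depends on how the site serves its markup.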