RSelenium scraping returning odd results

Asked: 2019-03-19 15:04:51

Tags: r selenium web-scraping

I am trying to scrape the search pages of some news sources using RSelenium. Here's my code:

library(rvest)
library(RSelenium)

#open the browser
rD <- rsDriver(browser=c("chrome"), chromever="73.0.3683.68")
remDr <- rD[["client"]]

#create a blank space to put the links
urlslist_final = list()

##loop through the page number at the end until done with ~1000 / 20 = 50
for (i in 1:2) { ##change this to 50

  url = paste0('https://www.npr.org/search?query=kavanaugh&page=', i)

  #navigate to it
  remDr$navigate(url)

  #get the links
  webElems <- remDr$findElements(using = "css", "[href]")
  urlslist_final[[i]] = unlist(sapply(webElems, function(x) {x$getElementAttribute("href")}))

  #don't go too fast
  Sys.sleep(runif(1, 1, 5))

} #close the loop

remDr$close()
# stop the selenium server
rD[["server"]]$stop()

If I set i = 1 and click over to the browser window after the page has been navigated to, I get the desired result: 166 links, including the specific search-result links I'm trying to scrape:

> str(urlslist_final)
List of 1
 $ : chr [1:166] "https://media.npr.org/templates/favicon/favicon-180x180.png" "https://media.npr.org/templates/favicon/favicon-96x96.png" "https://media.npr.org/templates/favicon/favicon-32x32.png" "https://media.npr.org/templates/favicon/favicon-16x16.png" ...

However, if I just run my loop, I get only 91 links, and none of them are the actual results from the search:

> str(urlslist_final)
List of 2
$ : chr [1:91] "https://media.npr.org/templates/favicon/favicon-180x180.png" "https://media.npr.org/templates/favicon/favicon-96x96.png" "https://media.npr.org/templates/favicon/favicon-32x32.png" "https://media.npr.org/templates/favicon/favicon-16x16.png" ...

Can anyone help me understand why there is a difference? What can I do differently? I tried using rvest alone, but I couldn't get it to find the result links, which are embedded in the page's script.

1 answer:

Answer 0 (score: 0)

Thanks to my friend Thom, here is a good solution:

SELECT TOP 1 DATEPART(YEAR, Cancel) [Year],
DATEPART(Month, Cancel) [Month], COUNT(1) [Count]
FROM Subscription
where DATEPART(YEAR, Cancel) >= 2018
GROUP BY DATEPART(year, Cancel),DATEPART(Month, Cancel)
ORDER BY Count DESC

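I put this code between navigating to the page and capturing the links. It tricked the site into thinking I was using it properly, so I was able to scrape the links.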
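
The idea of doing something between navigation and link capture, so that the page finishes loading its results, might be sketched in R as below. This is a minimal sketch and not the exact code from the answer. It assumes that scrolling the page and then polling until enough links appear is sufficient to make the results render. The helper name `wait_for_links`, the `min_links` threshold of 100, and the timeout values are all hypothetical and chosen only for illustration.

```r
# Hypothetical helper: call `get_links()` repeatedly until it returns at
# least `min_links` items or `timeout` seconds have passed, then return
# whatever the last call produced.
wait_for_links <- function(get_links, min_links = 100, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  links <- get_links()
  while (length(links) < min_links && Sys.time() < deadline) {
    Sys.sleep(interval)
    links <- get_links()
  }
  links
}

# Usage inside the loop from the question (needs a live RSelenium session
# `remDr`, so it is left commented out here):
#
# remDr$navigate(url)
# # Assumption: scrolling counts as enough "interaction" for the page.
# remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
# urlslist_final[[i]] <- wait_for_links(function() {
#   webElems <- remDr$findElements(using = "css selector", "[href]")
#   unlist(sapply(webElems, function(x) x$getElementAttribute("href")))
# })
```

Polling on the number of links rather than sleeping for a fixed time lets the loop move on as soon as the results are present, while the timeout keeps it from hanging if they never appear.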