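My question is about web scraping with RSelenium.
I am trying to scrape data from the following website using RSelenium:
https://www.nhtsa.gov/ratings
My current difficulty is moving between the result pages for a given car manufacturer.
Here is my code so far: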
library(RSelenium)
#opens a connection
rD <- rsDriver()
remDr <- rD$client
#goes to the page we want
url <- "https://www.nhtsa.gov/ratings"
remDr$navigate(url)
#clicking to open the manufacturer selection "page"
webElem <- remDr$findElement(using = 'css selector', "#vehicle a")
webElem$clickElement()
#opening the options menu
option.menu <- remDr$findElement(using='css selector', 'select')
option.menu$clickElement()
#selecting one maker, loop over this later
maker.select <- remDr$findElement(using = 'xpath', "//*/option[@value = 'AUDI']")
maker.select$clickElement()
#search our selection
maker.click<-remDr$findElement(using='css selector', '.manufacturer-search-submit')
maker.click$clickElement()
#now we have to go through each car (10 per page), loop later
cars<-remDr$findElement(using='css selector', 'tbody:nth-child(6) a')
individual.link<-cars$getElementAttribute("href")
#going to the next page
next_page<-remDr$findElement(using='css selector', 'button.btn.link-arrow::after')
next_page$clickElement()
But I get this error:
Error: Summary: NoSuchElement
Detail: An element could not be located on the page using the given search parameters.
Further Details: run errorDetails method
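As you can probably tell, I am new to RSelenium. Any help would be greatly appreciated. Thanks in advance.
Answer 0 (score: 0)
Here is another approach that might be useful.
You can access the data simply by sending a GET request to the site. On the website (first page), we can see that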
'https://api.nhtsa.gov/vehicles/byManufacturer?offset=0&max=10&sort=overallRating&order=desc&data=crashtestratings,recommendedfeatures&productDetail=all&dateStart=2011-01-01&manufacturerName=AUDI&dateEnd=3000-01-01&name='
is where the data are fetched from. The second page uses offset=10, then 20, 30, etc.
If api_url is defined as the URL above, we can use httr:
# request the data
request <- httr::GET(api_url)
# retrieve the content
request_content <- httr::content(request)
request_result <- request_content$results
# request results contains the data of interest
# A few glimpses into the data
# The first model
request_result[[1]]$vehicleModel
# [1] "A3"
request_result[[1]]$modelYear
# [1] 2018
request_result[[1]]$manufacturer
# [1] "AUDI OF AMERICA, INC"
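Since only offset (and, if desired, manufacturerName) changes between requests, the URL can be assembled by a small helper. This is a minimal sketch in base R: build_nhtsa_url and its argument names are hypothetical, and all other query parameters are copied unchanged from the URL above.

```r
# Hypothetical helper (not part of the original answer): builds the request
# URL for one page of results. Only offset, max and manufacturerName vary;
# the remaining query parameters are copied from the URL shown above.
build_nhtsa_url <- function(manufacturer, offset = 0L, page_size = 10L) {
  paste0(
    "https://api.nhtsa.gov/vehicles/byManufacturer",
    "?offset=", sprintf("%d", as.integer(offset)),
    "&max=", sprintf("%d", as.integer(page_size)),
    "&sort=overallRating&order=desc",
    "&data=crashtestratings,recommendedfeatures",
    "&productDetail=all&dateStart=2011-01-01",
    # percent-encode the name in case it contains spaces or symbols
    "&manufacturerName=", utils::URLencode(manufacturer, reserved = TRUE),
    "&dateEnd=3000-01-01&name="
  )
}

api_url <- build_nhtsa_url("AUDI")                        # first page
second_page_url <- build_nhtsa_url("AUDI", offset = 10L)  # second page
```

With the defaults, build_nhtsa_url("AUDI") reproduces the first-page URL quoted above character for character.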
Now, by varying offset, it is straightforward to build a loop that collects all pages:
out <- list()
k <- 0L
i <- 1L
while (k < 1e+3) {
req_url <- paste0('https://api.nhtsa.gov/vehicles/byManufacturer?offset=',
k,
'&max=10&sort=overallRating&order=desc&data=crashtestratings,recommendedfeatures&productDetail=all&dateStart=2011-01-01&manufacturerName=AUDI&dateEnd=3000-01-01&name=')
req <- httr::content(httr::GET(req_url))$results
if (length(req) == 0) break
out[[i]] <- req
cat(paste0('\nAdded content for offset \t', k))
i <- i + 1L
k <- k + 10L
}
lengths(out)
# [1] 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
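The same "request pages until one comes back empty" pattern can be separated from the HTTP call, which makes it possible to try the loop without touching the network. The sketch below is an illustration, not the original author's code: collect_pages, fetch_page, and the fake data are all hypothetical names introduced here.

```r
# Hypothetical helper: keeps asking for pages until an empty one comes back.
# `fetch_page` is any function that takes an offset and returns a list of records.
collect_pages <- function(fetch_page, page_size = 10L, max_offset = 1000L, pause = 0) {
  out <- list()
  offset <- 0L
  while (offset < max_offset) {
    page <- fetch_page(offset)
    if (length(page) == 0) break       # empty page: nothing left to collect
    out[[length(out) + 1L]] <- page
    offset <- offset + page_size
    Sys.sleep(pause)                   # optional delay between requests
  }
  out
}

# Offline stand-in for the API: 25 made-up records served 10 at a time.
fake_records <- lapply(seq_len(25), function(i) list(vehicleModel = paste0("M", i)))
fake_fetch <- function(offset) {
  idx <- seq_along(fake_records)
  fake_records[idx > offset & idx <= offset + 10L]
}

pages <- collect_pages(fake_fetch)
lengths(pages)
# [1] 10 10  5
```

Against the real service, fetch_page would presumably be a function that builds the request URL for the given offset and returns httr::content(httr::GET(url))$results, as in the loop above; a non-zero pause keeps the requests spaced out.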
Note that you can also change manufacturerName in the URL, and adjust the other query parameters, to get clean, tailored data.
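Once the pages are collected, the nested list can be flattened into a data frame. The following is a rough sketch under stated assumptions: the out list here is a small hand-made stand-in with the same shape as the one built by the loop (a list of pages, each a list of records), the records other than the A3 are invented placeholders, and only the three fields shown earlier (vehicleModel, modelYear, manufacturer) are used. pick and cars_df are hypothetical names.

```r
# Stand-in for the `out` list built by the loop above.
# Only the first record mirrors the output shown earlier; the rest are placeholders.
out <- list(
  list(
    list(vehicleModel = "A3", modelYear = 2018, manufacturer = "AUDI OF AMERICA, INC"),
    list(vehicleModel = "A4", modelYear = 2017, manufacturer = "AUDI OF AMERICA, INC")
  ),
  list(
    list(vehicleModel = "Q5", modelYear = 2016, manufacturer = "AUDI OF AMERICA, INC")
  )
)

# Drop the page level so that every element is a single vehicle record.
records <- unlist(out, recursive = FALSE)

# Extract one field as text, returning NA when the field is missing.
pick <- function(rec, field) {
  value <- rec[[field]]
  if (is.null(value)) NA_character_ else as.character(value)
}

cars_df <- data.frame(
  model        = vapply(records, pick, character(1), field = "vehicleModel"),
  year         = as.integer(vapply(records, pick, character(1), field = "modelYear")),
  manufacturer = vapply(records, pick, character(1), field = "manufacturer"),
  stringsAsFactors = FALSE
)
cars_df
```

Going through as.character first keeps vapply type-stable even when a record lacks a field; the year column is converted back to integer afterwards.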