Finding elements in a web page - RSelenium / rvest

Date: 2018-11-10 19:35:25

Tags: r rvest rselenium

I am trying to collect all the individual URLs (the lawyers' profile URLs) from this site: https://www.linklaters.com/en/find-a-lawyer. I cannot find a way to extract the URLs; when I use a CSS selector it does not work. Can you suggest another way to locate specific elements on a web page? Also, to collect all the data I need to click the "Load more" button, and for that I am using RSelenium. I don't think I have set up RSelenium via Docker correctly, because it throws this error:

Error in checkError(res): Undefined error in httr call. httr output: Failed to connect to localhost port 4445: Connection refused
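(For reference, that "Connection refused" on port 4445 usually means no Selenium server is listening there. A minimal sketch of starting one from R without Docker, using RSelenium::rsDriver(); the assumption is that a compatible Chrome/chromedriver is installed and port 4445 is free. If you do use Docker, the container has to map that port, e.g. docker run -d -p 4445:4444 selenium/standalone-chrome.)

library(RSelenium)

# Start a local Selenium server plus a Chrome client on port 4445
# (rsDriver() manages the driver binaries via wdman).
driver <- rsDriver(browser = "chrome", port = 4445L, verbose = FALSE)
remDr  <- driver$client   # already opened by rsDriver()

# ... scraping code ...

remDr$close()
driver$server$stop()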

library(dplyr)
library(rvest)
library(stringr)
library(RSelenium)

link = "https://www.linklaters.com/en/find-a-lawyer"
hlink = read_html(link)
urls <- hlink %>%
        html_nodes(".listCta__subtitle--top") %>%
        html_attr("href")
urls <- as.data.frame(urls, stringsAsFactors = FALSE)
names(urls) <- "urls"

remDr <- RSelenium::remoteDriver(remoteServerAddr = "localhost",
                                 port = 4445L,
                                 browserName = "chrome")
remDr$open()

replicate(20,
          {       # scroll down
                  webElem <- remDr$findElement("css", "body")
                  webElem$sendKeysToElement(list(key = "end"))
                  # find button
                  allURL <- remDr$findElement(using = "css selector", ".listCta__subtitle--top")
                  # click button
                  allURL$clickElement()
                  Sys.sleep(6)
          })

allURL <- xml2::read_html(remDr$getPageSource()[[1]]) %>%
        rvest::html_nodes(".field--type-ds a") %>%
        rvest::html_attr("href")

1 Answer:

Answer 0 (Score: 1)

The page just loads the dynamic data via XHR requests. Simply grab the lovely JSON:

jsonlite::fromJSON("https://www.linklaters.com/en/api/lawyers/getlawyers")
jsonlite::fromJSON("https://www.linklaters.com/en/api/lawyers/getlawyers?searchTerm=&sort=asc&showing=30")
jsonlite::fromJSON("https://www.linklaters.com/en/api/lawyers/getlawyers?searchTerm=&sort=asc&showing=60")
jsonlite::fromJSON("https://www.linklaters.com/en/api/lawyers/getlawyers?searchTerm=&sort=asc&showing=90")

Keep incrementing showing by 30 until you get an error result, ideally with a 5 s sleep between requests so you don't hammer the server. A loop like the sketch below follows that approach.
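(A minimal sketch of that loop, under the assumptions that the endpoint keeps returning the same JSON structure as above and that an error or empty response signals the end; the contents of res are not inspected further here.)

library(jsonlite)

base <- "https://www.linklaters.com/en/api/lawyers/getlawyers?searchTerm=&sort=asc&showing=%d"

showing <- 30
pages <- list()
repeat {
        # fetch the next batch; NULL on any request/parse error
        res <- tryCatch(fromJSON(sprintf(base, showing)), error = function(e) NULL)
        if (is.null(res) || length(res) == 0) break   # stop on error or empty result
        pages[[length(pages) + 1]] <- res
        showing <- showing + 30                       # ask for 30 more next time
        Sys.sleep(5)                                  # be polite between requests
}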