I am trying to extract URLs from the following website. The tricky part is that the site automatically loads new pages (lazy loading). I have not managed to build an XPath that scrapes all the URLs, including those on the newly loaded pages - I only get the first 15 URLs (out of more than 70). I assume the XPath in the last line (new_results ...) is missing some key element that would also cover the later pages. Any ideas? Thanks!
# load packages
library(rvest)
library(httr)
library(RCurl)
library(XML)
library(stringr)
library(xml2)
# aim: download all speeches stored at:
# https://sheikhmohammed.ae/en-us/Speeches
# first, create vector which stores all urls to each single speech
all_links <- character()
new_results <- "/en-us/Speeches"
# path to the CA bundle shipped with RCurl
signatures <- system.file("CurlSSL", "cacert.pem", package = "RCurl")
options(RCurlOptions = list(verbose = FALSE, capath = system.file("CurlSSL", "cacert.pem", package = "RCurl"), ssl.verifypeer = FALSE))
while (length(new_results) > 0) {
  new_results <- str_c("https://sheikhmohammed.ae", new_results)
  results <- getURL(new_results, cainfo = signatures)
  results_tree <- htmlParse(results)
  # collect the speech URLs found on the current page
  all_links <- c(all_links, xpathSApply(results_tree, "//div[@class='speech-share-board']", xmlGetAttr, "data-url"))
  # this is the line I suspect: it is meant to find the next batch of
  # speeches, but it only ever yields the first 15 URLs
  new_results <- xpathSApply(results_tree, "//div[@class='speech-share-board']//after", xmlGetAttr, "data-url")
}
# or, alternatively, with phantomjs (this also loads only the first 15 URLs):
url <- "https://sheikhmohammed.ae/en-us/Speeches#"
# write out a script phantomjs can process
writeLines(sprintf("var page = require('webpage').create();
page.open('%s', function () {
console.log(page.content); //page source
phantom.exit();
});", url), con="scrape.js")
# process it with phantomjs
write(readLines(pipe("phantomjs scrape.js", "r")), "scrape.html")
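If the phantomjs snapshot did contain the full list, the saved HTML could then be parsed with rvest. A minimal sketch, assuming scrape.html exists and the rendered page keeps the same speech-share-board divs as the static source:
# parse the rendered snapshot and pull the data-url attributes
page <- read_html("scrape.html")
links <- html_attr(html_nodes(page, "div.speech-share-board"), "data-url")
head(links)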
Answer 0 (score: 1)
Running the JavaScript for the lazy loading in RSelenium (or Selenium in Python) would be the most elegant way to solve the problem. However, as a less elegant but faster alternative, you can manually change the settings of the JSON query in the Firefox developer tools / network tab so that it loads not just 15 but more (= all) speeches at once. That worked fine for me, and I was able to extract all the links from the JSON response.
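For the RSelenium route, a minimal sketch (untested against the live site; the number of scroll iterations and the pause length are guesses to tune) that scrolls to the bottom repeatedly so the lazy loading keeps appending speeches, then harvests the data-url attributes:
library(RSelenium)
# start a browser session (assumes a local Selenium/geckodriver setup)
rD <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- rD$client
remDr$navigate("https://sheikhmohammed.ae/en-us/Speeches")
# scroll down repeatedly so the page keeps appending new speeches;
# 10 iterations and a 2-second pause are guesses, adjust as needed
for (i in 1:10) {
  remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
  Sys.sleep(2)
}
# collect the data-url attribute from every speech block now in the DOM
boards <- remDr$findElements(using = "xpath", "//div[@class='speech-share-board']")
all_links <- vapply(boards, function(el) el$getElementAttribute("data-url")[[1]], character(1))
remDr$close()
rD$server$stop()

And for the JSON route: once the request that fetches the next batch shows up in the network tab, it can be replayed from R with a larger page size. The endpoint URL and parameter names below are placeholders, not a documented API - read the real ones off the request shown in the Firefox network tab and substitute them:
library(httr)
library(jsonlite)
# hypothetical request -- replace URL and query parameters with the
# ones copied from the network tab
resp <- GET("https://sheikhmohammed.ae/en-us/Speeches",
            query = list(page = 1, pageSize = 100))
speeches <- fromJSON(content(resp, as = "text"))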