Question

I am using R to scrape some web pages. One of these pages redirects to a new page. When I use readLines on this page:

test <- readLines('http://zfin.org/cgi-bin/webdriver?MIval=aa-markerselect.apg&marker_type=GENE&query_results=t&input_name=anxa5b&compare=contains&WINSIZE=25')

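I get the page that is still redirecting, not the final page http://zfin.org/ZDB-GENE-030131-9076. I want to use the redirecting page because its URL contains input_name=anxa, which makes it easy to fetch the pages for different input names.

How can I get the HTML of the final page?

Redirecting page: http://zfin.org/cgi-bin/webdriver?MIval=aa-markerselect.apg&marker_type=GENE&query_results=t&input_name=anxa5b&compare=contains&WINSIZE=25

Final page: http://zfin.org/ZDB-GENE-030131-9076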

Answer 1

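I don't know how to wait for the redirect, but in the source of the page before it redirects you can see (inside a script tag) a JavaScript function, replaceLocation, that is called with the path you are redirected to: replaceLocation("/ZDB-GENE-030131-9076").

So I suggest you parse the source and extract this path. Here is my solution: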

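library(RCurl)  # getURL()
library(XML)    # htmlParse(), xpathSApply()

url <- "http://zfin.org/cgi-bin/webdriver?MIval=aa-markerselect.apg&marker_type=GENE&query_results=t&input_name=anxa5b&compare=contains&WINSIZE=25"

domain <- "http://zfin.org"

# Fetch the redirecting page and parse it as HTML
doc <- htmlParse(getURL(url, useragent='R'))

# Text content of every <script> element
scripts <- xpathSApply(doc, "//script", xmlValue)

# Keep the script in which "replaceLocation(" is not followed by u, r, or l
script <- scripts[which(lapply(lapply(scripts, grep, pattern = "replaceLocation\\([^url]"), length) > 0)]

# > script
# [1] "\n          \n\t    \n\t      replaceLocation(\"/ZDB-GENE-030131-9076\")\n            \n          \n\t"

# Take the text between the double quotes and prepend the domain
new.url <- paste0(domain, gsub('.*\\"(.*)\\".*', '\\1', script))

# Read the final page
readLines(new.url)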

xpathSApply(doc, "//script", xmlValue) gets all the scripts in the page source.

script <- scripts[which(lapply(lapply(scripts, grep, pattern = "replaceLocation\\([^url]"), length) > 0)] gets the script that contains the call with the redirect path.

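(The pattern "replaceLocation\\([^url]" has to exclude "url" because the page contains two replaceLocation calls: one is passed the object url, the other is passed an evaluated object, a string. Note that [^url] is a character class, so it matches any single character other than u, r, or l; it works here because the string argument begins with a double quote.)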
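To make the behaviour of that character class concrete, here is a small sketch in base R. The three strings are stand-ins I made up for the kinds of call that might appear in a page (the third, replaceLocation(target), is purely hypothetical), and the stricter pattern that requires an opening double quote is my suggestion, not something from the page itself.

```r
# Stand-in strings for the calls discussed above (assumed shapes, not scraped).
calls <- c(
  'replaceLocation(url)',                      # variable argument
  'replaceLocation("/ZDB-GENE-030131-9076")',  # string literal argument
  'replaceLocation(target)'                    # hypothetical other variable
)

# [^url] matches one character that is not "u", "r", or "l".
loose <- grepl("replaceLocation\\([^url]", calls)

# Requiring the opening double quote states the intent directly.
strict <- grepl('replaceLocation\\("', calls)

print(loose)   # the hypothetical third call slips through
print(strict)  # only the string literal call matches
```

Both patterns select the right script on the page from the question; the stricter one is just less likely to pick up a call whose variable name happens to start with another letter.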

Finally, gsub('.*\\"(.*)\\".*', '\\1', script) keeps only the part of the script that is needed: the argument of the function, which is the path.
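Since the question mentions fetching pages for different input names, the steps above could be wrapped up roughly as follows. This is only a sketch under some assumptions: it uses a regular expression on the raw page text instead of htmlParse, the function names (build_query_url, extract_redirect_path, resolve_final_url) are mine, and the default fetch uses readLines, which you may want to swap for getURL with a user agent as in the code above. The fetch argument is injectable so the logic can be tried without touching the network.

```r
# Build the query URL for a gene name (parameters copied from the question).
build_query_url <- function(input_name) {
  paste0(
    "http://zfin.org/cgi-bin/webdriver?MIval=aa-markerselect.apg",
    "&marker_type=GENE&query_results=t&input_name=",
    URLencode(input_name, reserved = TRUE),
    "&compare=contains&WINSIZE=25"
  )
}

# Pull the quoted argument of replaceLocation("...") out of the page text.
# Returns NA_character_ when no such call is present.
extract_redirect_path <- function(html) {
  m <- regmatches(html, regexpr('replaceLocation\\("[^"]*"\\)', html))
  if (length(m) == 0) return(NA_character_)
  sub('^replaceLocation\\("([^"]*)"\\)$', "\\1", m)
}

# Resolve the final URL; fetch must return the page as a single string.
resolve_final_url <- function(input_name,
                              domain = "http://zfin.org",
                              fetch = function(u) {
                                paste(readLines(u, warn = FALSE), collapse = "\n")
                              }) {
  path <- extract_redirect_path(fetch(build_query_url(input_name)))
  if (is.na(path)) NA_character_ else paste0(domain, path)
}

# Offline check: the stub returns script text like the one printed above.
stub <- function(u) "\n  replaceLocation(\"/ZDB-GENE-030131-9076\")\n"
print(resolve_final_url("anxa5b", fetch = stub))
```

With a real connection you would call resolve_final_url("anxa5b") and pass the result to readLines. A name that does not lead to a redirect gives NA, so check for that before reading.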

Hope this helps!
