This question has been asked before, but I haven't found a solution. I want to scrape some zipped .dat files from a website. This is what I have so far:
library(XML)

url <- c("http://blablabla")
zipped <- htmlParse(url)
# collect every <a> node, then keep only the hrefs that match the zip pattern
nodes_a <- getNodeSet(zipped, "//a")
files <- grep("*.zip", sapply(nodes_a, function(nodes_a) xmlGetAttr(nodes_a, "href")), value = TRUE)
urls <- paste(url, files, sep = "")
Then I use this:
mapply(function(x, y) download.file(x, y), urls, files)
This is the error message I get:
Error in mapply(function(x, y) download.file(x, y), urls, files) :
zero-length inputs cannot be mixed with those of non-zero length
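As far as I can tell, the error means one of urls and files has length zero while the other does not; presumably the grep() pattern matched no hrefs. A hypothetical one-liner that triggers the error the same way (the file name is made up):
mapply(function(x, y) download.file(x, y), character(0), "file1.zip")
## Error: zero-length inputs cannot be mixed with those of non-zero length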
Any hints?
Answer 0 (score: 0)
The utterly useless "please send us your email" page introduces a wrinkle: we have to maintain state for any further navigation or downloading, starting with the page that has the registration form and scraping it to get the "authenticator" token that gets carried over from that page to the next request (most likely for security purposes).
library(curlconverter)
library(xml2)
library(httr)
library(rvest)
pg <- read_html("https://www.cpc.unc.edu/projects/china/data/datasets/data-downloads-registration")

# scrape the _authenticator token out of the registration form
html_nodes(pg, "input[name='_authenticator']") %>%
  html_attr("value") -> authenticator
I looked at the POST request the form makes using curlconverter (search SO for how to use it, or read the project's GitLab site) and came up with:
httr::POST(
  url = "https://www.cpc.unc.edu/projects/china/data/datasets/data-downloads-registration",
  httr::add_headers(
    `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:63.0) Gecko/20100101 Firefox/63.0",
    Referer = "https://www.cpc.unc.edu/projects/china/data/datasets/data-downloads-registration"
  ),
  httr::set_cookies(`restriction-/projects/china/data/datasets/data_downloads` = "/projects/china/data/datasets/data_downloads"),
  body = list(
    `first-name` = "Steve",
    `last-name` = "Rogers",
    `email-address` = "example@me.com",
    `interest` = "a researcher",
    `org` = "The Avengers",
    `department` = "Operations",
    `postal-address` = "1 Avengers Drive",
    `city-name` = "Undisclosed",
    `state-province` = "Virginia",
    `postal-code` = "09911",
    `country-name` = "US",
    `opt-in:boolean:default` = "",
    `fieldset` = "default",
    `form.submitted` = "1",
    `add_reference.field:record` = "",
    `add_reference.type:record` = "",
    `add_reference.destination:record` = "",
    `last_referer` = "https://www.cpc.unc.edu/projects/china/data/datasets",
    `_authenticator` = authenticator,
    `form_submit` = "Submit"
  ),
  encode = "multipart"
) -> res
(curlconverter made ^^ for you from a simple "Copy as cURL" of that particular request in Developer Tools.)
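For reference, a rough sketch of that curlconverter workflow; the exact calls below are from the GitLab-era version of the package and may have changed, so treat them as assumptions:
# with the "Copy as cURL" text for the POST request on the clipboard:
library(curlconverter)
req <- make_req(straighten())[[1]]  # straighten() parses the cURL command; make_req() builds httr request functions
resp <- req()                       # replay the captured request from R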
Hopefully you can see where the authenticator goes.
Now we get to go get the files. First, we need to get to the download page:
read_html(httr::content(res, as = "text")) %>%
  html_nodes(xpath = ".//p[contains(., 'You may now')]/strong/a") %>%
  html_attr("href") -> dl_pg_link
dl_pg <- httr::GET(url = dl_pg_link)
Then we need to get to the real download page:
httr::content(dl_pg, as = "text") %>%
  read_html() %>%
  html_nodes(xpath = ".//a[contains(@class, 'contenttype-folder state-published url')]") %>%
  html_attr("href") -> dls
Then we need to grab all the downloadable bits from that page:
zip_pg <- httr::GET(url = dls)

# rewrite the "/view" page URLs into direct "at_download/file" URLs
httr::content(zip_pg, as = "text") %>%
  read_html() %>%
  html_nodes("td > a") %>%
  html_attr("href") %>%
  gsub("view$", "at_download/file", .) -> dl_links
Here's how to get the first one (a PDF):
(fil1 <- httr::GET(dl_links[1]))
## Response [https://www.cpc.unc.edu/projects/china/data/datasets/data_downloads/longitudinal/weights-chns.pdf/at_download/file]
## Date: 2018-10-14 03:03
## Status: 200
## Content-Type: application/pdf
## Size: 197 kB
## <BINARY BODY>
fil1$headers[["content-disposition"]]
## [1] "attachment; filename=\"weights-chns.pdf\""
writeBin(
  httr::content(fil1, as = "raw"),
  file.path("~/Data", gsub('"', '', strsplit(fil1$headers[["content-disposition"]], "=")[[1]][2]))
)
And here's how to get the second one (a ZIP):
(fil2 <- httr::GET(dl_links[2]))
## Response [https://www.cpc.unc.edu/projects/china/data/datasets/data_downloads/longitudinal/Biomarker_2012Dec.zip/at_download/file]
## Date: 2018-10-14 03:06
## Status: 200
## Content-Type: application/zip
## Size: 2.37 MB
## <BINARY BODY>
fil2$headers[["content-disposition"]]
## [1] "attachment; filename=\"Biomarker_2012Dec.zip\""
writeBin(
  httr::content(fil2, as = "raw"),
  file.path("~/Data", gsub('"', '', strsplit(fil2$headers[["content-disposition"]], "=")[[1]][2]))
)
You can turn ^^ into an iterative operation.
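For example, a minimal sketch of that loop, assuming the ~/Data directory exists and that every link responds with a content-disposition header like the ones above:
for (link in dl_links) {
  resp <- httr::GET(link)
  # pull the server-suggested file name out of the content-disposition header
  fname <- gsub('"', '', strsplit(resp$headers[["content-disposition"]], "=")[[1]][2])
  writeBin(httr::content(resp, as = "raw"), file.path("~/Data", fname))
  Sys.sleep(1)  # be polite to the server between requests
}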
Note that you have to start from the top of this (i.e., from the email form page) every time you start a new R session, since the underlying curl package (which powers httr and rvest) maintains the session state (the cookies) for you.