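R web scraping multiple zipped files

Time: 2018-05-30 06:54:31

Tags: r web-scraping zip

This question has been asked before, but I have not found a solution that works for me. I want to scrape some zipped .dat files from a website. This is what I have so far: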

library(XML)
url<-c("http://blablabla")
zipped <-htmlParse(url)
nodes_a<-getNodeSet(zipped,"//a")
files<-grep("*.zip",sapply(nodes_a, function(nodes_a) 
xmlGetAttr(nodes_a,"href")),value=TRUE)
urls<-paste(url,files,sep="")

Then I use this:

mapply(function(x,y) download.file(x,y),urls,files)

This is the error message I get:

Error in mapply(function(x, y) download.file(x, y), urls, files) : 
 zero-length inputs cannot be mixed with those of non-zero length

Any hints?

1 answer:

Answer 0 (score: 0)

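The completely useless "please give us your email" page introduces a condition where we have to maintain state for any further navigation or downloads. We start at the page with the registration form and scrape it to get the "authenticator token", which is carried forward from that page to the next request (mostly for security purposes):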

library(curlconverter)
library(xml2)
library(httr)
library(rvest)

pg <- read_html("https://www.cpc.unc.edu/projects/china/data/datasets/data-downloads-registration")

html_nodes(pg, "input[name='_authenticator']") %>% 
  html_attr("value") -> authenticator
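To see what that selector does without touching the site, here is a minimal sketch run against an inline HTML fragment. The fragment, the `demo_*` names, and the token value `abc123` are made up for illustration; the real page's markup may differ.

```r
library(rvest)

# Hypothetical stand-in for the registration page (not the real markup)
demo_html <- '<form method="post">
  <input type="hidden" name="_authenticator" value="abc123">
  <input type="text" name="first-name">
</form>'

demo_pg <- read_html(demo_html)

# Same CSS selector as above: pick the hidden input by its name attribute
demo_token <- demo_pg %>%
  html_nodes("input[name='_authenticator']") %>%
  html_attr("value")

# Fail early if no single, non-empty token was found
stopifnot(length(demo_token) == 1, nzchar(demo_token))
```

A check like the final `stopifnot()` is worth keeping in the real script too, since an empty `authenticator` would otherwise only show up later as a failed form submission.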

I looked at the request the form's POST makes using curlconverter (look up how to use it on SO or read the GitLab project site) and came up with:

httr::POST(
  url = "https://www.cpc.unc.edu/projects/china/data/datasets/data-downloads-registration",
  httr::add_headers(
    `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:63.0) Gecko/20100101 Firefox/63.0",
    Referer = "https://www.cpc.unc.edu/projects/china/data/datasets/data-downloads-registration"
  ),
  httr::set_cookies(`restriction-/projects/china/data/datasets/data_downloads` = "/projects/china/data/datasets/data_downloads"),
  body = list(
    `first-name` = "Steve",
    `last-name` = "Rogers",
    `email-address` = "example@me.com",
    `interest` = "a researcher",
    `org` = "The Avengers",
    `department` = "Operations",
    `postal-address` = "1 Avengers Drive",
    `city-name` = "Undisclosed",
    `state-province` = "Virginia",
    `postal-code` = "09911",
    `country-name` = "US",
    `opt-in:boolean:default` = "",
    `fieldset` = "default",
    `form.submitted` = "1",
    `add_reference.field:record` = "",
    `add_reference.type:record` = "",
    `add_reference.destination:record` = "",
    `last_referer` = "https://www.cpc.unc.edu/projects/china/data/datasets",
    `_authenticator` = authenticator,
    `form_submit` = "Submit"
  ), 
  encode = "multipart"
) -> res

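(curlconverter makes ^^ for you from a simple "Copy as cURL" of the specific request in Developer Tools.)

Hopefully you can see where the authenticator goes.

Now we can get to the files.

First, we need to get to the download page: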

read_html(httr::content(res, as = "text")) %>% 
  html_nodes(xpath=".//p[contains(., 'You may now')]/strong/a") %>% 
  html_attr("href") -> dl_pg_link

dl_pg <- httr::GET(url = dl_pg_link)
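The code above assumes the scraped `href` is already a full URL. If a page returns relative links instead, `xml2::url_absolute()` can resolve them against the page they came from. The sketch below uses hypothetical `href` values to show that relative and absolute forms resolve to the same address; whether the real page uses relative links is not something the answer states.

```r
library(xml2)

# Hypothetical href values, in the forms html_attr("href") might return
demo_hrefs <- c(
  "data_downloads",
  "/projects/china/data/datasets/data_downloads",
  "https://www.cpc.unc.edu/projects/china/data/datasets/data_downloads"
)

# The page the links were scraped from
demo_base <- "https://www.cpc.unc.edu/projects/china/data/datasets/data-downloads-registration"

# Resolve every href against the base; absolute URLs pass through unchanged
demo_resolved <- url_absolute(demo_hrefs, demo_base)
demo_resolved
```

This is also more robust than `paste(url, files, sep = "")`, which only works when every link is relative to exactly the pasted prefix.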

Then we need to get to the *real* download page:

httr::content(dl_pg, as = "text") %>% 
  read_html() %>% 
  html_nodes(xpath=".//a[contains(@class, 'contenttype-folder state-published url')]") %>% 
  html_attr("href") -> dls

Then, we need to get all the downloadable bits from that page:

zip_pg <- httr::GET(url = dls)

httr::content(zip_pg, as = "text") %>% 
  read_html() %>% 
  html_nodes("td > a") %>% 
  html_attr("href") %>% 
  gsub("view$", "at_download/file", .) -> dl_links
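The last step rewrites each "view" link into a direct download link. A small offline sketch of that rewrite, using made-up `example.org` URLs, shows one edge case: the pattern `view$` also matches any link that merely ends in the letters "view". Anchoring on the slash is a slightly stricter variant, assuming the real links always end in `/view`.

```r
# Hypothetical links; only the first is a file "view" page
demo_links <- c(
  "https://example.org/files/Biomarker_2012Dec.zip/view",
  "https://example.org/files/overview"
)

# Pattern as used above: rewrites both links, including "overview"
loose <- gsub("view$", "at_download/file", demo_links)

# Stricter variant: only rewrites a trailing "/view" path segment
strict <- gsub("/view$", "/at_download/file", demo_links)

loose
strict
```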

Here's how to get the first one:

(fil1 <- httr::GET(dl_links[1]))
## Response [https://www.cpc.unc.edu/projects/china/data/datasets/data_downloads/longitudinal/weights-chns.pdf/at_download/file]
##   Date: 2018-10-14 03:03
##   Status: 200
##   Content-Type: application/pdf
##   Size: 197 kB
## <BINARY BODY>

fil1$headers[["content-disposition"]]
## [1] "attachment; filename=\"weights-chns.pdf\""

# Save the raw body under the file name given in the content-disposition header
writeBin(
  httr::content(fil1, as = "raw"),
  file.path("~/Data", gsub('"', '', strsplit(fil1$headers[["content-disposition"]], "=")[[1]][2]))
)

(fil2 <- httr::GET(dl_links[2]))
## Response [https://www.cpc.unc.edu/projects/china/data/datasets/data_downloads/longitudinal/Biomarker_2012Dec.zip/at_download/file]
##   Date: 2018-10-14 03:06
##   Status: 200
##   Content-Type: application/zip
##   Size: 2.37 MB
## <BINARY BODY>

(The first file is a PDF.) The second one is a ZIP, and it can be saved the same way:

fil2$headers[["content-disposition"]]
## [1] "attachment; filename=\"Biomarker_2012Dec.zip\""

# Save the raw body under the file name given in the content-disposition header
writeBin(
  httr::content(fil2, as = "raw"),
  file.path("~/Data", gsub('"', '', strsplit(fil2$headers[["content-disposition"]], "=")[[1]][2]))
)
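The file name is pulled out of the `content-disposition` header by splitting on `=` and stripping quotes. As a sketch, the same idea can be wrapped in a small helper that also copes with a missing header and drops any directory part from the name. `cd_filename()` is a hypothetical name, not part of httr, and it only handles the simple `filename="..."` form shown above.

```r
# Hypothetical helper: extract the file name from a content-disposition value
cd_filename <- function(cd) {
  if (is.null(cd) || is.na(cd)) return(NA_character_)
  # Find 'filename=' followed by an optionally quoted name
  m <- regmatches(cd, regexpr('filename="?[^";]+', cd))
  if (length(m) == 0) return(NA_character_)
  # Strip the 'filename=' prefix and opening quote, keep only the base name
  basename(sub('^filename="?', "", m))
}

cd_filename("attachment; filename=\"weights-chns.pdf\"")
cd_filename("attachment; filename=Biomarker_2012Dec.zip")
cd_filename("inline")
```

Returning `NA` rather than failing lets the caller decide what to do when a response carries no usable file name.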

You can turn ^^ into an iterative operation.
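A minimal sketch of such a loop follows. `download_all()` is a hypothetical helper name; it assumes every successful response carries a `content-disposition` header in the form shown above, and it skips (returns `NA` for) any link that errors or lacks that header. It must run in the same R session as the registration POST so the cookies are still in place.

```r
library(httr)

# Hypothetical helper: fetch each link and save it under the server-supplied name
download_all <- function(links, dest_dir = "~/Data") {
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  vapply(links, function(link) {
    resp <- GET(link)
    if (http_error(resp)) return(NA_character_)        # skip failed requests
    cd <- headers(resp)[["content-disposition"]]
    if (is.null(cd)) return(NA_character_)             # no file name to use
    fname <- gsub('"', "", strsplit(cd, "=")[[1]][2])  # same parsing as above
    dest <- file.path(dest_dir, basename(fname))
    writeBin(content(resp, as = "raw"), dest)
    dest                                               # path of the saved file
  }, character(1), USE.NAMES = FALSE)
}

# Usage, assuming dl_links was built as shown earlier:
# saved <- download_all(dl_links)
```

Unlike the `mapply()` call in the question, `vapply()` over an empty `links` vector simply returns an empty character vector, so an empty scrape result does not raise an error here.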

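Note that you have to start from the top (i.e., from the enter-your-email form page) every time you start a new R session, since the underlying curl package (which powers httr and rvest) maintains session state for you (in cookies).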