Using R to accept cookies to download a PDF file

Date: 2016-01-06 00:40:40

Tags: r curl web-scraping httr

I am having trouble downloading a PDF.

For example, if I have the DOI for a PDF document on the Archaeology Data Service, it resolves to a landing page with an embedded link to the PDF, but that link really redirects to a different URL.

library(httr) handles resolving the DOI, and we can use library(XML) to extract the PDF URL from the landing page, but I am stuck on getting the PDF itself.
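Roughly, the part that already works looks like this (a minimal sketch; the DOI below is only a placeholder, not this document's real DOI):

library(httr)
library(XML)

# placeholder DOI, not the real one for this document
doi_url <- "https://doi.org/10.5284/1000000"

# httr follows the redirects from the DOI to the ADS landing page
landing <- GET(doi_url)

# parse the landing page HTML and pull out any link ending in .pdf
doc <- htmlParse(content(landing, "text"), asText = TRUE)
pdf_url <- xpathSApply(doc, "//a[contains(@href, '.pdf')]/@href")
pdf_url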

If I do this:

download.file("http://archaeologydataservice.ac.uk/archiveDS/archiveDownload?t=arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf", destfile = "tmp.pdf")

then I just get an HTML file that is the same as the one at http://archaeologydataservice.ac.uk/myads/

Trying the answer at How to use R to download a zipped file from a SSL page that requires cookies got me this far:

library(httr)

terms <- "http://archaeologydataservice.ac.uk/myads/copyrights"
download <- "http://archaeologydataservice.ac.uk/archiveDS/archiveDownload"
values <- list(agree = "yes", t = "arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf")

# Accept the terms on the form,
# generating the appropriate cookies

POST(terms, body = values)
GET(download, query = values)

# Actually download the file (this will take a while)

resp <- GET(download, query = values)

# write the content of the download to a binary file

writeBin(content(resp, "raw"), "c:/temp/thefile.zip")

But after the POST and GET calls I still just get the HTML of the same cookie page that download.file returned:

> GET(download, query = values)
Response [http://archaeologydataservice.ac.uk/myads/copyrights?from=2f6172636869766544532f61726368697665446f776e6c6f61643f61677265653d79657326743d617263682d313335322d3125324664697373656d696e6174696f6e2532467064662532464479666564253246474c34343030342e706466]
  Date: 2016-01-06 00:35
  Status: 200
  Content-Type: text/html;charset=UTF-8
  Size: 21 kB
<?xml version='1.0' encoding='UTF-8' ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "h...
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
        <head>
            <meta http-equiv="Content-Type" content="text/html; c...


            <title>Archaeology Data Service:  myADS</title>

            <link href="http://archaeologydataservice.ac.uk/css/u...
...

Looking at http://archaeologydataservice.ac.uk/about/Cookies, the cookie situation at this site seems rather involved. This kind of cookie complexity does not seem unusual for UK data providers: automating the login to the uk data service website in R with RCurl or httr

How can I use R to get past the cookies on this website?

2 Answers:

Answer 0 (score: 6):

Your plea was heard over at rOpenSci!

There is quite a bit of javascript between those pages, which makes trying to decipher it with httr + rvest somewhat annoying. Try RSelenium. This worked on OS X 10.11.2 with R 3.2.3 and Firefox loaded.

library(RSelenium)

# check if a server is present, if not, get a server
checkForServer()

# get the server going
startServer()

dir.create("~/justcreateddir")
setwd("~/justcreateddir")

# we need PDFs to download instead of display in-browser
prefs <- makeFirefoxProfile(list(
  `browser.download.folderList` = as.integer(2),
  `browser.download.dir` = getwd(),
  `pdfjs.disabled` = TRUE,
  `plugin.scan.plid.all` = FALSE,
  `plugin.scan.Acrobat` = "99.0",
  `browser.helperApps.neverAsk.saveToDisk` = 'application/pdf'
))
# get a browser going
dr <- remoteDriver$new(extraCapabilities=prefs)
dr$open()

# go to the page with the PDF
dr$navigate("http://archaeologydataservice.ac.uk/archives/view/greylit/details.cfm?id=17755")

# find the PDF link and "hit ENTER"
pdf_elem <- dr$findElement(using="css selector", "a.dlb3")
pdf_elem$sendKeysToElement(list("\uE007"))

# find the ACCEPT button and "hit ENTER"
# that will save the PDF to the default downloads directory
accept_elem <- dr$findElement(using="css selector", "a[id$='agreeButton']")
accept_elem$sendKeysToElement(list("\uE007"))

Now wait for the download to complete. The R console will not be busy while the file downloads, so it is easy to accidentally close the session before the download has finished.
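If you would rather have the script wait explicitly, a small polling loop could sit here before closing the session (this guard is not part of the original answer, and the file name is only assumed from the download URL):

# wait until the assumed file name shows up in the download directory
pdf_path <- file.path(getwd(), "GL44004.pdf")
while (!file.exists(pdf_path) || file.size(pdf_path) == 0) {
  Sys.sleep(1)
}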

# close the session
dr$close()

Answer 1 (score: 3):

This answer came from John Harrison by email and is posted at his request:

This will allow you to download the PDF:

appURL <- "http://archaeologydataservice.ac.uk/archiveDS/archiveDownload?t=arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf"
library(RCurl)
library(XML)
curl = getCurlHandle()
curlSetOpt(cookiefile="cookies.txt"
           , curl=curl, followLocation = TRUE)
pdfData <- getBinaryURL(appURL, curl = curl, .opts = list(cookie = "ADSCOPYRIGHT=YES"))
writeBin(pdfData, "test2.pdf")

Here is a longer version that shows how it works:

appURL <- "http://archaeologydataservice.ac.uk/archiveDS/archiveDownload?t=arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf"
library(RCurl)
library(XML)
curl = getCurlHandle()
curlSetOpt(cookiefile="cookies.txt"
           , curl=curl, followLocation = TRUE)
appData <- getURL(appURL, curl = curl)

# get the necessary elements for the POST that is initiated when the ACCEPT button is pressed

doc <- htmlParse(appData)
appAttrs <- doc["//input", fun = xmlAttrs]
postData <- lapply(appAttrs, function(x){data.frame(name = x[["name"]], value = x[["value"]]
                                                    , stringsAsFactors = FALSE)})
postData <- do.call(rbind, postData)

# post your acceptance
postURL <- "http://archaeologydataservice.ac.uk/myads/copyrights.jsf;jsessionid="
# get jsessionid
jsessionid <- unlist(strsplit(getCurlInfo(curl)$cookielist[1], "\t"))[7]

searchData <- postForm(paste0(postURL, jsessionid), curl = curl,
                       "j_id10" = "j_id10",
                       from = postData[postData$name == "from", "value"],
                       "javax.faces.ViewState" = postData[postData$name == "javax.faces.ViewState", "value"],
                       "j_id10:_idcl" = "j_id10:agreeButton"
                       , binary = TRUE
)
con <- file("test.pdf", open = "wb")
writeBin(searchData, con)
close(con)


Pressing the ACCEPT button on the page you gave initiates a POST to "http://archaeologydataservice.ac.uk/myads/copyrights.jsf;jsessionid=......" via some javascript.
This POST then redirects to the page with the PDF, having set some additional cookies along the way.

Checking our cookies we see:

> getCurlInfo(curl)$cookielist
[1] "archaeologydataservice.ac.uk\tFALSE\t/\tFALSE\t0\tJSESSIONID\t3d249e3d7c98ec35998e69e15d3e" 
[2] "archaeologydataservice.ac.uk\tFALSE\t/\tFALSE\t0\tSSOSESSIONID\t3d249e3d7c98ec35998e69e15d3e"
[3] "archaeologydataservice.ac.uk\tFALSE\t/\tFALSE\t0\tADSCOPYRIGHT\tYES"          

So it would probably be sufficient to set that last cookie from the start (indicating that we accept the copyright terms).
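For completeness, a sketch of that same idea with httr instead of RCurl, based only on the cookie list above (this is not part of the original answer):

library(httr)

appURL <- "http://archaeologydataservice.ac.uk/archiveDS/archiveDownload?t=arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf"

# send the copyright-acceptance cookie up front and save the response body
resp <- GET(appURL, set_cookies(ADSCOPYRIGHT = "YES"))
writeBin(content(resp, "raw"), "test3.pdf")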