Question

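I am trying to read the text of United Nations Security Council (UNSC) resolutions into R. The UN maintains an online archive of all UNSC resolutions in PDF format (here). So, in theory, this should be doable.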

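If I click the hyperlink for a specific year and then click the link for a specific document (e.g., this one), I can see the PDF in my browser. When I try to download that PDF by pointing download.file at the link in the URL bar, it seems to work. When I then try to read the contents of that file into R using the pdf_text function from the pdftools package, however, I get a bunch of error messages.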

Here is what I tried that failed. If you run it, you will see the error messages I am talking about.

library(pdftools)
pdflink <- "http://www.un.org/en/ga/search/view_doc.asp?symbol=S/RES/2341(2017)"
tmp <- tempfile()
download.file(pdflink, tmp, mode = "wb")
doc <- pdf_text(tmp)

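What am I missing? I think it has to do with the link address for the downloadable version of these files being different from the one shown in the browser, but I can't figure out how to get the path to the former. I tried right-clicking the download icon, using the "Inspect" option in Chrome to see the URL identified as 'src' there (this link), and pointing the rest of my process at that. Again, the download.file part executes, but when I run pdf_text I get the same error messages. I also tried a) changing the mode part of the call to download.file and b) appending ".pdf" to the end of the link path, but neither helped.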

Answer 1

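The PDF you want to download sits in an iframe of the main page, so the link you are downloading contains only HTML. You need to follow the link in the iframe to get the actual link to the PDF. Before you reach the direct download link, you have to hop through a few pages to pick up a cookie and a temporary URL.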
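
A quick way to confirm this diagnosis is to look at the first bytes of the downloaded file: a real PDF starts with the signature "%PDF", while an HTML page does not. The following is only a minimal sketch; is_pdf_file is a hypothetical helper name, and the two temporary files are made-up stand-ins so that the check can be run offline.

```r
# Hypothetical helper (not part of pdftools): TRUE if the file at `path`
# starts with the PDF signature "%PDF".
is_pdf_file <- function(path) {
  header <- readBin(path, what = "raw", n = 4L)
  identical(header, charToRaw("%PDF"))
}

# Stand-in for what download.file() saved from the viewer page:
# an HTML document, even though the file name ends in ".pdf".
html_file <- tempfile(fileext = ".pdf")
writeLines("<html><frameset><frame name='mainFrame' src='x'></frameset></html>",
           html_file)

# Stand-in for a genuine PDF: only the leading signature matters here.
pdf_like_file <- tempfile()
writeBin(charToRaw("%PDF-1.4\n"), pdf_like_file)

is_pdf_file(html_file)      # FALSE
is_pdf_file(pdf_like_file)  # TRUE
```

Running the same check on the file written by download.file in the question should show whether pdf_text is being handed HTML rather than a PDF.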

Here is an example that follows those redirects for the link you posted:

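library(rvest)     # html_session(), jump_to(), back(); renamed session(), session_jump_to(), session_back() in rvest >= 1.0
library(pdftools)  # pdf_text()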

s <- html_session("http://www.un.org/en/ga/search/view_doc.asp?symbol=S/RES/2341(2017)")
#get the link in the mainFrame iframe holding the pdf
frame_link <- s %>% read_html() %>% html_nodes(xpath="//frame[@name='mainFrame']") %>%
  html_attr("src")

#go to that link
s <- s %>% jump_to(url=frame_link)

#there is a meta refresh with a link to another page, get it and go there
temp_url <- s %>% read_html() %>%
  html_nodes("meta") %>%
  html_attr("content") %>% {gsub(".*URL=","",.)} 

s <- s %>% jump_to(url=temp_url)

#get the LtpaToken cookie then come back
s %>% jump_to(url="https://documents-dds-ny.un.org/prod/ods_mother.nsf?Login&Username=freeods2&Password=1234") %>%
  back() 

#get the pdf link and download it
pdf_link <- s %>% read_html() %>% 
  html_nodes(xpath="//meta[@http-equiv='refresh']") %>%
  html_attr("content") %>% {gsub(".*URL=","",.)}

s <- s %>% jump_to(pdf_link)
tmp <- tempfile()
writeBin(s$response$content,tmp)
doc <- pdf_text(tmp)
doc
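
The meta-refresh step above can be tried in isolation, without touching the UN servers. The sketch below parses a made-up HTML snippet (the example.org URL is an assumption for illustration, not a real document address) and pulls the redirect target out of the content attribute in the same way as the code above.

```r
library(rvest)

# Made-up page imitating a meta-refresh redirect; the URL is illustrative only.
page <- read_html('<html><head>
  <meta charset="utf-8">
  <meta http-equiv="refresh" content="0; URL=https://example.org/tmp/doc.pdf">
</head><body></body></html>')

# Select only the refresh tag, so other <meta> tags are ignored.
refresh <- page %>%
  html_nodes(xpath = "//meta[@http-equiv='refresh']") %>%
  html_attr("content")

# Drop everything up to and including "URL=" to keep the redirect target.
target <- sub(".*URL=", "", refresh)
target
```

Selecting on http-equiv='refresh' rather than on every meta tag keeps the result to a single string even when the page carries other meta tags, such as the charset declaration in this snippet.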

Scraping a PDF from an iframe into R
