Question

I have scraped a list of shtml links. They are now saved in an .xlsx file.

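I have already looked for Excel macros, R code, Python code, Chrome extensions, and desktop programs, but I could not find anything that helped.

Each .shtml link points to a web page that has at least one .pdf in the center of the page, and that file needs to be downloaded.

Any help is appreciated!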

Answer 1

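The basic workflow is:

1. Use a CSS selector or XPath to locate the PDF download button.
2. Either use RSelenium to simulate the download action, or get the href attribute, request that link with rvest, and write the response to disk with writeBin().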
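As a rough illustration of the href route in step 2, here is a minimal sketch. It assumes the PDF is exposed as an ordinary `<a>` tag whose href ends in ".pdf", which may not hold for your pages. The helper name `extract_pdf_links` and the example.org URL are made up for the example, and the inline HTML stands in for one of your .shtml pages.

```r
library(rvest)
library(xml2)

# Hypothetical helper (not part of rvest): return the absolute URLs of all
# links on a parsed page whose href ends in ".pdf".
# Assumption: the download "button" is a plain <a href="...pdf"> element.
extract_pdf_links <- function(page, base_url) {
  hrefs <- html_attr(html_nodes(page, "a[href$='.pdf']"), "href")
  # Resolve relative hrefs against the URL of the page they came from.
  url_absolute(hrefs, base_url)
}

# A small inline page standing in for one of the .shtml pages.
demo_page <- read_html(
  '<html><body>
     <a href="/docs/a.pdf">Protocol</a>
     <a href="index.shtml">Home</a>
   </body></html>'
)

pdf_urls <- extract_pdf_links(demo_page, "https://example.org/section/page.shtml")
print(pdf_urls)
```

The selector is case-sensitive, so links ending in ".PDF" would be missed. If the real pages build the link with JavaScript, this approach will not see it and RSelenium is the likelier option.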

To download a PDF file, I will use a government form as an example:

PDF URL: https://www.uscis.gov/sites/default/files/files/form/i-765.pdf

library(rvest)
library(httr)

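# Request the PDF; the raw bytes are stored in session$response$content.
# Note: in rvest >= 1.0.0, html_session() is deprecated in favor of session().
session <- html_session("https://www.uscis.gov/sites/default/files/files/form/i-765.pdf")

# Save the PDF to test.pdf
writeBin(session$response$content, "test.pdf")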

Answer 2

That was very helpful!

install.packages("rvest")
install.packages("httr")
install.packages("readxl")
# The first argument of update.packages() is lib.loc, so name the package via oldPkgs
update.packages(oldPkgs = "tibble")

library(rvest)
library(httr)
library(readxl)

setwd("C:/Users/Andreas/Desktop/481064 A.F. - Master Thesis - Election Outcome Prediction/Full Repository Austrian Bundestag")
my_data <- read_excel("StenographischeProto.xlsx")
View(my_data)

session <- html_session("https://www.uscis.gov/sites/default/files/files/form/i-765.pdf")

# save pdf to test.pdf
writeBin(session$response$content,"test.pdf")
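To go from the single test download to the whole spreadsheet, something along these lines might work. It is only a sketch under several assumptions: each page exposes its PDFs as plain `<a>` links ending in ".pdf", the links sit in a column of `my_data` that is called `link` here (a guess, since the real column name is not shown), and the helper names `download_pdfs_from_page` and `download_all` are invented for the example.

```r
library(rvest)
library(httr)
library(xml2)

# Hypothetical helper: download every PDF linked from one .shtml page.
# Assumption: the PDFs are reachable through <a href="...pdf"> elements.
download_pdfs_from_page <- function(page_url, dest_dir = "pdfs") {
  dir.create(dest_dir, showWarnings = FALSE)
  page <- read_html(page_url)
  hrefs <- html_attr(html_nodes(page, "a[href$='.pdf']"), "href")
  pdf_urls <- url_absolute(hrefs, page_url)
  for (u in pdf_urls) {
    # basename() keeps the last path segment as the local file name.
    dest <- file.path(dest_dir, basename(u))
    resp <- GET(u, write_disk(dest, overwrite = TRUE))
    warn_for_status(resp)  # warn, but keep going, on HTTP errors
  }
  invisible(pdf_urls)
}

# Hypothetical driver: loop over all page URLs and skip pages that fail.
download_all <- function(page_urls, dest_dir = "pdfs") {
  for (page_url in page_urls) {
    tryCatch(
      download_pdfs_from_page(page_url, dest_dir),
      error = function(e) message("Skipping ", page_url, ": ", conditionMessage(e))
    )
    Sys.sleep(1)  # short pause between pages to avoid hammering the server
  }
}

# Usage (the column name `link` is an assumption about the spreadsheet):
# download_all(my_data$link)
```

Files with the same name on different pages would overwrite each other in this version, so a prefix derived from the page URL may be worth adding if that is a concern.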

How to scrape or download PDFs from a list of shtml links?