I am trying to download PDFs from the internet. I have a list of links, and I need to fetch the PDF behind each one.
This is the code I have:
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options

url = 'http://webapps.rrc.texas.gov/CMPL/viewPdfReportFormAction.do?method=cmplG1FormPdf&packetSummaryId=2928'
opts = Options()
opts.headless = True
assert opts.headless  # Operating in headless mode
browser_detail = Firefox(options=opts)
browser_detail.get(url)
print(browser_detail.page_source)
with open('temp/metadata.pdf', 'wb') as fd:
    fd.write(browser_detail.page_source)
browser_detail.close()
I also tried requests and got the same response:
import requests

url = 'http://webapps.rrc.texas.gov/CMPL/viewPdfReportFormAction.do?method=cmplG1FormPdf&packetSummaryId=2928'
r = requests.get(url, stream=True)
with open('temp/metadata.pdf', 'wb') as fd:
    for chunk in r.iter_content(2000):
        fd.write(chunk)
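The problem is that if I open the URL in a browser, the PDF is displayed, but when I request it from this code, page_source is HTML. That makes me think some redirect or server-side processing is involved.
How can I download the PDF? Thanks!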
Answer 0 (score: 2)
I was able to pull down the PDF file using requests.
The page checks for a suitable User-Agent header, so I set it to that of Chrome on macOS.
h = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36",
}
r = requests.get(url, stream=True, headers=h)
It worked:
tmp/project/1> file metadata.pdf
metadata.pdf: PDF document, version 1.4
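Putting the pieces together, here is a minimal sketch of the whole download, assuming the User-Agent check is the only thing the server cares about. The names `BROWSER_HEADERS`, `looks_like_pdf` and `download_pdf`, the 30-second timeout and the chunk size are my own choices for illustration, not anything the site requires. It also checks that the body starts with the `%PDF-` signature, so an HTML page is not silently saved under a .pdf name.

```python
from pathlib import Path

import requests

# Browser-like headers taken from the answer above.
BROWSER_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,"
              "image/webp,image/apng,*/*;q=0.8",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/72.0.3626.119 Safari/537.36",
}


def looks_like_pdf(first_bytes: bytes) -> bool:
    """Return True if the data starts with the PDF signature."""
    return first_bytes.startswith(b"%PDF-")


def download_pdf(url, dest, session=None, chunk_size=8192):
    """Stream url to dest; raise ValueError if the body is not a PDF.

    session is optional (hypothetical parameter): pass a requests.Session
    to reuse connections when downloading many links.
    """
    http = requests if session is None else session
    with http.get(url, headers=BROWSER_HEADERS, stream=True, timeout=30) as r:
        r.raise_for_status()
        chunks = r.iter_content(chunk_size)
        first = next(chunks, b"")
        if not looks_like_pdf(first):
            # Most likely the HTML page described in the question.
            raise ValueError(f"response from {url} does not look like a PDF")
        dest = Path(dest)
        dest.parent.mkdir(parents=True, exist_ok=True)
        with dest.open("wb") as fd:
            fd.write(first)
            for chunk in chunks:
                fd.write(chunk)
    return dest


if __name__ == "__main__":
    url = ("http://webapps.rrc.texas.gov/CMPL/viewPdfReportFormAction.do"
           "?method=cmplG1FormPdf&packetSummaryId=2928")
    print(download_pdf(url, "temp/metadata.pdf"))
```

For a list of links, you could create one `requests.Session()` and call `download_pdf(link, target, session=s)` in a loop, catching `ValueError` for any link that still returns HTML.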
Answer 1 (score: -3)
with open('temp/metadata.pdf', 'wb') as fd:
    fd.write(r.content)