On my phone, when I visit the URL, the browser automatically downloads a PDF. I am trying to retrieve or download that PDF using Python:
import wget

# Job posting pages; visiting one in a mobile browser triggers the PDF download.
url = [
    'https://careers.ottawa.ca/sap/bc/webdynpro/sap/hrrcf_a_posting_apply?PARAM=cG9zdF9pbnN0X2d1aWQ9NTc3QkFCNjBFQ0REMURGQ0UxMDAwMDAwQzBBODI0MTAmY2FuZF90eXBlPUVYVA%3d%3d&sap-client=210&sap-language=EN&sap-accessibility=X#',
    'https://careers.ottawa.ca/sap/bc/webdynpro/sap/hrrcf_a_posting_apply?PARAM=cG9zdF9pbnN0X2d1aWQ9NThDQTM2RTg4QzBDMjcxQUUxMDAwMDAwQzBBODI0MTAmY2FuZF90eXBlPUVYVA%3d%3d&sap-client=210&sap-language=EN&sap-accessibility=X#',
]

# Try to download the first posting directly.
filename = wget.download(url[0])
In the page source (Inspect Element), I noticed this tag:
<embed id="plugin" type="application/x-google-chrome-pdf" src="https://careers.ottawa.ca/sap(bD1lbiZjPTIxMA==)/bc/bsp/sap/hrrcf_wd_dovru/application.do?PARAM=cmNmdHlwZT1waW5zdCZwaW5zdD01NzdCQUI2MEVDREQxREZDRTEwMDAwMDBDMEE4MjQxMA%3d%3d" stream-url="blob:chrome-extension://mhjfbmdgcfjbbpaeojofohoefgiehjai/906CF02E-9437-45E3-8B77-BEB7950B4BE1" headers="cache-control: no-cache
content-disposition: inline; filename=City_of_Ottawa_JobPoster.pdf
content-length: 30721
content-type: application/pdf
pragma: no-cache
server: SAP NetWeaver Application Server / ABAP 701
">
In the browser, when I visit the URL contained in the src attribute, I can view the PDF, but I cannot download it with wget:
url3 = 'https://careers.ottawa.ca/sap(bD1lbiZjPTIxMA==)/bc/bsp/sap/hrrcf_wd_dovru/application.do?PARAM=cmNmdHlwZT1waW5zdCZwaW5zdD01NzdCQUI2MEVDREQxREZDRTEwMDAwMDBDMEE4MjQxMA%3d%3d'
filename = wget.download(url3)
I get this error:
c:\users\gmondesi\appdata\local\continuum\miniconda3\lib\urllib\request.py in http_error_default(self, req, fp, code, msg, hdrs)
587 class HTTPDefaultErrorHandler(BaseHandler):
588 def http_error_default(self, req, fp, code, msg, hdrs):
--> 589 raise HTTPError(req.full_url, code, msg, hdrs, fp)
590
591 class HTTPRedirectHandler(BaseHandler):
HTTPError: HTTP Error 500: Internal Server Error
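One guess about the 500, which I have not been able to confirm against this server: the SAP application may expect the session cookies a browser picks up on the posting page, or a browser-like User-Agent. Below is a minimal standard-library sketch of that idea, which visits the posting page first and then requests the PDF through the same cookie-aware opener. The names `build_opener_with_cookies`, `fetch`, `looks_like_pdf` and `download_pdf` are made up for illustration.

```python
import http.cookiejar
import urllib.request

# Browser-like header; whether the server actually requires it is an assumption.
HEADERS = {"User-Agent": "Mozilla/5.0"}


def build_opener_with_cookies():
    """Return an opener that stores cookies and resends them, plus its cookie jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch(opener, url, timeout=30):
    """GET a URL through the opener and return the raw response bytes."""
    request = urllib.request.Request(url, headers=HEADERS)
    with opener.open(request, timeout=timeout) as response:
        return response.read()


def looks_like_pdf(data):
    """True if the payload starts with the PDF magic bytes."""
    return data.startswith(b"%PDF-")


def download_pdf(page_url, pdf_url, out_path):
    """Visit page_url first (to collect cookies), then save pdf_url to out_path."""
    opener, _jar = build_opener_with_cookies()
    fetch(opener, page_url)  # body is ignored; only the cookies matter here
    data = fetch(opener, pdf_url)
    if not looks_like_pdf(data):
        raise ValueError("server did not return a PDF")
    with open(out_path, "wb") as f:
        f.write(data)
    return out_path
```

It would be called as `download_pdf(url[0], url3, 'City_of_Ottawa_JobPoster.pdf')`. If the server still answers 500, the cookie guess is wrong and something else is missing from the request.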
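Is there a way to download the PDF using BeautifulSoup? I cannot find a static element in the page source that I can rely on to download the PDF. The embed tag only shows up in the browser's Inspect (page source) view.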
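One observation that might make the embed tag unnecessary: the PARAM values look like URL-encoded Base64. Decoding the first posting URL gives `post_inst_guid=577BAB60ECDD1DFCE1000000C0A82410&cand_type=EXT`, and decoding the src URL gives `rcftype=pinst&pinst=577BAB60ECDD1DFCE1000000C0A82410`, so the PDF URL can apparently be built from the posting URL alone. The sketch below assumes that pattern holds for other postings, which I have only checked for this one pair. `decode_param` and `pdf_url_for` are hypothetical helper names, and the `application.do` path is copied from the src attribute above.

```python
import base64
from urllib.parse import parse_qs, quote, urlsplit

# Path copied from the embed tag's src attribute; assumed identical for every posting.
PDF_BASE = ("https://careers.ottawa.ca/sap(bD1lbiZjPTIxMA==)"
            "/bc/bsp/sap/hrrcf_wd_dovru/application.do")


def decode_param(url):
    """Return the Base64-decoded PARAM query value of a URL as a dict."""
    query = parse_qs(urlsplit(url).query)  # parse_qs also undoes the %3d escapes
    # parse_qs turns '+' into a space, so restore it before Base64-decoding.
    token = query["PARAM"][0].replace(" ", "+")
    raw = base64.b64decode(token).decode("ascii")
    return {key: values[0] for key, values in parse_qs(raw).items()}


def pdf_url_for(posting_url):
    """Build the PDF URL for a posting page URL from its post_inst_guid."""
    guid = decode_param(posting_url)["post_inst_guid"]
    payload = "rcftype=pinst&pinst=" + guid
    encoded = base64.b64encode(payload.encode("ascii")).decode("ascii")
    return PDF_BASE + "?PARAM=" + quote(encoded, safe="")
```

For the first URL in the list, `pdf_url_for(url[0])` reproduces the src URL from the embed tag, apart from `%3D` being upper case in the padding escape. Whether the server will then serve that URL to a script is a separate question.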