How to download a PDF from a dynamic JavaScript page with wget or BS4

Time: 2017-03-22 20:05:49

Tags: javascript python wget python-3.5

On my phone, when I visit the URL, the browser automatically downloads a PDF. I am trying to retrieve or download that PDF with Python:

import wget

url = [
    'https://careers.ottawa.ca/sap/bc/webdynpro/sap/hrrcf_a_posting_apply?PARAM=cG9zdF9pbnN0X2d1aWQ9NTc3QkFCNjBFQ0REMURGQ0UxMDAwMDAwQzBBODI0MTAmY2FuZF90eXBlPUVYVA%3d%3d&sap-client=210&sap-language=EN&sap-accessibility=X#',
    'https://careers.ottawa.ca/sap/bc/webdynpro/sap/hrrcf_a_posting_apply?PARAM=cG9zdF9pbnN0X2d1aWQ9NThDQTM2RTg4QzBDMjcxQUUxMDAwMDAwQzBBODI0MTAmY2FuZF90eXBlPUVYVA%3d%3d&sap-client=210&sap-language=EN&sap-accessibility=X#',
]
filename = wget.download(url[0])

In the page source (Inspect Element), I noticed this tag:

<embed id="plugin" type="application/x-google-chrome-pdf" src="https://careers.ottawa.ca/sap(bD1lbiZjPTIxMA==)/bc/bsp/sap/hrrcf_wd_dovru/application.do?PARAM=cmNmdHlwZT1waW5zdCZwaW5zdD01NzdCQUI2MEVDREQxREZDRTEwMDAwMDBDMEE4MjQxMA%3d%3d" stream-url="blob:chrome-extension://mhjfbmdgcfjbbpaeojofohoefgiehjai/906CF02E-9437-45E3-8B77-BEB7950B4BE1" headers="cache-control: no-cache
content-disposition: inline; filename=City_of_Ottawa_JobPoster.pdf
content-length: 30721
content-type: application/pdf
pragma: no-cache
server: SAP NetWeaver Application Server / ABAP 701
">
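
The PARAM values in the posting URL and in the src attribute above look like URL-escaped base64. The sketch below decodes them with the standard library only, to show how the two URLs relate. The helper name decode_param and the variable names are made up for illustration, and nothing here is confirmed by the site itself.

```python
import base64
from urllib.parse import parse_qsl, unquote, urlsplit


def decode_param(url):
    """Return the key/value pairs hidden in a URL's base64-encoded PARAM value."""
    for part in urlsplit(url).query.split("&"):
        name, _, value = part.partition("=")
        if name == "PARAM":
            # %3d is a URL-escaped "=", the base64 padding character.
            decoded = base64.b64decode(unquote(value)).decode("ascii")
            return dict(parse_qsl(decoded))
    return {}


# First posting URL and the embed src, copied from the question.
posting_url = (
    "https://careers.ottawa.ca/sap/bc/webdynpro/sap/hrrcf_a_posting_apply"
    "?PARAM=cG9zdF9pbnN0X2d1aWQ9NTc3QkFCNjBFQ0REMURGQ0UxMDAwMDAwQzBBODI0MTAm"
    "Y2FuZF90eXBlPUVYVA%3d%3d&sap-client=210&sap-language=EN&sap-accessibility=X#"
)
embed_src = (
    "https://careers.ottawa.ca/sap(bD1lbiZjPTIxMA==)/bc/bsp/sap/hrrcf_wd_dovru/"
    "application.do?PARAM=cmNmdHlwZT1waW5zdCZwaW5zdD01NzdCQUI2MEVDREQxREZDRTEw"
    "MDAwMDBDMEE4MjQxMA%3d%3d"
)

print(decode_param(posting_url))
print(decode_param(embed_src))
```

The posting URL decodes to post_inst_guid and cand_type fields, and the embed src decodes to rcftype and pinst fields, with pinst equal to the post_inst_guid of the posting. That suggests the PDF URL might be derivable from the posting URL, although a single pair of URLs is not enough to be sure the pattern holds for every posting.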

In the browser, I can view the PDF when I visit the URL in the src attribute, but I cannot download it with wget:

url3= 'https://careers.ottawa.ca/sap(bD1lbiZjPTIxMA==)/bc/bsp/sap/hrrcf_wd_dovru/application.do?PARAM=cmNmdHlwZT1waW5zdCZwaW5zdD01NzdCQUI2MEVDREQxREZDRTEwMDAwMDBDMEE4MjQxMA%3d%3d'
filename = wget.download(url3)

I get this error:

c:\users\gmondesi\appdata\local\continuum\miniconda3\lib\urllib\request.py in http_error_default(self, req, fp, code, msg, hdrs)
    587 class HTTPDefaultErrorHandler(BaseHandler):
    588     def http_error_default(self, req, fp, code, msg, hdrs):
--> 589         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    590
    591 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 500: Internal Server Error
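
One possible cause of the 500, not confirmed anywhere in this question, is that the server expects cookies or a browser-like User-Agent that a bare wget.download call does not send. The sketch below uses only the standard library to visit the posting page first and then request the PDF with the same cookie jar. The function names, the User-Agent value, and the output file name are placeholders, and the posting page is JavaScript-driven, so any state it sets from script would not be reproduced this way.

```python
import http.cookiejar
import urllib.request


def make_session_opener(user_agent="Mozilla/5.0"):
    """Build an opener that keeps cookies between requests, like a browser tab."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # Replace the default "Python-urllib" agent; the value is only a placeholder.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


def fetch_pdf(opener, page_url, pdf_url, out_path):
    """Open page_url first to collect any cookies, then save pdf_url to out_path."""
    opener.open(page_url).close()
    with opener.open(pdf_url) as resp, open(out_path, "wb") as fh:
        fh.write(resp.read())
    return out_path


# Hypothetical usage with the variables from the question (needs network access):
# opener, jar = make_session_opener()
# fetch_pdf(opener, url[0], url3, "City_of_Ottawa_JobPoster.pdf")
```

If the second request still returns 500, the cookie assumption is probably wrong, and comparing the request headers shown in the browser's network tab with what the script sends would be the next thing to check.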

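Is there a way to download the PDF with BeautifulSoup? I cannot find a static element in the page source that I could rely on to download the PDF. The embed tag can only be found through the browser's Inspect option, not in the static page source.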
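
As far as I know, an embed element of type application/x-google-chrome-pdf is generated by Chrome's built-in PDF viewer when the response itself is a PDF, which would explain why it never shows up in HTML that BeautifulSoup can parse. Under that assumption, HTML parsing is not needed: it is enough to check that a response body is a PDF and to read the file name from the Content-Disposition header shown in the embed tag. A minimal sketch, with hypothetical helper names:

```python
from email.message import Message


def filename_from_content_disposition(value):
    """Extract the filename parameter from a Content-Disposition header value."""
    msg = Message()
    msg["Content-Disposition"] = value
    return msg.get_filename()  # None when no filename parameter is present


def looks_like_pdf(body):
    """PDF files begin with the magic bytes %PDF-."""
    return body.startswith(b"%PDF-")


# Header value copied from the embed tag in the question.
header = "inline; filename=City_of_Ottawa_JobPoster.pdf"
print(filename_from_content_disposition(header))
print(looks_like_pdf(b"%PDF-1.4\n"))
```

Checking the first bytes guards against saving an HTML error page under a .pdf name when the server answers with something other than the document.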

0 answers:

No answers