I am trying to download a PDF with Python using the requests library. The code below works for some PDF documents but fails for a few others.
from pathlib import Path
import requests
filename = Path('c:/temp.pdf')
url = 'https://www.rolls-royce.com/~/media/Files/R/Rolls-Royce/documents/investors/annual-reports/rr-full%20annual%20report--tcm92-55530.pdf'
response = requests.get(url, verify=False)  # verify=False disables TLS certificate checks
filename.write_bytes(response.content)
Below is the exact response (response.content). However, I can download the same document with the Chrome browser without any error.
b'<HTML><HEAD>\n<TITLE>Access Denied</TITLE>\n</HEAD><BODY>\n<H1>Access Denied</H1>\n \nYou don\'t have permission to access "http://www.rolls-royce.com/%7e/media/Files/R/Rolls-Royce/documents/investors/annual-reports/rr-full%20annual%20report--tcm92-55530.pdf" on this server.<P>\nReference #18.36ad4d68.1562842755.6294c42\n</BODY>\n</HTML>\n'
Is there any way to get around this?
Answer 0 (score: 0)
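You are getting 403 Forbidden because, by default, requests sends the header User-Agent: python-requests/2.19.1 (the version number depends on your installed release), and the server rejects requests that carry it.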
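To see which User-Agent your own installation sends, a quick check along these lines may help. It is only a sketch: it relies on requests.utils.default_user_agent and the default headers of a fresh Session, and the version printed will differ from the 2.19.1 quoted above.

```python
import requests
from requests.utils import default_user_agent

# The string requests puts in the User-Agent header when none is supplied,
# e.g. "python-requests/<installed version>".
print(default_user_agent())

# A fresh Session carries the same value in its default headers.
print(requests.Session().headers["User-Agent"])
```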
Copy the value of this header from your browser and send it with the request; the server should then accept it.
For example:
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 YaBrowser/19.6.1.153 Yowser/2.5 Safari/537.36'}
url = 'https://www.rolls-royce.com/~/media/Files/R/Rolls-Royce/documents/investors/annual-reports/rr-full%20annual%20report--tcm92-55530.pdf'
r = requests.get(url, headers=headers)
print(r.status_code) # 200
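Putting the two pieces together, a download helper might look roughly like the sketch below. The names download_pdf, looks_like_pdf and BROWSER_HEADERS are hypothetical, the User-Agent is a shortened form of the one above, and the "%PDF-" prefix check is only a heuristic to avoid saving an "Access Denied" HTML page as a .pdf file.

```python
from pathlib import Path

import requests

# Assumption: any current browser User-Agent should do; this one is a
# shortened form of the string used in the answer above.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
    )
}


def looks_like_pdf(content: bytes) -> bool:
    """Heuristic: PDF files normally start with the bytes '%PDF-'."""
    return content.startswith(b"%PDF-")


def download_pdf(url, dest, session=None, timeout=30):
    """Fetch url with a browser-like User-Agent and write the body to dest.

    Raises requests.HTTPError for 4xx/5xx responses and ValueError when
    the body does not look like a PDF (for example an HTML error page).
    """
    session = session or requests.Session()
    response = session.get(url, headers=BROWSER_HEADERS, timeout=timeout)
    response.raise_for_status()  # a 403 raises here instead of being saved
    if not looks_like_pdf(response.content):
        raise ValueError("response body does not look like a PDF")
    path = Path(dest)
    path.write_bytes(response.content)
    return path
```

Calling download_pdf(url, 'c:/temp.pdf') with the url from the question would then either write the file or raise, rather than silently storing the error page.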