Read timeout with Python requests or urllib: URL encoding problem?

Asked: 2015-09-29 18:20:19

Tags: python pdf python-requests urllib

I am trying to download a file with Python. I have tried both urllib and requests, and both give me a read timeout error. The file is located at: http://www.prociv.pt/cnos/HAI/Setembro/Incêndios%20Rurais%20-%20Histórico%20do%20Dia%2029SET.pdf

Using requests:

r = requests.get('http://www.prociv.pt/cnos/HAI/Setembro/Incêndios%20Rurais%20-%20Histórico%20do%20Dia%2029SET.pdf',timeout=60.0)

Using urllib:

urllib.urlretrieve('http://www.prociv.pt/cnos/HAI/Setembro/Incêndios%20Rurais%20-%20Histórico%20do%20Dia%2029SET.pdf','the.pdf')

I have also tried different variants of the URL.

Moreover, I can download the file with a browser, and with cURL using the following syntax:

curl http://www.prociv.pt/cnos/HAI/Setembro/Inc%C3%AAndios%20Rurais%20-%20Hist%C3%B3rico%20do%20Dia%2029SET.pdf

So I suspect it is an encoding issue, but I cannot seem to get it to work. Any suggestions?
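For reference, this is how I produce the percent-encoded form of the URL in Python (a small sketch assuming Python 3's urllib.parse; on Python 2 the equivalent function is urllib.quote):

from urllib.parse import quote

base = 'http://www.prociv.pt/cnos/HAI/Setembro/'
filename = 'Incêndios Rurais - Histórico do Dia 29SET.pdf'

# quote() percent-encodes the spaces and accented characters in the path,
# producing the same URL that works from cURL.
url = base + quote(filename)
print(url)
# http://www.prociv.pt/cnos/HAI/Setembro/Inc%C3%AAndios%20Rurais%20-%20Hist%C3%B3rico%20do%20Dia%2029SET.pdf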

Edit: reworded for clarity.

1 Answer:

Answer 0 (score: 2)

It looks like the server responds differently depending on the client User-Agent. If you specify a custom User-Agent header, the server responds with the PDF:

import requests
import shutil

url = 'http://www.prociv.pt/cnos/HAI/Setembro/Inc%C3%AAndios%20Rurais%20-%20Hist%C3%B3rico%20do%20Dia%2028SET.pdf'
headers = {'User-Agent': 'curl'}  # wink-wink
response = requests.get(url, headers=headers, stream=True)

if response.status_code == 200:
    with open('result.pdf', 'wb') as output:
        response.raw.decode_content = True  # let urllib3 undo any gzip/deflate transfer encoding
        shutil.copyfileobj(response.raw, output)

Demo:

>>> import requests
>>> url = 'http://www.prociv.pt/cnos/HAI/Setembro/Inc%C3%AAndios%20Rurais%20-%20Hist%C3%B3rico%20do%20Dia%2028SET.pdf'
>>> headers = {'User-Agent': 'curl'}  # wink-wink
>>> response = requests.get(url, headers=headers, stream=True)
>>> response.headers['content-type']
'application/pdf'
>>> response.headers['content-length']
'466191'
>>> response.raw.read(100)
'%PDF-1.5\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n<</Type/Catalog/Pages 2 0 R/Lang(pt-PT) /StructTreeRoot 37 0 R/MarkInfo<</'
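
If you would rather stay with urllib, the same header trick should work there too. A minimal sketch, assuming Python 3's urllib.request (your urllib.urlretrieve call looks like Python 2, where urllib2.Request plays the same role):

import shutil
from urllib.request import Request, urlopen

url = 'http://www.prociv.pt/cnos/HAI/Setembro/Inc%C3%AAndios%20Rurais%20-%20Hist%C3%B3rico%20do%20Dia%2028SET.pdf'

# Attach the custom User-Agent header before opening the URL.
request = Request(url, headers={'User-Agent': 'curl'})

with urlopen(request) as response, open('result.pdf', 'wb') as output:
    shutil.copyfileobj(response, output)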

My guess is that someone abused a Python script to download too many files from that server at once, and that the server now blocks requests based on the User-Agent header alone.
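
And if you end up fetching several of these daily reports, it may be worth setting the header once on a requests.Session so that every request carries it. A sketch along those lines, reusing the two report URLs from this page (the local filenames are arbitrary):

import shutil
import requests

session = requests.Session()
session.headers['User-Agent'] = 'curl'  # sent with every request made through this session

urls = [
    'http://www.prociv.pt/cnos/HAI/Setembro/Inc%C3%AAndios%20Rurais%20-%20Hist%C3%B3rico%20do%20Dia%2028SET.pdf',
    'http://www.prociv.pt/cnos/HAI/Setembro/Inc%C3%AAndios%20Rurais%20-%20Hist%C3%B3rico%20do%20Dia%2029SET.pdf',
]

for day, url in zip((28, 29), urls):
    response = session.get(url, stream=True, timeout=60.0)
    if response.status_code == 200:
        with open('historico_{}SET.pdf'.format(day), 'wb') as output:  # arbitrary local name
            response.raw.decode_content = True
            shutil.copyfileobj(response.raw, output)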