There is a problem with my code.
#!/usr/bin/env python3.1
import urllib.request;

# Disguise as a Mozilla browser on a Windows OS
userAgent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)';

URL = "http://www.example.com/img";
req = urllib.request.Request(URL, headers={'User-Agent' : userAgent});

# Counter for the filename.
i = 0;
while True:
    fname = str(i).zfill(3) + '.png';
    req.full_url = URL + fname;
    f = open(fname, 'wb');
    try:
        response = urllib.request.urlopen(req);
    except:
        break;
    else:
        f.write(response.read());
        i+=1;
        response.close();
    finally:
        f.close();
Something seems to go wrong when I create the urllib.request.Request object (called req). I create it with a URL that does not exist, and later change the URL to what it should be. I do this so that I can keep reusing the same urllib.request.Request object instead of creating a new one on every iteration. There is probably a mechanism for doing this in Python, but I'm not sure what it is.
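One possible mechanism might be to install an opener that always sends the custom User-Agent, so that each iteration can call urlopen() with a plain URL string and no Request object is needed at all. This is only a rough sketch of that idea, assuming the User-Agent header is the only thing that needs to be reused; the URL is the placeholder from above:

import urllib.request

# Install a global opener whose extra headers include the disguised User-Agent.
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')]
urllib.request.install_opener(opener)

# Every subsequent urlopen() call sends those headers (placeholder URL).
response = urllib.request.urlopen('http://www.example.com/img000.png')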
Edit: the error message is:
>>> response = urllib.request.urlopen(req);
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.1/urllib/request.py", line 121, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python3.1/urllib/request.py", line 356, in open
response = meth(req, response)
File "/usr/lib/python3.1/urllib/request.py", line 468, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python3.1/urllib/request.py", line 394, in error
return self._call_chain(*args)
File "/usr/lib/python3.1/urllib/request.py", line 328, in _call_chain
result = func(*args)
File "/usr/lib/python3.1/urllib/request.py", line 476, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
Edit 2: My solution is below. I probably should have done it this way from the start, since I knew it would work:
import urllib.request;

# Disguise as a Mozilla browser on a Windows OS
userAgent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)';

# Counter for the filename.
i = 0;
while True:
    fname = str(i).zfill(3) + '.png';
    URL = "http://www.example.com/img" + fname;
    f = open(fname, 'wb');
    try:
        req = urllib.request.Request(URL, headers={'User-Agent' : userAgent});
        response = urllib.request.urlopen(req);
    except:
        break;
    else:
        f.write(response.read());
        i+=1;
        response.close();
    finally:
        f.close();
Answer 0 (score: 5)
urllib2 is fine for small scripts that only need to make one or two network interactions, but if you are doing a lot more work you will probably find that urllib3, or requests (which is not coincidentally built on the former), better suits your needs. Your particular example might look like:
from itertools import count
import requests

HEADERS = {'user-agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
URL = "http://www.example.com/img%03d.png"

# with a session, we get keep alive
session = requests.Session()

for n in count():
    full_url = URL % n
    ignored, filename = full_url.rsplit('/', 1)
    with open(filename, 'wb') as outfile:
        response = session.get(full_url, headers=HEADERS)
        if not response.ok:
            break
        outfile.write(response.content)
Edit: If ordinary HTTP authentication works for you (as the 403 Forbidden response strongly suggests), you can add it to the requests.get call with the auth parameter, like so:
response = session.get(full_url, headers=HEADERS, auth=('username', 'password'))
Answer 1 (score: 0)
If you want to use a custom user agent for every request, you can subclass FancyURLopener.
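A minimal sketch of that approach (the class name and URL are illustrative assumptions): FancyURLopener sends its version attribute as the User-Agent header.

import urllib.request

class CustomUserAgentOpener(urllib.request.FancyURLopener):
    # The version attribute is what the opener sends as the User-Agent header.
    version = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'

opener = CustomUserAgentOpener()
# Placeholder URL from the question.
with open('000.png', 'wb') as f:
    f.write(opener.open('http://www.example.com/img000.png').read())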
Answer 2 (score: -2)
Don't break when you get an exception. Change
except:
    break
to
except:
    # Probably should log some debug information here.
    pass
This skips any problematic request, so it doesn't stop the whole process.
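A minimal sketch of that idea, under the assumptions that the filenames come from a bounded range (so a skipped request cannot make the loop run forever) and that the placeholder URL from the question is used:

import logging
import urllib.request
import urllib.error

logging.basicConfig(level=logging.DEBUG)

# Assumed bound on the number of images; the question's loop was unbounded.
for i in range(1000):
    fname = str(i).zfill(3) + '.png'
    try:
        response = urllib.request.urlopen('http://www.example.com/img' + fname)
    except urllib.error.HTTPError as exc:
        # Log and skip this file instead of stopping the whole download.
        logging.debug('skipping %s: %s', fname, exc)
        continue
    with open(fname, 'wb') as f:
        f.write(response.read())
    response.close()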