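I have been playing with this module (urllib2) for a while. Recently I managed to put together a simple HTTP status checker that checks the status code received for each URL in a given list and removes the URL if it does not return a 200 OK code.
The code is as follows: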
import urllib2

# Iterate over a copy so that removing entries does not skip the next URL.
for p in list(urllist):
    req = urllib2.Request(p)
    try:
        resp = urllib2.urlopen(req)
    except urllib2.HTTPError as e:
        # The server answered, but with an error status code.
        if e.code == 404:
            print str(p) + " returns 404 error (Not found). This URL will be removed from the list"
            urllist.remove(p)
        elif e.code in (400, 401, 403):
            print str(p) + " returns a 400 error (Bad request) or 401/403 error (Unauthorized/forbidden). This URL will be removed from the list"
            urllist.remove(p)
        elif e.code == 408:
            print str(p) + " returned a 408 error (request timeout). This URL may or may not be available soon, this URL will be kept in the list"
        elif e.code == 429:
            print str(p) + " returned a 429 error (too many requests). The script may have reached a request limit, abort and try again later"
        elif 500 <= e.code <= 511:
            print str(p) + " returned a 5xx error (server error). Servers may be unavailable at the moment. Please abort and try again later"
        elif 410 <= e.code <= 451 or e.code > 511:
            print str(p) + " has returned an unspecified http error. This URL will be removed from the list"
            urllist.remove(p)
    except urllib2.URLError as e:
        # No HTTP response at all (DNS failure, refused connection, ...).
        print str(p) + " returned an unspecified error. This URL will be removed from the list"
        urllist.remove(p)
    else:
        # 200
        body = resp.read()
        print str(p) + " returns a 200 status code (Ok). This URL exists."
The original code comes from this post.
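Since the loop removes entries from urllist while walking through it, it is worth being careful about how the list is iterated. Below is a minimal sketch of the pitfall, using made-up placeholder strings instead of real URLs; the helper name `prune` and the `is_ok` callback are hypothetical and not part of the script.

```python
def prune(urls, is_ok):
    """Return a new list holding only the entries for which is_ok(url) is true."""
    return [u for u in urls if is_ok(u)]


# Placeholder entries; "bad1" and "bad2" stand in for URLs that should go.
urls = ["a", "bad1", "bad2", "b"]

# Removing from the list being iterated shifts the remaining items left,
# so the entry right after each removed one is never visited.
mutated = list(urls)
for u in mutated:
    if u.startswith("bad"):
        mutated.remove(u)

# Building a new list avoids the problem entirely.
kept = prune(urls, lambda u: not u.startswith("bad"))

print(mutated)  # "bad2" is still there
print(kept)
```

Iterating over a copy of the list, for example `for p in list(urllist):`, should have the same effect as building a new list while keeping the rest of the loop unchanged.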
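I tested this with bit.ly URLs, which are short and not tedious to put in a list. Most of them return one HTTP status code or another, as expected. Some of them, however, take more than three times as long to be accepted or removed by the script; one example is bit / 1 / 1da2, which pops up a warning when you open it.
I went through several lists of generated links, and the only URLs the script has trouble with are the ones that carry a warning. It tries to get the HTTP status code for roughly 2 minutes (I have not timed it), then jumps to the next URL in the list without removing the link from the list.
I think this could be solved in the URLError part of this script, but I am not sure.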
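One thing that might be worth trying is the `timeout` argument of `urlopen`, so that a slow URL is given up on after a few seconds instead of minutes. The sketch below assumes the delay comes from the remote side being slow to answer, which I have not confirmed. The function name `check_url`, the `opener` parameter and the 10-second default are my own placeholders; the import fallback is only there so the same sketch also runs on Python 3.

```python
import socket

try:  # Python 2, as in the script above
    import urllib2 as request_lib
    from urllib2 import HTTPError, URLError
except ImportError:  # Python 3 equivalent
    import urllib.request as request_lib
    from urllib.error import HTTPError, URLError


def check_url(url, timeout=10, opener=request_lib.urlopen):
    """Return a short label for url, giving up after `timeout` seconds.

    `opener` is a hypothetical hook so the function can be exercised
    without touching the network; by default it is urlopen itself.
    """
    try:
        resp = opener(url, timeout=timeout)
    except HTTPError as e:
        # HTTPError is a subclass of URLError, so it has to be caught first.
        return "http %d" % e.code
    except URLError as e:
        # A timeout while connecting is usually wrapped in a URLError.
        if isinstance(e.reason, socket.timeout):
            return "timeout"
        return "url error"
    except socket.timeout:
        # A timeout while reading the response can be raised directly.
        return "timeout"
    else:
        resp.close()
        return "ok"
```

In the loop this could be used as `status = check_url(p)`, and the script would then decide whether a "timeout" result means the URL is removed or kept for a later retry.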