Question

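I have been playing with this module (urllib2) for a while. Recently I managed to put together a simple HTTP status checker that checks the status code received for each URL in a given list and removes the URL if it does not return a 200 OK code.

The code is as follows: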

import urllib2

# urllist is the list of URLs to check (defined elsewhere).
# Iterate over a copy so that removing an item does not skip the next URL.
for p in list(urllist):
    req = urllib2.Request(p)
    try:
        resp = urllib2.urlopen(req)
    except urllib2.HTTPError as e:
        if e.code == 404:
            print str(p) + " returns 404 error (Not found). This URL will be removed from the list"
            urllist.remove(p)
        elif e.code == 400 or e.code == 401 or e.code == 403:
            print str(p) + " returns a 400 error (Bad request) or 401/403 error (Unauthorized/forbidden). This URL will be removed from the list"
            urllist.remove(p)
        elif e.code == 408:
            print str(p) + " returned a 408 error (request timeout). This URL may or may not be available soon, this URL will be kept in the list"
        elif e.code == 429:
            print str(p) + " returned a 429 error (too many requests). The script may have reached a request limit, abort and try again later"
        elif 500 <= e.code <= 511:
            print str(p) + " returned a 5xx error (server error). Servers may be unavailable at the moment. Please abort and try again later"
        elif 410 <= e.code <= 451 or e.code > 511:
            print str(p) + " has returned an unspecified http error. This URL will be removed from the list"
            urllist.remove(p)

    except urllib2.URLError as e:
        # No HTTP response at all (DNS failure, refused connection, ...)
        print str(p) + " returned an unspecified error. This URL will be removed from the list"
        urllist.remove(p)
    else:
        # 200
        body = resp.read()
        print str(p) + " returns a 200 status code (Ok). This URL exists."

The original code comes from this post.
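As an aside, the keep-or-remove decision in the branches above could be pulled out into a small pure function, which makes the policy easy to test without touching the network. This is only a sketch: `should_remove` and the sample `results` list are hypothetical names and data, and codes that the branches above do not mention (402 or 405, for example) are assumed to be removable here.

```python
def should_remove(code):
    """Return True if a URL with this HTTP status would be dropped.

    Hypothetical helper mirroring the policy in the question.
    """
    if code in (408, 429):
        return False  # temporary conditions: keep the URL
    if 500 <= code <= 511:
        return False  # server-side errors: keep and retry later
    if code >= 400:
        return True   # remaining 4xx codes and anything above 511
    return False      # 2xx/3xx: the URL is fine


# Made-up (url, status) pairs, only to show the filtering step.
results = [
    ("http://a.example/", 200),
    ("http://b.example/", 404),
    ("http://c.example/", 503),
]

# Build a new list instead of removing items from the one being looped over.
kept = [url for url, code in results if not should_remove(code)]
```

Building `kept` as a new list avoids modifying `urllist` while it is being iterated.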

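I tested this with bit.ly URLs, which are simple and not tedious to put into a list. Most of them return one HTTP status code or another, as expected. However, a few of them take more than three times as long to be accepted or removed by the script. One example is bit / 1 / 1da2, which pops up a warning when you visit it.

I checked several generated lists of links, and the only URLs the script has trouble with are the ones that carry such a warning. It tries to get the HTTP status code for about 2 minutes (I have not timed it) and then jumps to the next URL in the list without removing the link.

I think this could be solved in the URLError part of the script, but I am not sure.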
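One possible reason for the long wait is that `urlopen` is called without a `timeout`, so a slow host may block until the socket-level default gives up. Below is a minimal sketch of passing an explicit timeout and catching `socket.timeout` alongside `URLError`. The names `check_url` and `opener` and the 10-second value are assumptions for illustration, not part of the original script. The import fallback is there only so the sketch also runs on Python 3, where `urllib2` was split into `urllib.request` and `urllib.error`.

```python
import socket

try:  # Python 2, as used in the question
    from urllib2 import urlopen, HTTPError, URLError
except ImportError:  # Python 3 equivalents
    from urllib.request import urlopen
    from urllib.error import HTTPError, URLError


def check_url(url, timeout=10, opener=urlopen):
    """Return the HTTP status code, or None if no response was received.

    `opener` is injectable so the function can be exercised without a network.
    """
    try:
        resp = opener(url, timeout=timeout)
    except HTTPError as e:
        # Must be caught before URLError, because HTTPError subclasses it.
        return e.code
    except (URLError, socket.timeout):
        # DNS failure, refused connection, or the timeout expired.
        return None
    try:
        return resp.getcode()
    finally:
        resp.close()
```

With something like this, a `None` result could be treated the same way as the `URLError` branch in the script (remove the URL) or kept for a later retry, depending on what is wanted.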

urllib2 HTTP status not working with some links

0 answers: