Using httplib to check whether a URL returns a certain page?

Asked: 2014-03-02 22:16:06

Tags: python httplib

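I'm going through several hundred bit.ly links to see whether they have already been used to shorten a link. If a link has not been used, it returns this page

How can I iterate over a list of links to check which ones do not return this page?

I tried the head method used in this question, but of course it always returns true.

I looked into the HEAD method, but found that it never returns any data: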

>>> import httplib
>>> conn = httplib.HTTPConnection("www.python.org")
>>> conn.request("HEAD","/index.html")
>>> res = conn.getresponse()
>>> print res.status, res.reason
200 OK
>>> data = res.read()
>>> print len(data)
0
>>> data == ''
True

I'm stumped; any help would be great.

2 answers:

Answer 0 (score: 1)

If bit.ly returns a 404 HTTP status code for links that have not been shortened:

#!/usr/bin/env python
from httplib import HTTPConnection
from urlparse import urlsplit

urls = ["http://bit.ly/NKEIV8", "http://bit.ly/1niCdh9"]
for url in urls:
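    # urlsplit() returns (scheme, netloc, path, query, fragment)
    host, path = urlsplit(url)[1:3]
    conn = HTTPConnection(host)
    conn.request("HEAD", path)  # HEAD: status line and headers only, no body
    r = conn.getresponse()
    if r.status != 404:  # under the assumption above, the link is in use
        print("{r.status} {url}".format(**vars()))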
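
For readers on Python 3, where httplib became http.client and urlparse became urllib.parse, the same status check might look roughly like the sketch below. The helper names request_target and head_status, the timeout default, and the "/" fallback for an empty path are assumptions added for illustration; they are not part of the answer above.

```python
from http.client import HTTPConnection, HTTPSConnection
from urllib.parse import urlsplit


def request_target(url):
    """Split a URL into (scheme, host, target) for use with http.client."""
    parts = urlsplit(url)
    # An empty path is not a valid request target, so fall back to "/".
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return parts.scheme, parts.netloc, target


def head_status(url, timeout=10):
    """Send a HEAD request and return only the status code."""
    scheme, host, target = request_target(url)
    # Pick the connection class from the URL scheme.
    conn_class = HTTPSConnection if scheme == "https" else HTTPConnection
    conn = conn_class(host, timeout=timeout)
    try:
        conn.request("HEAD", target)
        # A HEAD response has a status line and headers but no body.
        return conn.getresponse().status
    finally:
        conn.close()
```

With this in place, head_status("http://bit.ly/NKEIV8") != 404 would flag a link as already in use, still assuming bit.ly answers 404 for unused short links.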

Unrelated: to speed up the checks, you could use multiple threads:

#!/usr/bin/env python
from httplib import HTTPConnection
from multiprocessing.dummy import Pool # use threads
from urlparse import urlsplit

urls = ["http://bit.ly/NKEIV8", "http://bit.ly/1niCdh9"]

def getstatus(url):
    """Return (url, status, error); exactly one of status/error is None."""
    try:
        host, path = urlsplit(url)[1:3]
        conn = HTTPConnection(host)
        conn.request("HEAD", path)
        r = conn.getresponse()
    except Exception as e:
        return url, None, str(e) # error
    else:
        return url, r.status, None

p = Pool(20) # use 20 concurrent connections
# imap_unordered yields each result as soon as its request finishes
for url, status, error in p.imap_unordered(getstatus, urls):
    if status != 404:
        print("{status} {url} {error}".format(**vars()))
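
On Python 3, one possible alternative to multiprocessing.dummy is the standard concurrent.futures module. The sketch below is only an illustration of the same idea: the name check_all and the get_status parameter are hypothetical, and get_status stands for any callable that takes a URL and returns a status code (for example, a Python 3 port of getstatus).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def check_all(urls, get_status, max_workers=20):
    """Yield (url, status, error) tuples as each check finishes.

    get_status is any callable that takes a URL and returns a status code.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the URL it was submitted for.
        futures = {pool.submit(get_status, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                yield url, future.result(), None
            except Exception as exc:
                # Report the failure instead of aborting the whole run.
                yield url, None, str(exc)
```

Passing the status function in as an argument keeps the threading logic separate from the network code, so it can be tried out with a stub before pointing it at real links.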

Answer 1 (score: 0)

So, here is a simple way:

import httplib2
h = httplib2.Http(".cache")
resp, content = h.request("http://www.python.org/", "GET")
print content

Source: https://code.google.com/p/httplib2/wiki/Examples
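
To turn a GET like this into the check the question asks for, the fetched body would still have to be compared against the page shown for unused links. A rough sketch using only the Python 3 standard library follows; fetch_body and is_placeholder are hypothetical names, and the marker string is something the reader would have to pick from the actual page, since the thread does not say what that page contains.

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def fetch_body(url, timeout=10):
    """GET a URL and return the response body as text."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except HTTPError as err:
        # Error responses (such as 404) still carry a body worth inspecting.
        return err.read().decode("utf-8", errors="replace")


def is_placeholder(body, marker):
    """Return True if the body contains the chosen marker text."""
    return marker in body
```

A link would then count as unused when is_placeholder(fetch_body(url), marker) is true. This downloads every page in full, so it is slower than a HEAD request and is mainly useful if the status code alone turns out not to distinguish the two cases.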