Fetching the content of a list of URLs

Date: 2015-04-13 08:52:17

Tags: python web-scraping

I have several lists of URLs whose HTML content I want to fetch. The URLs come from Twitter, so I don't know what they point to: they may be links to web pages, music, or videos. This is how I read the HTML content of the links in a URL list:

import pickle
import requests
from multiprocessing.dummy import Pool as ThreadPool

def fetch_url(url):

    output = None

    print "processing url {}".format(url)

    try:
        # sending the request
        req = requests.get(url, stream=True)

        # checking if it is an html page
        content_type = req.headers.get('content-type')
        if content_type and ('text/html' in content_type or 'application/xhtml+xml' in content_type):

            # reading the contents
            html = req.content
            req.close()

            output = html

        else:
            print "\t{} is not an HTML file".format(url)
            req.close()

    except Exception, e:
        print "\t HTTP request was not accepted for {}; {}".format(url, e)

    return output


with open('url_list_1.pkl', 'rb') as fp:
    url_list = pickle.load(fp)

"""
The url_list has such structure:
url_list = [u'http://t.co/qmIPqQVBmW',
            u'http://t.co/mE8krkEejV',
            ...]
"""

N_THREADS = 4  # assumed thread count; N_THREADS was not defined in the original snippet
pool = ThreadPool(N_THREADS)

# fetch the urls in their own threads and collect the results
results = pool.map(fetch_url, url_list)

# close the pool and wait for the work to finish
pool.close()
pool.join()

The code runs without any problems on most of the lists, but for some of them it gets stuck and never finishes. I suspect some URLs simply never return a response. How can I handle this? For example, wait X seconds for a request, and if there is no response, give up and move on to the next URL? Why does this happen?

1 Answer:

Answer 0 (score: 1)

Of course you can set a timeout (in seconds) on your request; it's very simple!

req = requests.get(url, stream=True, timeout=1)

Quoting the Python Requests documentation:

timeout is not a time limit on the entire response download; rather, an exception is raised if the server has not issued a response for timeout seconds (more precisely, if no bytes have been received on the underlying socket for timeout seconds).

More information: http://docs.python-requests.org/en/latest/user/quickstart/#timeouts
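
To make this concrete, here is a minimal sketch of how the timeout could be folded into the fetch_url function from the question. The 10-second REQUEST_TIMEOUT value and the separate except clause for requests.exceptions.Timeout are illustrative assumptions, not part of the original answer; a timed-out URL is simply reported and skipped, so pool.map still returns one result (None) per URL.

import requests

REQUEST_TIMEOUT = 10  # assumed limit in seconds; tune as needed

def fetch_url(url):

    output = None

    print "processing url {}".format(url)

    try:
        # raise Timeout if no bytes arrive on the socket for REQUEST_TIMEOUT seconds
        req = requests.get(url, stream=True, timeout=REQUEST_TIMEOUT)

        # only keep the body if the server says it is an HTML page
        content_type = req.headers.get('content-type')
        if content_type and ('text/html' in content_type or 'application/xhtml+xml' in content_type):
            output = req.content
        else:
            print "\t{} is not an HTML file".format(url)
        req.close()

    except requests.exceptions.Timeout:
        # the server did not respond in time; give up on this URL and move on
        print "\t{} timed out, skipping".format(url)

    except Exception, e:
        print "\t HTTP request was not accepted for {}; {}".format(url, e)

    return output

Note that the Timeout handler has to come before the generic Exception handler; otherwise the broader clause would swallow the timeout and you could not tell slow servers apart from other failures.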