I have several lists of URLs and I want to fetch their HTML content. The URLs come from Twitter, so I don't know what they point to: they may link to web pages, but also to music or videos. This is how I read the HTML content of the links in a URL list:
import pickle
import requests
from multiprocessing.dummy import Pool as ThreadPool

N_THREADS = 4  # assumed worker count; not defined in the original snippet

def fetch_url(url):
    """Return the body of `url` if it is an HTML page, else None."""
    output = None
    print("processing url {}".format(url))
    try:
        # sending the request (stream=True defers downloading the body)
        req = requests.get(url, stream=True)
        # checking if it is an html page; default to '' so a missing
        # content-type header does not raise a TypeError
        content_type = req.headers.get('content-type', '')
        if 'text/html' in content_type or 'application/xhtml+xml' in content_type:
            # reading the contents
            output = req.content
        else:
            print("\t{} is not an HTML file".format(url))
        req.close()
    except Exception as e:
        print("\tHTTP request was not accepted for {}; {}".format(url, e))
    return output
with open('url_list_1.pkl', 'rb') as fp:
    url_list = pickle.load(fp)
"""
url_list has the following structure:
url_list = [u'http://t.co/qmIPqQVBmW',
u'http://t.co/mE8krkEejV',
...]
"""
pool = ThreadPool(N_THREADS)
# fetch the URLs in their own threads and collect the results
results = pool.map(fetch_url, url_list)
# close the pool and wait for the work to finish
pool.close()
pool.join()
The code runs without any problem on most of the lists, but for some of them it gets stuck and never finishes. I think some of the URLs simply never return a response. How can I fix this? For example, could I wait X seconds for a request and, if there is no response, forget it and move on to the next URL? And why does this happen?
Answer 0 (score: 1)
Sure, you can set a timeout (in seconds) on your request; it is very simple!
req = requests.get(url, stream=True, timeout=1)
Quoting the python-requests documentation:
timeout is not a time limit on the entire response download; rather, an exception is raised if the server has not issued a response for timeout seconds (more precisely, if no bytes have been received on the underlying socket for timeout seconds).
More info: http://docs.python-requests.org/en/latest/user/quickstart/#timeouts
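As a minimal sketch of how this could look inside the fetch_url function from the question: the 5-second limit is an illustrative choice of mine, and catching requests.exceptions.Timeout explicitly lets a hung URL be skipped instead of blocking the pool.

import requests

def fetch_url(url, timeout=5):
    """Return the HTML body of `url`, or None if it is not HTML or times out."""
    try:
        # timeout=5 is an assumed value; tune it to your workload
        req = requests.get(url, stream=True, timeout=timeout)
        content_type = req.headers.get('content-type', '')
        if 'text/html' in content_type or 'application/xhtml+xml' in content_type:
            html = req.content
            req.close()
            return html
        print("\t{} is not an HTML file".format(url))
        req.close()
    except requests.exceptions.Timeout:
        # no bytes arrived within `timeout` seconds: give up and move on
        print("\t{} timed out; skipping".format(url))
    except requests.exceptions.RequestException as e:
        print("\tHTTP request was not accepted for {}; {}".format(url, e))
    return None

With this version, pool.map simply receives None for any URL that timed out, just as it already does for non-HTML links.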