I have several lists of URLs and I want to fetch their HTML content. The URLs come from Twitter, so I don't know what they point to: they may link to web pages, but also to music or videos. This is how I read the HTML content of the links in a URL list:
import pickle
import requests
from multiprocessing.dummy import Pool as ThreadPool

N_THREADS = 4  # assumed worker count; not defined in the original snippet

def fetch_url(url):
    """Return the body of `url` if it is an HTML page, else None."""
    output = None
    print("processing url {}".format(url))
    try:
        # sending the request (stream=True defers downloading the body)
        req = requests.get(url, stream=True)
        # checking if it is an html page; default to '' so a missing
        # content-type header does not raise a TypeError
        content_type = req.headers.get('content-type', '')
        if 'text/html' in content_type or 'application/xhtml+xml' in content_type:
            # reading the contents
            output = req.content
        else:
            print("\t{} is not an HTML file".format(url))
        req.close()
    except Exception as e:
        print("\tHTTP request was not accepted for {}; {}".format(url, e))
    return output
with open('url_list_1.pkl', 'rb') as fp:
    url_list = pickle.load(fp)
"""
url_list has the following structure:
url_list = [u'http://t.co/qmIPqQVBmW',
u'http://t.co/mE8krkEejV',
...]
"""
pool = ThreadPool(N_THREADS)
# fetch the URLs in their own threads and collect the results
results = pool.map(fetch_url, url_list)
# close the pool and wait for the work to finish
pool.close()
pool.join()
The code runs without any problem on most of the lists, but for some of them it gets stuck and never finishes. I think some of the URLs simply never return a response. How can I fix this? For example, could I wait X seconds for a request and, if there is no response, forget it and move on to the next URL? And why does this happen?
Answer 0 (score: 1)
Sure, you can set a timeout (in seconds) on your request; it is very simple!
req = requests.get(url, stream=True, timeout=1)
Quoting the python-requests documentation:
timeout is not a time limit on the entire response download; rather, an exception is raised if the server has not issued a response for timeout seconds (more precisely, if no bytes have been received on the underlying socket for timeout seconds).
More info: http://docs.python-requests.org/en/latest/user/quickstart/#timeouts
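As a minimal sketch of how this could look inside the fetch_url function from the question: the 5-second limit is an illustrative choice of mine, and catching requests.exceptions.Timeout explicitly lets a hung URL be skipped instead of blocking the pool.

import requests

def fetch_url(url, timeout=5):
    """Return the HTML body of `url`, or None if it is not HTML or times out."""
    try:
        # timeout=5 is an assumed value; tune it to your workload
        req = requests.get(url, stream=True, timeout=timeout)
        content_type = req.headers.get('content-type', '')
        if 'text/html' in content_type or 'application/xhtml+xml' in content_type:
            html = req.content
            req.close()
            return html
        print("\t{} is not an HTML file".format(url))
        req.close()
    except requests.exceptions.Timeout:
        # no bytes arrived within `timeout` seconds: give up and move on
        print("\t{} timed out; skipping".format(url))
    except requests.exceptions.RequestException as e:
        print("\tHTTP request was not accepted for {}; {}".format(url, e))
    return None

With this version, pool.map simply receives None for any URL that timed out, just as it already does for non-HTML links.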