I am trying to follow the multithreading example given in "Python urllib2.urlopen() is slow, need a better way to read several urls", but I seem to be getting a "thread error" and I am not sure what it means.
urlList=[list of urls to be fetched]*100

def read_url(url, queue):
    my_data=[]
    try:
        data = urllib2.urlopen(url,None,15).read()
        print('Fetched %s from %s' % (len(data), url))
        my_data.append(data)
        queue.put(data)
    except HTTPError, e:
        data = urllib2.urlopen(url).read()
        print('Fetched %s from %s' % (len(data), url))
        my_data.append(data)
        queue.put(data)

def fetch_parallel():
    result = Queue.Queue()
    threads = [threading.Thread(target=read_url, args = (url,result)) for url in urlList]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result

res=[]
res=fetch_parallel()
reslist = []
while not res.empty: reslist.append(res.get())
print (reslist)
I get the following first error:
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 505, in run
    self.__target(*self.__args, **self.__kwargs)
  File "demo.py", line 76, in read_url
    print('Fetched %s from %s' % (len(data), url))
TypeError: object of type 'instancemethod' has no len()
On the other hand, I see that sometimes it does seem to fetch data, but then I get the following second error:
Traceback (most recent call last):
  File "demo.py", line 89, in <module>
    print str(res[0])
AttributeError: Queue instance has no attribute '__getitem__'
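When it does fetch the data, why don't the results show up in res[]? Thanks for your time.
Update: after changing read to read() in the read_url() function, things have improved (many pages are now fetched), but I still get an error: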
Exception in thread Thread-86:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 505, in run
    self.__target(*self.__args, **self.__kwargs)
  File "demo.py", line 75, in read_url
    data = urllib2.urlopen(url).read()
  File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 397, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 510, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 429, in error
    result = self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 605, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/usr/lib/python2.7/urllib2.py", line 397, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 510, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 435, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 518, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 502: Bad Gateway
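Answer 0 (score: 4)
Be aware that urllib2 is not thread-safe. You should therefore use urllib3, which is designed to be thread-safe.
Some of your problems have nothing to do with threading at all. Threading only makes the error reports more complicated. Instead of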
data = urllib2.urlopen(url).read
you want
data = urllib2.urlopen(url).read()
#                               ^^
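The difference is easy to reproduce without any network access. The following is a minimal sketch, assuming Python 3 and using an in-memory `io.BytesIO` object as a stand-in for the response that `urlopen` returns (the question itself uses Python 2's urllib2, where the error message names `instancemethod` instead). Without the parentheses, the variable holds the method itself, and `len()` rejects it.

```python
import io

# Stand-in for the file-like response object that urlopen() returns.
response = io.BytesIO(b"<html>hello</html>")

not_called = response.read      # the method object itself; nothing is read
try:
    len(not_called)
    length_error = None
except TypeError as exc:
    length_error = exc          # a method has no length

data = response.read()          # calling the method returns the bytes
print("read() returned %d bytes" % len(data))
```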
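The 502 Bad Gateway error means that the gateway or proxy in front of the service received an invalid response from an upstream server, which points to a server-side problem (most likely an internal server of the web service you are connecting to is restarting or unavailable). There is nothing you can do about it: the URL is simply not reachable at the moment. Use try..except to handle these errors, for example by printing a diagnostic message, by scheduling the URL to be retrieved again after an appropriate waiting period, or by leaving the failed data set out.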
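A rough sketch of the "wait, then try again" option, written for Python 3 (where `HTTPError` and `URLError` live in `urllib.error`). The helper name `fetch_with_retry`, its `retries` and `wait` parameters, and the `flaky_fetch` stub are all hypothetical, introduced only so the example runs without network access; how many retries and how long to wait depends on your application.

```python
import time
import urllib.error


def fetch_with_retry(url, fetch, retries=3, wait=1.0):
    """Call fetch(url); on URLError/HTTPError, wait and try again.

    Returns the fetched data, or None if every attempt failed.
    """
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except urllib.error.URLError as exc:  # HTTPError is a subclass of URLError
            print("Attempt %d for %s failed: %s" % (attempt, url, exc))
            if attempt < retries:
                time.sleep(wait)
    return None


# Hypothetical stand-in for a real fetcher: fails twice, then succeeds.
calls = {"n": 0}


def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise urllib.error.URLError("Bad Gateway")
    return b"payload"


result = fetch_with_retry("http://example.invalid/", flaky_fetch, wait=0.01)
print(result)
```

With a real connection, `fetch` could be something like `lambda u: urllib.request.urlopen(u, timeout=15).read()`.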
To get the values out of the queue, you can do the following:
res = fetch_parallel()
reslist = []
while not res.empty():
    reslist.append(res.get_nowait()) # or get, doesn't matter here
print (reslist)
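The loop in the question tests `res.empty` without calling it. A bound method is always truthy, so `not res.empty` is always `False` and the loop body never runs, which is why `reslist` stays empty. A Queue also cannot be indexed (`res[0]`), which explains the second traceback: items have to be taken out with `get`. A small sketch of the difference, assuming Python 3 (where the module is named `queue`) and using placeholder strings instead of downloaded pages:

```python
import queue

res = queue.Queue()
for item in ("page-1", "page-2", "page-3"):
    res.put(item)

# Bug from the question: the method object is truthy, so the condition is False
# and nothing is ever taken from the queue.
skipped = []
while not res.empty:
    skipped.append(res.get())

# Fix: call empty(). This is safe here because no other thread is still
# putting items (in the question, every worker has already been joined).
reslist = []
while not res.empty():
    reslist.append(res.get_nowait())

print(skipped)   # []
print(reslist)   # ['page-1', 'page-2', 'page-3']
```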
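There is also no real error handling possible if a URL is genuinely unreachable. Simply re-requesting it may work in some cases, but you have to be able to handle the case where the remote host really is unreachable at this time. How you do that depends on your application's logic.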
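One possible approach, sketched under assumptions rather than offered as the definitive fix: each worker always puts a `(url, data, error)` tuple on the queue, so the caller can see which URLs failed and decide what to do with them. The code targets Python 3; the extra `fetch` parameter and the `fake_fetch` stub are hypothetical and exist only so the example runs without network access.

```python
import queue
import threading
import urllib.error


def read_url(url, result, fetch):
    """Fetch one URL and always report an outcome on the queue."""
    try:
        result.put((url, fetch(url), None))
    except OSError as exc:  # covers URLError, HTTPError and socket timeouts
        # Record the failure instead of letting the thread die with a traceback.
        result.put((url, None, exc))


def fetch_parallel(urls, fetch):
    result = queue.Queue()
    threads = [threading.Thread(target=read_url, args=(url, result, fetch))
               for url in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # All workers are done, so the queue can be drained safely.
    outcomes = []
    while not result.empty():
        outcomes.append(result.get_nowait())
    return outcomes


def fake_fetch(url):
    # Hypothetical stub: pretend one host answers with an error.
    if "bad" in url:
        raise urllib.error.URLError("Bad Gateway")
    return b"content of " + url.encode()


outcomes = fetch_parallel(["http://ok.invalid/a", "http://bad.invalid/b"], fake_fetch)
fetched = {url: data for url, data, err in outcomes if err is None}
failed = {url: err for url, data, err in outcomes if err is not None}
print("fetched:", sorted(fetched))
print("failed:", sorted(failed))
```

Whether the failed URLs are retried later, logged, or dropped is then a decision for the calling code.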