I am currently writing a simple crawler in Python 2.7 using urllib2. This is the downloader class.
    import urllib2

    class Downloader:
        def __init__(self, limit = 3):
            self.limit = limit
        def downloadGet(self, url):
            request = urllib2.Request(url)
            retry = 0
            succ = False
            page = None
            while retry < self.limit:
                print "Retry: " + str(retry) + " Limit:" + str(self.limit)
                try:
                    response = urllib2.urlopen(request)
                    page = response.read()
                    succ = True
                    break
                except:
                    retry += 1
            return succ, page
Each URL is tried three times. Multithreading is also used; the thread code is as follows:
    from threading import Thread

    class DownloadThread(Thread):
        def __init__(self, requestGet, limit):
            Thread.__init__(self)
            self.requestGet = requestGet
            self.downloader = Downloader(limit)
        def run(self):
            while True:
                url = self.requestGet()
                if url is None:
                    break
                ret = self.download(url)
                print ret
        def download(self, url):
            # some other stuff
            succ, flv = self.downloader.downloadGet(url)
            return succ
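However, during experiments with the number of threads set to 5, the downloader does not stop after 3 attempts. For some threads, the output even shows "Retry: 4280 Limit:3". It seems the while condition is being ignored.
Any help and suggestions are welcome. Thanks!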
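Answer 0 (score: 5)
One possible cause of the infinite loop in downloadGet is that limit is a string object.
If limit is a string, retry < self.limit evaluates to True in Python 2.x, because there a number always compares as less than a string:
    >>> retry = 4280
    >>> limit = '3'
    >>> retry < limit
    True
Check the type of the limit that is passed in.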
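One way to guard against this, sketched here under the assumption that limit may arrive as a string (for example from a config file or the command line), is to coerce it to int in the constructor. Only the constructor is shown; the snippet avoids Python 2-only syntax, so it also runs on Python 3, where comparing an int with a str raises TypeError instead of quietly returning True.

```python
class Downloader(object):
    def __init__(self, limit=3):
        # int() turns '3' into 3 and raises ValueError on junk such as 'abc',
        # so a bad limit fails loudly here instead of looping forever later.
        self.limit = int(limit)
```

With this change, Downloader('3').limit is the integer 3, and the comparison retry < self.limit behaves as intended.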
Answer 1 (score: 0)
If the URL is not None, nothing in the DownloadThread code breaks out of the while loop.
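Put differently, the loop in run ends only once requestGet returns None. A minimal sketch of such a callable, assuming the URLs sit in a thread-safe queue that is fully populated before the threads start (make_request_get is a hypothetical helper, not part of the original code):

```python
try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2.7

def make_request_get(urls):
    """Return a callable that hands out each URL once, then None."""
    pending = queue.Queue()
    for url in urls:
        pending.put(url)

    def request_get():
        try:
            # get_nowait() raises queue.Empty instead of blocking
            return pending.get_nowait()
        except queue.Empty:
            return None  # tells DownloadThread.run to break

    return request_get
```

Each DownloadThread can be given the same request_get; once the queue is drained, every thread receives None and exits.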
Answer 2 (score: 0)
You should define your loop in a more Pythonic way:
    def downloadGet(self, url):
        ...
        # do not declare retry before this
        for retry in xrange(self.limit):
            ...
            try:
                ...
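A fuller version of this bounded loop might look like the sketch below. It assumes the network call is passed in as fetch, a hypothetical stand-in for the urllib2 request so the snippet runs without network access; download_with_retries is likewise an illustrative name.

```python
def download_with_retries(fetch, url, limit=3):
    """Call fetch(url) up to `limit` times; return (succ, page)."""
    # range() works on Python 2 and 3; xrange is Python 2 only.
    for retry in range(limit):
        try:
            return True, fetch(url)
        except IOError:
            # urllib2.URLError subclasses IOError, so network failures
            # land here and trigger the next attempt; other errors propagate.
            pass
    return False, None
```

Because the for loop owns the counter, no exception path can skip the increment, and the number of attempts can never exceed limit.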
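Edit:
Alternatively, you can use the while condition to handle your loop state more clearly than relying on break (though I think my first example is less brittle):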
    def downloadGet(self, url):
        ...
        while retry in xrange(self.limit) and not succ:
            ...
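This has the advantage of being more self-documenting.
That said, I would consider moving the retry loop into download rather than downloadGet. Like this: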
    class DownloadThread(Thread):
        ...
        def download(self, url):
            for retry in xrange(self.downloader.limit):
                succ, flv = self.downloader.downloadGet(url)
                if succ:
                    return succ
            # every attempt failed
            return False
    class Downloader(object):
        ...
        def downloadGet(self, url):
            request = urllib2.Request(url)
            try:
                response = urllib2.urlopen(request)
                page = response.read()
            # always qualify your exception handlers
            # or you may be masking errors you don't know about
            except urllib2.HTTPError:
                return False, None
            return True, page
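Note that urllib2.HTTPError only covers responses with an HTTP error status; connection-level failures such as a DNS error raise urllib2.URLError, its base class. A sketch that catches the base class, assuming both kinds should count as a failed attempt (fetch_once and its opener parameter are hypothetical names; opener defaults to the real urlopen):

```python
try:
    from urllib2 import urlopen, URLError, HTTPError   # Python 2.7
except ImportError:
    from urllib.request import urlopen                 # Python 3
    from urllib.error import URLError, HTTPError

def fetch_once(url, opener=urlopen):
    """Make a single attempt; return (succ, page)."""
    try:
        return True, opener(url).read()
    except URLError:
        # HTTPError is a subclass of URLError, so this catches both.
        return False, None
```

Passing a fake opener makes the failure path easy to exercise without touching the network.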