Is there a way to tell Scrapy to stop crawling based on a condition found in a second-level page? I am doing the following:
Right now I am using CloseSpider() to accomplish this, but the problem is that by the time I start parsing the second-level pages, the URLs to be parsed have already been queued, and I don't know how to remove them from the queue. Is there a way to crawl the list of links sequentially and then be able to stop in parseDetailPage?
# Imports needed by the snippet below; WoodPeckerItem and WPUtil come from the
# asker's own project. The attributes and methods live inside the Spider subclass.
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.exceptions import CloseSpider

start_urls = ["http://sfbay.craigslist.org/sof/"]

def __init__(self):
    self.job_in_range = True

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    results = hxs.select('//blockquote[@id="toc_rows"]')
    items = []
    if results:
        links = results.select('.//p[@class="row"]/a/@href')
        for link in links:
            nextUrl = link.extract()
            # self.end_url is assumed to be set elsewhere on the spider
            if nextUrl == self.end_url:
                break
            if WPUtil.validateUrl(nextUrl):
                item = WoodPeckerItem()
                item['url'] = nextUrl
                request = Request(nextUrl, meta={'item': item}, callback=self.parseDetailPage)
                items.append(request)
    else:
        self.log('Could not parse the document')
    return items

def parseDetailPage(self, response):
    if self.job_in_range is False:
        raise CloseSpider('End date reached - No more crawling for ' + self.name)
    hxs = HtmlXPathSelector(response)
    print response
    body = hxs.select('//article[@id="pagecontainer"]/section[@class="body"]')
    item = response.meta['item']
    item['postDate'] = body.select('.//section[@class="userbody"]/div[@class="postinginfos"]/p')[1].select('.//date/text()')[0].extract()
    # Populate jobTitle before testing it; the original code checked it before assignment
    item['jobTitle'] = body.select('.//h2[@class="postingtitle"]/text()')[0].extract()
    if item['jobTitle'] == 'Admin':
        self.job_in_range = False
        raise CloseSpider('Stop crawling')
    item['description'] = body.select('.//section[@class="userbody"]/section[@id="postingbody"]').extract()
    return item
Answer 0 (score: 0)
Do you mean that you want to stop the spider and later resume it without re-parsing the URLs that have already been parsed? If so, you can try the JOBDIR setting, which persists the request queue to a specified directory on disk.
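A minimal sketch of what that could look like, assuming a spider named "woodpecker" (the spider name and the JOBDIR path are hypothetical; any directory will do):

# One-off: enable persistence from the command line; pausing (Ctrl-C) and
# re-running the same command resumes from the queue stored on disk:
#
#     scrapy crawl woodpecker -s JOBDIR=crawls/woodpecker-run1
#
# Or set it project-wide in settings.py:
JOBDIR = 'crawls/woodpecker-run1'

Note that this gives you pause/resume semantics; on its own it does not drop the detail-page requests that are already queued, it only keeps them on disk between runs.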