I have a project that scrapes pages between two numbers. My spider is below: it starts at one number, ends at another, and scrapes the pages in between.
I want it to stop after 10 consecutive 404 pages, but it must save the CSV with everything scraped up to the point where it stopped.
Extra: is it possible to write the number it stopped at to another text file?
Here is a sample of my log:
2017-01-25 19:57:25 [scrapy.core.scraper] DEBUG: Scraped from <200 https://domain.com/entry/65848514>
{'basligi': [u'murat boz'],
'entry': [u'<a href=https://domain.com/entry/65848514'],
'favori': [u'0'],
'yazari': [u'thrones']}
2017-01-25 19:57:25 [scrapy.core.scraper] DEBUG: Scraped from <200 https://domain.com/entry/65848520>
{'basligi': [u'fatih portakal'],
'entry': [u'<a href=https://domain.com/entry/65848520'],
'favori': [u'0'],
'yazari': [u'agamustaf']}
2017-01-25 19:57:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://domain.com/entry/65848525> (referer: None)
2017-01-25 19:57:26 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://domain.com/entry/65848528> (referer: None)
2017-01-25 19:57:26 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://domain.com/entry/65848529> (referer: None)
2017-01-25 19:57:26 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://domain.com/entry/65848527> (referer: None)
My spider:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
from project.items import ProjectItem
from scrapy import Request

class MySpider(BaseSpider):
    name = "project"
    allowed_domains = ["domain.com"]
    start_urls = ["https://domain.com/entry/%d" % i for i in range(65848505,75848535)]

    def parse(self, response):
        titles = HtmlXPathSelector(response).select('//li')
        for title in titles:
            item = ProjectItem()
            item['favori'] = title.select("//*[@id='entry-list']/li/@data-favorite-count").extract()
            item['entry'] = ['<a href=https://domain.com%s'%a for a in title.select("//*[@class='entry-date permalink']/@href").extract()]
            item['yazari'] = title.select("//*[@id='entry-list']/li/@data-author").extract()
            item['basligi'] = title.select("//*[@id='topic']/h1/@data-title").extract()
            return item
Answer 0 (score: 0)
There are many ways to do this. The simplest is to catch the 404 responses in a callback, count them, and raise a CloseSpider exception once the condition is met. For example:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
from project.items import ProjectItem
from scrapy import Request
from scrapy.exceptions import CloseSpider

class MySpider(BaseSpider):
    name = "project"
    allowed_domains = ["domain.com"]
    start_urls = ["https://domain.com/entry/%d" % i for i in range(65848505,75848535)]
    handle_httpstatus_list = [404]  # to catch 404 with callback
    count_404 = 0

    def parse(self, response):
        if response.status == 404:
            self.count_404 += 1
            if self.count_404 == 10:
                # stop spider on condition
                raise CloseSpider('Number of 404 errors exceeded')
            return None
        else:
            self.count_404 = 0
            titles = HtmlXPathSelector(response).select('//li')
            for title in titles:
                item = ProjectItem()
                item['favori'] = title.select("//*[@id='entry-list']/li/@data-favorite-count").extract()
                item['entry'] = ['<a href=https://domain.com%s'%a for a in title.select("//*[@class='entry-date permalink']/@href").extract()]
                item['yazari'] = title.select("//*[@id='entry-list']/li/@data-author").extract()
                item['basligi'] = title.select("//*[@id='topic']/h1/@data-title").extract()
                return item
A more elegant solution would be to write a custom downloader middleware to handle this case.
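For illustration only, a minimal sketch of what such a middleware might look like (the class name, the MAX_CONSECUTIVE_404 setting and the 'too_many_404' close reason below are assumptions, not part of this answer):

class Consecutive404CloseMiddleware(object):
    # Downloader middleware sketch: close the spider after N consecutive 404s.
    # The setting name MAX_CONSECUTIVE_404 is hypothetical.

    def __init__(self, crawler):
        self.crawler = crawler
        self.max_404 = crawler.settings.getint('MAX_CONSECUTIVE_404', 10)
        self.count = 0

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_response(self, request, response, spider):
        if response.status == 404:
            self.count += 1
            if self.count >= self.max_404:
                self.crawler.engine.close_spider(spider, 'too_many_404')
        else:
            self.count = 0
        return response  # always pass the response on unchanged

It would still have to be enabled in DOWNLOADER_MIDDLEWARES, and with concurrent requests the "consecutive" count is only approximate, since responses can arrive out of order.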
PS: start_urls was left as in the question, but generating a list of 10,000,000 links and keeping it in memory is an extreme overhead; you should use a generator for start_urls or override start_requests instead, as in the sketch below.
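As a rough illustration of that last point, a start_requests generator for the same range could look like this (a sketch only, Python 2 like the rest of the answer, reusing the Request import already present in the spider):

    # Inside MySpider: drop start_urls and yield requests lazily instead of
    # materialising ~10,000,000 URLs in memory (Python 2, hence xrange).
    def start_requests(self):
        for i in xrange(65848505, 75848535):
            yield Request("https://domain.com/entry/%d" % i, callback=self.parse)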
Answer 1 (score: -1)
For a cleaner project you can write it as an extension:
extensions.py
from scrapy.exceptions import NotConfigured
from scrapy import signals
from urlparse import urlparse

class CloseSpiderByStatusCount(object):

    def __init__(self, crawler):
        if not crawler.settings.getint('CLOSESPIDER_BYSTATUS_ENABLED', False):
            raise NotConfigured

        self.crawler = crawler
        self.status = crawler.settings.getint('CLOSESPIDER_BYSTATUS_STATUS', 404)
        self.closing_count = crawler.settings.getint('CLOSESPIDER_BYSTATUS_COUNT', 10)
        self.count = 0

        crawler.signals.connect(self.status_count, signal=signals.response_received)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def status_count(self, response, request, spider):
        if response.status == self.status:
            self.count += 1
        else:
            self.count = 0

        if self.count == self.closing_count:
            f = open('filename.txt', 'w')
            f.write(urlparse(request.url).path.split('/')[-1])
            self.crawler.engine.close_spider(spider, 'closespider_statuscount')
Then don't forget to activate it in your settings and add the new setting variables this extension uses:
settings.py
# activating the extension
EXTENSIONS = {
    ...
    'myproject.extensions.CloseSpiderByStatusCount': 100,
    ...
}

CLOSESPIDER_BYSTATUS_ENABLED = True
CLOSESPIDER_BYSTATUS_STATUS = 404
CLOSESPIDER_BYSTATUS_COUNT = 10
Now you can configure the status and count from your settings.