I have a spider that needs to find product prices. The products are grouped together in batches (coming from a database), and it would be nice to have a batch status (RUNNING, DONE) as well as start_time and finished_time attributes.
So I have something like this:
class PriceSpider(scrapy.Spider):
    name = 'prices'

    def start_requests(self):
        for batch in Batches.objects.all():
            batch.started_on = datetime.now()
            batch.status = 'RUNNING'
            batch.save()
            for prod in batch.get_products():
                yield scrapy.Request(prod.get_scrape_url(), meta={'prod': prod})
            batch.status = 'DONE'
            batch.finished_on = datetime.now()
            batch.save()  # <-- NOT COOL: this is going to execute
                          #     before the last product url is
                          #     scraped, right?

    def parse(self, response):
        # ...
The problem here is that, because of Scrapy's asynchronous nature, the second status update on the batch object will run far too soon... right? Is there a way to somehow group these requests together and update the batch object only when the last one has been parsed?
Answer 0 (score: 2)
Here is the trick.
For each request, send batch_id, total_products_in_this_batch and processed_this_batch along in meta, and then you can check them anywhere in any function:
for batch in Batches.objects.all():
    processed_this_batch = 0
    # TODO: Get some batch_id here
    # TODO: Find a way to check the total number of products in this batch
    #       and assign it to `total_products_in_this_batch`
    for prod in batch.get_products():
        processed_this_batch = processed_this_batch + 1
        yield scrapy.Request(prod.get_scrape_url(),
                             meta={'prod': prod,
                                   'batch_id': batch_id,
                                   'total_products_in_this_batch': total_products_in_this_batch,
                                   'processed_this_batch': processed_this_batch})
Anywhere in your code, for any particular batch, check if processed_this_batch == total_products_in_this_batch and, if so, save the batch (a sketch follows below).
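A minimal sketch of what that check could look like inside parse. The Batches.objects.get(id=...) lookup and the status/finished_on fields are assumptions carried over from the question's model, not part of this answer:

def parse(self, response):
    # ... extract the price for response.meta['prod'] as usual ...
    if response.meta['processed_this_batch'] == response.meta['total_products_in_this_batch']:
        # Assumed lookup: fetch the batch by the id passed in meta
        # and mark it as finished.
        batch = Batches.objects.get(id=response.meta['batch_id'])
        batch.status = 'DONE'
        batch.finished_on = datetime.now()
        batch.save()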
Answer 1 (score: 1)
For this kind of thing you can use the spider_closed signal, to which you can bind a function that runs when the spider is done crawling.
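For example, a spider can define a closed(reason) method, which Scrapy invokes through that signal when the crawl ends. A minimal sketch, assuming the question's Batches model and that every batch still marked RUNNING belongs to this crawl:

from datetime import datetime

import scrapy


class PriceSpider(scrapy.Spider):
    name = 'prices'

    # start_requests / parse as in the question ...

    def closed(self, reason):
        # Scrapy calls closed() when the spider finishes crawling;
        # it is a documented shortcut for the spider_closed signal.
        # Assumption: all RUNNING batches were started by this crawl.
        for batch in Batches.objects.filter(status='RUNNING'):
            batch.status = 'DONE'
            batch.finished_on = datetime.now()
            batch.save()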
Answer 2 (score: 0)
I made some adjustments to @Umair's suggestion and came up with a solution that works well for my case:
class PriceSpider(scrapy.Spider):
    name = 'prices'

    def start_requests(self):
        for batch in Batches.objects.all():
            batch.started_on = datetime.now()
            batch.status = 'RUNNING'
            batch.save()
            products = batch.get_products()
            counter = {'curr': 0, 'total': len(products)}  # the counter dictionary
                                                           # for this batch
            for prod in products:
                # trick: put the counter in the meta dict
                yield scrapy.Request(prod.get_scrape_url(),
                                     meta={'prod': prod,
                                           'batch': batch,
                                           'counter': counter})

    def parse(self, response):
        # process the response as desired
        batch = response.meta['batch']
        counter = response.meta['counter']
        self.increment_counter(batch, counter)  # increment the counter only
                                                # after the work is done

    def increment_counter(self, batch, counter):
        counter['curr'] += 1
        if counter['curr'] == counter['total']:
            batch.status = 'DONE'
            batch.finished_on = datetime.now()
            batch.save()  # GOOD!
                          # Well, almost...
This works fine as long as all the requests yielded from start_requests have different urls.
If there are duplicates, Scrapy filters them out and never calls your parse method for them,
so you end up with counter['curr'] < counter['total'] and the batch status stays RUNNING forever.
As it turns out, you can override Scrapy's behaviour for duplicates.
First, we need to change settings.py to specify an alternative "duplicates filter" class:
DUPEFILTER_CLASS = 'myspiders.shopping.MyDupeFilter'
Then we create the MyDupeFilter class, which lets the spider know when a duplicate shows up:
from scrapy.dupefilters import RFPDupeFilter


class MyDupeFilter(RFPDupeFilter):
    def log(self, request, spider):
        super(MyDupeFilter, self).log(request, spider)
        spider.look_a_dupe(request)
Then we modify our spider so that it increments the counter whenever a duplicate is found:
class PriceSpider(scrapy.Spider):
    name = 'prices'

    # ...

    def look_a_dupe(self, request):
        batch = request.meta['batch']
        counter = request.meta['counter']
        self.increment_counter(batch, counter)
And we are good to go.
Answer 3 (score: 0)
Here is my code. Both parser functions call the same AfterParserFinished(), which counts the number of invocations to determine when all the parsers are done:
countAccomplishedParsers: int = 0

def AfterParserFinished(self):
    self.countAccomplishedParsers = self.countAccomplishedParsers + 1
    print(self.countAccomplishedParsers)  # how many parsers have been accomplished
    if self.countAccomplishedParsers == 2:
        print("Accomplished: 2. Do something.")

def parse1(self, response):
    self.AfterParserFinished()

def parse2(self, response):
    self.AfterParserFinished()
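For context, a hedged sketch of how two such callbacks might be wired up inside a spider; the spider name and URLs are placeholders, not part of the original answer:

import scrapy


class TwoParserSpider(scrapy.Spider):
    name = 'two_parsers'          # placeholder name
    countAccomplishedParsers = 0  # how many parsers have reported back

    def start_requests(self):
        # One request per parser; each callback reports completion
        # through AfterParserFinished().
        yield scrapy.Request('https://example.com/page1', callback=self.parse1)
        yield scrapy.Request('https://example.com/page2', callback=self.parse2)

    def AfterParserFinished(self):
        self.countAccomplishedParsers += 1
        if self.countAccomplishedParsers == 2:
            print("Accomplished: 2. Do something.")

    def parse1(self, response):
        self.AfterParserFinished()

    def parse2(self, response):
        self.AfterParserFinished()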