I have a spider that needs to find product prices. The products are grouped together in batches (coming from a database), and it would be nice to have a batch status (RUNNING, DONE) as well as start_time and finished_time attributes.
So I have something like this:
class PriceSpider(scrapy.Spider):
    name = 'prices'

    def start_requests(self):
        for batch in Batches.objects.all():
            batch.started_on = datetime.now()
            batch.status = 'RUNNING'
            batch.save()
            for prod in batch.get_products():
                yield scrapy.Request(prod.get_scrape_url(), meta={'prod': prod})
            batch.status = 'DONE'
            batch.finished_on = datetime.now()
            batch.save()  # <-- NOT COOL: this is going to execute
                          #     before the last product url is
                          #     scraped, right?

    def parse(self, response):
        # ...
The problem here is that, because of Scrapy's asynchronous nature, the second status update on the batch object will run far too soon... right? Is there a way to somehow group these requests together and update the batch object only when the last one has been parsed?
Answer 0 (score: 2)
Here is the trick.
For each request, send batch_id, total_products_in_this_batch and processed_this_batch along in meta, and then you can check them anywhere in any function:
for batch in Batches.objects.all():
    processed_this_batch = 0
    # TODO: Get some batch_id here
    # TODO: Find a way to check the total number of products in this batch
    #       and assign it to `total_products_in_this_batch`
    for prod in batch.get_products():
        processed_this_batch = processed_this_batch + 1
        yield scrapy.Request(prod.get_scrape_url(),
                             meta={'prod': prod,
                                   'batch_id': batch_id,
                                   'total_products_in_this_batch': total_products_in_this_batch,
                                   'processed_this_batch': processed_this_batch})
Anywhere in your code, for any particular batch, check if processed_this_batch == total_products_in_this_batch and, if so, save the batch (a sketch follows below).
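A minimal sketch of what that check could look like inside parse. The Batches.objects.get(id=...) lookup and the status/finished_on fields are assumptions carried over from the question's model, not part of this answer:

def parse(self, response):
    # ... extract the price for response.meta['prod'] as usual ...
    if response.meta['processed_this_batch'] == response.meta['total_products_in_this_batch']:
        # Assumed lookup: fetch the batch by the id passed in meta
        # and mark it as finished.
        batch = Batches.objects.get(id=response.meta['batch_id'])
        batch.status = 'DONE'
        batch.finished_on = datetime.now()
        batch.save()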
Answer 1 (score: 1)
For this kind of thing you can use the spider_closed signal, to which you can bind a function that runs when the spider is done crawling.
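For example, a spider can define a closed(reason) method, which Scrapy invokes through that signal when the crawl ends. A minimal sketch, assuming the question's Batches model and that every batch still marked RUNNING belongs to this crawl:

from datetime import datetime

import scrapy


class PriceSpider(scrapy.Spider):
    name = 'prices'

    # start_requests / parse as in the question ...

    def closed(self, reason):
        # Scrapy calls closed() when the spider finishes crawling;
        # it is a documented shortcut for the spider_closed signal.
        # Assumption: all RUNNING batches were started by this crawl.
        for batch in Batches.objects.filter(status='RUNNING'):
            batch.status = 'DONE'
            batch.finished_on = datetime.now()
            batch.save()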
Answer 2 (score: 0)
I made some adjustments to @Umair's suggestion and came up with a solution that works well for my case:
class PriceSpider(scrapy.Spider):
    name = 'prices'

    def start_requests(self):
        for batch in Batches.objects.all():
            batch.started_on = datetime.now()
            batch.status = 'RUNNING'
            batch.save()
            products = batch.get_products()
            counter = {'curr': 0, 'total': len(products)}  # the counter dictionary
                                                           # for this batch
            for prod in products:
                # trick: put the counter in the meta dict
                yield scrapy.Request(prod.get_scrape_url(),
                                     meta={'prod': prod,
                                           'batch': batch,
                                           'counter': counter})

    def parse(self, response):
        # process the response as desired
        batch = response.meta['batch']
        counter = response.meta['counter']
        self.increment_counter(batch, counter)  # increment the counter only
                                                # after the work is done

    def increment_counter(self, batch, counter):
        counter['curr'] += 1
        if counter['curr'] == counter['total']:
            batch.status = 'DONE'
            batch.finished_on = datetime.now()
            batch.save()  # GOOD!
                          # Well, almost...
This works fine as long as all the requests yielded from start_requests have different urls.
If there are duplicates, Scrapy filters them out and never calls your parse method for them,
so you end up with counter['curr'] < counter['total'] and the batch status stays RUNNING forever.
As it turns out, you can override Scrapy's behaviour for duplicates.
First, we need to change settings.py to specify an alternative "duplicates filter" class:
DUPEFILTER_CLASS = 'myspiders.shopping.MyDupeFilter'
Then we create the MyDupeFilter class, which lets the spider know when a duplicate shows up:
from scrapy.dupefilters import RFPDupeFilter


class MyDupeFilter(RFPDupeFilter):
    def log(self, request, spider):
        super(MyDupeFilter, self).log(request, spider)
        spider.look_a_dupe(request)
Then we modify our spider so that it increments the counter whenever a duplicate is found:
class PriceSpider(scrapy.Spider):
    name = 'prices'

    # ...

    def look_a_dupe(self, request):
        batch = request.meta['batch']
        counter = request.meta['counter']
        self.increment_counter(batch, counter)
And we are good to go.
Answer 3 (score: 0)
Here is my code. Both parser functions call the same AfterParserFinished(), which counts the number of invocations to determine when all the parsers are done:
countAccomplishedParsers: int = 0

def AfterParserFinished(self):
    self.countAccomplishedParsers = self.countAccomplishedParsers + 1
    print(self.countAccomplishedParsers)  # how many parsers have been accomplished
    if self.countAccomplishedParsers == 2:
        print("Accomplished: 2. Do something.")

def parse1(self, response):
    self.AfterParserFinished()

def parse2(self, response):
    self.AfterParserFinished()
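For context, a hedged sketch of how two such callbacks might be wired up inside a spider; the spider name and URLs are placeholders, not part of the original answer:

import scrapy


class TwoParserSpider(scrapy.Spider):
    name = 'two_parsers'          # placeholder name
    countAccomplishedParsers = 0  # how many parsers have reported back

    def start_requests(self):
        # One request per parser; each callback reports completion
        # through AfterParserFinished().
        yield scrapy.Request('https://example.com/page1', callback=self.parse1)
        yield scrapy.Request('https://example.com/page2', callback=self.parse2)

    def AfterParserFinished(self):
        self.countAccomplishedParsers += 1
        if self.countAccomplishedParsers == 2:
            print("Accomplished: 2. Do something.")

    def parse1(self, response):
        self.AfterParserFinished()

    def parse2(self, response):
        self.AfterParserFinished()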