Scrapy: wait for some urls to be parsed, then do something

Date: 2017-02-14 13:30:51

Tags: python scrapy

I have a spider that needs to find product prices. The products are grouped in batches (coming from a database), and it would be nice to have a batch status (RUNNING, DONE) along with start_time and finished_time attributes. So I have something like:

from datetime import datetime

import scrapy


class PriceSpider(scrapy.Spider):
    name = 'prices'

    def start_requests(self):
        for batch in Batches.objects.all():
            batch.started_on = datetime.now()
            batch.status = 'RUNNING'
            batch.save()
            for prod in batch.get_products():
                yield scrapy.Request(prod.get_scrape_url(), meta={'prod': prod})
            batch.status = 'DONE'
            batch.finished_on = datetime.now()
            batch.save()  # <-- NOT COOL: This is going to
                          # execute before the last product
                          # url is scraped, right?

    def parse(self, response):
        #...

The problem here is that, due to the asynchronous nature of scrapy, the second status update on the batch object will run too soon... right? Is there a way to somehow group these requests together and have the batch object updated only when the last one has been parsed?

4 answers:

Answer 0 (score: 2):

Here is the trick:

With each request, send batch_id, total_products_in_this_batch and processed_this_batch in the meta dict:

for batch in Batches.objects.all():
    processed_this_batch = 0
    # TODO: Get some batch_id here
    # TODO: Find a way to check total number of products in this batch
    #       and assign it to `total_products_in_this_batch`

    for prod in batch.get_products():
        processed_this_batch = processed_this_batch + 1
        yield scrapy.Request(prod.get_scrape_url(),
                             meta={'prod': prod,
                                   'batch_id': batch_id,
                                   'total_products_in_this_batch': total_products_in_this_batch,
                                   'processed_this_batch': processed_this_batch})

Then, anywhere in the code, for any particular batch, check if processed_this_batch == total_products_in_this_batch and save the batch.
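A minimal sketch of that check, placed in parse (the meta keys match the loop above; the lookup by batch_id is hypothetical, since obtaining the id is left as a TODO):

def parse(self, response):
    # ... extract the price for response.meta['prod'] ...
    meta = response.meta
    if meta['processed_this_batch'] == meta['total_products_in_this_batch']:
        # hypothetical lookup: fetch the batch by the id sent in meta
        batch = Batches.objects.get(pk=meta['batch_id'])
        batch.status = 'DONE'
        batch.finished_on = datetime.now()
        batch.save()

Note that responses can arrive out of order, so this condition fires when the last-yielded request completes, not necessarily after all of them; the adjustment in Answer 2 below increments a shared counter at parse time instead, which handles this.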

Answer 1 (score: 1):

For this kind of deal you can use the spider_closed signal: bind a function to it and it will run when the spider is done crawling.
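A minimal sketch of wiring that up (assuming the Django-style Batches model from the question; closing out every RUNNING batch in the handler is a hypothetical way to use the hook):

from datetime import datetime

import scrapy
from scrapy import signals


class PriceSpider(scrapy.Spider):
    name = 'prices'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(PriceSpider, cls).from_crawler(crawler, *args, **kwargs)
        # run on_spider_closed once the spider finishes crawling
        crawler.signals.connect(spider.on_spider_closed,
                                signal=signals.spider_closed)
        return spider

    def on_spider_closed(self, spider, reason):
        # hypothetical: mark any batch still RUNNING as DONE
        for batch in Batches.objects.filter(status='RUNNING'):
            batch.status = 'DONE'
            batch.finished_on = datetime.now()
            batch.save()

Scrapy also treats a plain closed(self, reason) method on the spider as a shortcut for this signal. The drawback relative to the counter approaches is granularity: spider_closed fires once for the whole crawl, so per-batch finished_on timestamps are only as precise as the crawl itself.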

Answer 2 (score: 0):

I made some adjustments to @Umair's suggestion and came up with a solution that works well for my case:

from datetime import datetime

import scrapy


class PriceSpider(scrapy.Spider):
    name = 'prices'

    def start_requests(self):
        for batch in Batches.objects.all():
            batch.started_on = datetime.now()
            batch.status = 'RUNNING'
            batch.save()
            products = batch.get_products()
            counter = {'curr': 0, 'total': len(products)}  # the counter dictionary
                                                           # for this batch
            for prod in products:
                yield scrapy.Request(prod.get_scrape_url(),
                                     meta={'prod': prod,
                                           'batch': batch,
                                           'counter': counter})
                                     # trick = add the counter in the meta dict

    def parse(self, response):
        # process the response as desired
        batch = response.meta['batch']
        counter = response.meta['counter']
        self.increment_counter(batch, counter)  # increment counter only after
                                                # the work is done

    def increment_counter(self, batch, counter):
        counter['curr'] += 1
        if counter['curr'] == counter['total']:
            batch.status = 'DONE'
            batch.finished_on = datetime.now()
            batch.save()  # GOOD!
                          # Well, almost...

This works fine as long as all the requests yielded from start_requests have different urls.

If there are duplicates, scrapy will filter them out and will not call your parse method, so you end up with counter['curr'] < counter['total'] and the batch status stays RUNNING forever.
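(As an aside: if simply re-fetching duplicate urls were acceptable, Scrapy's built-in dont_filter flag on Request would sidestep the problem entirely, as in the sketch below. The rest of this answer instead keeps the filter and counts the duplicates.)

# Sketch: bypass the dupefilter, so every yielded request gets parsed
# even when the same url appears more than once in a batch.
yield scrapy.Request(prod.get_scrape_url(),
                     meta={'prod': prod, 'batch': batch, 'counter': counter},
                     dont_filter=True)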

As it turns out, you can override scrapy's behavior for duplicates.

First, we need to change settings.py to specify an alternative "duplicates filter" class:

DUPEFILTER_CLASS = 'myspiders.shopping.MyDupeFilter'

Then we create the MyDupeFilter class, which lets the spider know when there is a duplicate:

from scrapy.dupefilters import RFPDupeFilter


class MyDupeFilter(RFPDupeFilter):
    def log(self, request, spider):
        super(MyDupeFilter, self).log(request, spider)
        spider.look_a_dupe(request)

Then we modify our spider so that it increments our counter whenever a duplicate is found:

class PriceSpider(scrapy.Spider):
    name = 'prices'

    #...

    def look_a_dupe(self, request):
        batch = request.meta['batch']
        counter = request.meta['counter']
        self.increment_counter(batch, counter)

And we are good to go.

Answer 3 (score: 0):

Here is my code. Both parser functions call the same AfterParserFinished(), which counts how many times it has been invoked to determine when all the parsers are done:

countAccomplishedParsers: int = 0

def AfterParserFinished(self):
    self.countAccomplishedParsers = self.countAccomplishedParsers + 1
    print(self.countAccomplishedParsers)  # how many parsers have been accomplished
    if self.countAccomplishedParsers == 2:
        print("Accomplished: 2. Do something.")


def parse1(self, response):
    self.AfterParserFinished()

def parse2(self, response):
    self.AfterParserFinished()