Question

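The site I'm scraping has multiple products that share the same ID but have different prices. I want to keep only the lowest-priced version.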

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = dict()

    def process_item(self, item, spider):
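        if item['ID'] in self.ids_seen:
            # a costlier duplicate is dropped; a cheaper one falls through unhandled
            if item['sale_price'] > self.ids_seen[item['ID']]:
                raise DropItem("Duplicate item found: %s" % item)
        else:
            # first time this ID is seen: remember its price (a dict has no .add())
            self.ids_seen[item['ID']] = item['sale_price']
            return item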

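So this code should drop items whose price is higher than the one seen before, but I can't figure out how to update the previously scraped item if the new price is lower.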

# -*- coding: utf-8 -*-
import scrapy
import urlparse
import re

class ExampleSpider(scrapy.Spider):
    name = 'name'
    allowed_domains = ['domain1','domain2']
    start_urls = ['url1','url2']

    def parse(self, response):
        for href in response.css('div.catalog__main__content .c-product-card__name::attr("href")').extract():
            url = urlparse.urljoin(response.url, href) 
            yield scrapy.Request(url=url, callback=self.parse_product)

        # follow pagination links
        href = response.css('.c-paging__next-link::attr("href")').extract_first()
        if href is not None:
            url = urlparse.urljoin(response.url, href)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse_product(self, response):
        # process the response here (omitted because it's long and doesn't add anything)
        yield {
            'product-name': name,
            'price-sale': price_sale,
            'price-regular': price_regular[:-1],
            'raw-sku': raw_sku,
            'sku': sku.replace('_','/'),
            'img': response.xpath('//img[@class="itm-img"]/@src').extract()[-1],
            'description': response.xpath('//div[@class="product-description__block"]/text()').extract_first(),
            'url' : response.url,
        }

Answer 1

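You can't do this with a pipeline, because a pipeline processes items on the fly. In other words, it returns each item without waiting for the spider to finish.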
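To make that concrete, here is a minimal sketch in plain Python (no Scrapy involved; `stream_filter` and the sample items are made up for illustration) that mimics a filter handling items one at a time. It can drop a costlier duplicate that arrives later, but once an item has been passed on it cannot be taken back when a cheaper one shows up.

```python
def stream_filter(items):
    """Yield items one at a time, roughly the way an item pipeline sees them."""
    lowest = {}  # ID -> lowest price seen so far
    for item in items:
        seen = lowest.get(item["ID"])
        if seen is not None and item["price"] >= seen:
            continue  # costlier (or equal) duplicate: this one can be dropped
        lowest[item["ID"]] = item["price"]
        yield item  # already handed downstream; it cannot be recalled later


items = [
    {"ID": "a", "price": 10},
    {"ID": "a", "price": 12},  # dropped: costlier than what was already seen
    {"ID": "a", "price": 7},   # cheaper, but the price-10 item is already out
]
emitted = list(stream_filter(items))
```

Both the price-10 and the price-7 item end up in `emitted`, which is the same unfiltered output you would see from the spider.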

However, if you have a database, you can work around this:

In semi-pseudocode:

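class DbPipeline(object):

    def __init__(self):
        # placeholder: connect to your database here
        self.connection = None

    def process_item(self, item, spider):
        # get/remove/add stand in for whatever calls your database client provides
        db_item = self.connection.get(item['ID'])
        if db_item is None:
            # first time this ID is seen
            self.connection.add(item)
        elif item['price'] < db_item['price']:
            # cheaper than the stored version: replace it
            self.connection.remove(item['ID'])
            self.connection.add(item)
        return item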

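You will still get unfiltered results in the scrapy output, but your database will hold only the lowest-priced version of each product. My personal recommendation would be a database that is keyed by ID, such as a key-value store like redis.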
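For something closer to runnable, here is a rough sketch of the same idea using SQLite from the standard library instead of redis (that swap is my choice, not a requirement). It assumes items carry an `ID` and a numeric `price`, as in the semi-pseudocode above; the class name, table name and `db_path` argument are made up for illustration. `open_spider`, `close_spider` and `process_item` are the regular item pipeline hooks.

```python
import sqlite3


class LowestPricePipeline:
    """Keeps one row per ID in SQLite, holding the lowest price seen so far."""

    def __init__(self, db_path=":memory:"):
        # db_path is an assumption; point it at a file to keep results on disk
        self.db_path = db_path
        self.connection = None

    def open_spider(self, spider):
        self.connection = sqlite3.connect(self.db_path)
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS products (id TEXT PRIMARY KEY, price REAL)"
        )

    def close_spider(self, spider):
        self.connection.commit()
        self.connection.close()

    def process_item(self, item, spider):
        row = self.connection.execute(
            "SELECT price FROM products WHERE id = ?", (item["ID"],)
        ).fetchone()
        # store the item if the ID is new or this price beats the stored one
        if row is None or item["price"] < row[0]:
            self.connection.execute(
                "INSERT OR REPLACE INTO products (id, price) VALUES (?, ?)",
                (item["ID"], item["price"]),
            )
        return item  # the scrapy output itself stays unfiltered
```

Only `id` and `price` are stored here to keep the sketch short; in practice you would add columns for the other fields you need. If your prices are scraped as strings, convert them to numbers before comparing.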

Answer 2

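Do you know the product IDs before you start? If so, a typical site will let you search for a product and sort the results by price, low to high, so you can scrape just the first item returned for each product ID, which avoids any need for pipeline processing.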

If you don't, you can do it in two steps: first crawl all the products to collect the IDs, then run the process above for each ID.
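As a rough illustration of the per-ID search, the helper below builds a search URL sorted by ascending price. Everything site-specific here is hypothetical: the parameter names `q` and `sort`, the value `price_asc` and the example host are placeholders, so check what the real site exposes.

```python
from urllib.parse import urlencode


def cheapest_first_url(search_url, product_id):
    """Build a search URL that lists one product ID, cheapest first."""
    # 'q', 'sort' and 'price_asc' are hypothetical; use the names the site really exposes
    query = urlencode({"q": product_id, "sort": "price_asc"})
    return "%s?%s" % (search_url, query)


url = cheapest_first_url("https://shop.example/search", "AB/123")
```

In the spider you would request this URL for each known ID and yield only the first product on the results page.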

How do I keep only the lowest-priced item in Scrapy?
