How to get a single item from many pages in scrapy?

Asked: 2017-09-25 19:44:20

Tags: python scrapy

Here is my situation:

I want to scrape product details from a product detail page (page A). This page contains a link to a page listing the sellers of the product (page B), and each seller on that list links to yet another page containing the seller's details (page C). Here is an example schema:

Page A:

  • PRODUCT_NAME
  • link to this product's sellers (page B)

Page B:

  • list of sellers, each containing:
    • SELLER_NAME
    • seller_price
    • link to the seller detail page (page C)

Page C:

  • seller_address
Here is the JSON I would like to obtain after scraping:

{
  "product_name": "product1",
  "sellers": [
    {
      "seller_name": "seller1",
      "seller_price": 100,
      "seller_address": "address1",
    },
    (...)
  ]
}

What I have tried: passing the product information from the parse method to a second parse method through the meta object. This works fine across 2 levels, but I have 3, and I want a single item.

Is this possible in scrapy?

EDIT:

As requested, here is a minimized example of what I am trying to do. I know it won't work as expected, but I cannot figure out how to make it return just 1 combined object:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'examplespider'
    allowed_domains = ["example.com"]

    start_urls = [
        'http://example.com/products/product1'
    ]

    def parse(self, response):

        # assume this object was obtained after
        # some xpath processing
        product_name = 'product1'
        link_to_sellers = 'http://example.com/products/product1/sellers'

        yield scrapy.Request(link_to_sellers, callback=self.parse_sellers, meta={
            'product': {
                'product_name': product_name,
                'sellers': []
            }
        })

    def parse_sellers(self, response):
        product = response.meta['product']

        # assume this object was obtained after
        # some xpath processing
        sellers = [
            {
                'seller_name': 'seller1',
                'seller_price': 100,
                'seller_detail_url': 'http://example.com/sellers/seller1',
            },
            {
                'seller_name': 'seller2',
                'seller_price': 100,
                'seller_detail_url': 'http://example.com/sellers/seller2',
            },
            {
                'seller_name': 'seller3',
                'seller_price': 100,
                'seller_detail_url': 'http://example.com/sellers/seller3',
            }
        ]

        for seller in sellers:
            product['sellers'].append(seller)
            yield scrapy.Request(seller['seller_detail_url'], callback=self.parse_seller, meta={'seller': seller})

    def parse_seller(self, response):
        seller = response.meta['seller']

        # assume this object was obtained after
        # some xpath processing
        seller_address = 'seller_address1'

        seller['seller_address'] = seller_address

        yield seller

2 Answers:

Answer 0 (score: 0)

You need to change your logic a bit so that you query one seller address at a time, and once that is done, you move on to the next seller.

def parse_sellers(self, response):
    meta = response.meta

    # assume this object was obtained after
    # some xpath processing
    sellers = [
        {
            'seller_name': 'seller1',
            'seller_price': 100,
            'seller_detail_url': 'http://example.com/sellers/seller1',
        },
        {
            'seller_name': 'seller2',
            'seller_price': 100,
            'seller_detail_url': 'http://example.com/sellers/seller2',
        },
        {
            'seller_name': 'seller3',
            'seller_price': 100,
            'seller_detail_url': 'http://example.com/sellers/seller3',
        }
    ]

    # take the first seller; the rest wait in meta until its
    # detail page has been parsed
    if sellers:
        current_seller = sellers.pop()
        meta['pending_sellers'] = sellers
        meta['current_seller'] = current_seller
        yield scrapy.Request(current_seller['seller_detail_url'], callback=self.parse_seller, meta=meta)
    else:
        yield meta['product']


    # for seller in sellers:
    #     product['sellers'].append(seller)
    #     yield scrapy.Request(seller['seller_detail_url'], callback=self.parse_seller, meta={'seller': seller})

def parse_seller(self, response):
    meta = response.meta
    current_seller = meta['current_seller']
    sellers = meta['pending_sellers']
    # assume this object was obtained after
    # some xpath processing
    seller_address = 'seller_address1'

    current_seller['seller_address'] = seller_address

    meta['product']['sellers'].append(current_seller)
    if sellers:
        current_seller = sellers.pop()
        meta['pending_sellers'] = sellers
        meta['current_seller'] = current_seller

        yield scrapy.Request(current_seller['seller_detail_url'], callback=self.parse_seller, meta=meta)
    else:
        yield meta['product']

But this is not a great approach, because a seller may sell multiple items. So when you reach the same seller again via another item, your request for the seller's address will be rejected by the dupe filter (Scrapy's duplicate request filter). You can fix that by adding dont_filter=True to the request, but that means a lot of unnecessary hits on the website.
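That workaround is just Scrapy's standard dont_filter flag on the Request, e.g.:

yield scrapy.Request(current_seller['seller_detail_url'],
                     callback=self.parse_seller, meta=meta,
                     dont_filter=True)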

So you need to add DB handling directly in your code to check whether you already have a seller's details: if you do, use them; if not, fetch them.
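Below is a minimal sketch of that idea, keeping the sequential flow from the code above but using an in-memory dict as a stand-in for a real DB (seller_cache and next_seller_request are illustrative names, not Scrapy APIs; swap the dict reads/writes for DB queries as needed):

import scrapy

class CachedSellerSpider(scrapy.Spider):
    name = 'cachedsellerspider'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # seller_detail_url -> seller_address; a real DB table
        # would replace this dict
        self.seller_cache = {}

    def next_seller_request(self, meta):
        # Pop pending sellers, reusing cached addresses so that only
        # unseen sellers actually trigger a request.
        while meta['pending_sellers']:
            seller = meta['pending_sellers'].pop()
            url = seller['seller_detail_url']
            if url in self.seller_cache:
                seller['seller_address'] = self.seller_cache[url]
                meta['product']['sellers'].append(seller)
                continue
            meta['current_seller'] = seller
            return scrapy.Request(url, callback=self.parse_seller, meta=meta)
        return None  # every seller resolved, nothing left to fetch

    def parse_seller(self, response):
        meta = response.meta
        current_seller = meta['current_seller']

        # assume this object was obtained after
        # some xpath processing
        current_seller['seller_address'] = 'seller_address1'

        self.seller_cache[current_seller['seller_detail_url']] = \
            current_seller['seller_address']
        meta['product']['sellers'].append(current_seller)

        request = self.next_seller_request(meta)
        if request is not None:
            yield request
        else:
            yield meta['product']

Note that an in-memory cache only deduplicates within a single run; a persistent DB is what lets the lookups survive across crawls.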

Answer 1 (score: 0)

I think a pipeline can help here.

Assume the yielded seller takes the following format (this can be achieved with some trivial modifications to your code; a sketch of that modification follows the example item):

seller = {
    'product_name': 'product1',
    'seller': {
        'seller_name': 'seller1',
        'seller_price': 100,
        'seller_address': 'address1',
    }
}
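For example, the question's parse_seller could yield that shape along these lines (a sketch; it assumes parse and parse_sellers also pass product_name along in each request's meta):

def parse_seller(self, response):
    seller = response.meta['seller']

    # assume this object was obtained after
    # some xpath processing
    seller_address = 'seller_address1'

    # yield one flat item per seller; the pipeline below groups
    # them back together by product_name
    yield {
        'product_name': response.meta['product_name'],
        'seller': {
            'seller_name': seller['seller_name'],
            'seller_price': seller['seller_price'],
            'seller_address': seller_address,
        },
    }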

The following pipeline will collect sellers by their product_name and export them to a file named 'items.jl' after the crawl finishes (note that this is just a sketch of the idea, so it is not guaranteed to work):

import json

class CollectorPipeline(object):

    def __init__(self):
        self.collection = {}

    def open_spider(self, spider):
        self.collection = {}

    def close_spider(self, spider):
        with open("items.jl", "w") as fp:
            for _, product in self.collection.items():
                fp.write(json.dumps(product))
                fp.write("\n")

    def process_item(self, item, spider):
        product = self.collection.get(item["product_name"], dict())
        product["product_name"] = item["product_name"]
        sellers = product.get("sellers", list())
        sellers.append(item["seller"])
        # store the updated product back, otherwise close_spider
        # has nothing to export
        product["sellers"] = sellers
        self.collection[item["product_name"]] = product

        return item

By the way, you need to modify settings.py for the pipeline to take effect, as described in the scrapy documentation.
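For example (the 'myproject.pipelines' module path is an assumption; adjust it to your project layout):

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.CollectorPipeline': 300,
}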