Getting lists of data from different parts of a website with Scrapy

Date: 2018-09-28 02:38:45

Tags: python asynchronous web-scraping scrapy

I need to scrape data from different parts of a website. In the first part I get the customer data and the customer's order ids. With each id I can access the second part and get the item details for that order. So I need to join the results into a "costumer" dictionary holding an "orders" list, where each order holds an "items" list. Basically, my algorithm is:

def parse1(self, response):
    costumer = {'orders': []}
    costumer['data'] = response.xpath("path to costumer data").extract()
    costumer_orders = response.xpath("path to costumer orders")
    for index, costumer_order in enumerate(costumer_orders):
        id = costumer_order.xpath('path to order id').extract_first()
        costumer['orders'].append({'id': id})
        yield scrapy.FormRequest(url="www.url.com/orders" + id, callback=self.parse2,
                                 method='GET', meta={'costumer': costumer, 'index': index})

def parse2(self, response):
    costumer = response.meta['costumer']
    index = response.meta['index']
    costumer['orders'][index]['items'] = []
    for order_item in response.xpath("path to order items"):
        costumer['orders'][index]['items'].append(
            {"items_details": order_item.xpath("path to items details").extract_first()})
    yield costumer

But I can't fit this logic into Scrapy's asynchronous architecture. The closest I've gotten prints the result multiple times for the same customer, once per order. Can anyone help?

1 Answer:

Answer 0 (score: 2):

Since you have requests of type A and type B for a single item, you have two chained requests that need to happen in sequence: first crawl A, then crawl B N times:

customer -> N order pages -> 1 item

So your scraping logic is:

  1. Get customer data
  2. Get order ids
    2.1 Pop an order id
    2.2 Crawl that order id
    2.3 Attach the order details to the customer data from #1
  3. Return customer data with the order data attached

In Scrapy this would look something like:

def parse_customer(self, response):
    # find root customer data; start with an empty orders list we can extend later
    customer = {'orders': []}
    # find order ids
    orders = [1, 2, 3]
    # schedule first order request and start the order scraping loop
    first_order = order_url + str(orders.pop(0))
    yield Request(
        first_order,
        self.parse_orders,
        meta={'orders': orders, 'item': customer},
    )

def parse_orders(self, response):
    item = response.meta['item']
    remaining_orders = response.meta['orders']

    # attach found order details to the root customer item we carry along
    found_orders = ...
    item['orders'].extend(found_orders)

    # on the first pass this is [2, 3]; after the last order it's []
    if not remaining_orders:  # all orders are scraped -> save item
        yield item
        return

    # scrape next order
    next_order = order_url + str(remaining_orders.pop(0))
    yield Request(
        next_order,
        self.parse_orders,
        meta={'orders': remaining_orders, 'item': item},
    )
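
For completeness, here's a minimal self-contained sketch of the same sequential-chaining pattern as a full spider. Every URL and XPath in it is a hypothetical placeholder (the real paths come from your site); the point is the request chain, not the selectors:

import scrapy


class CustomerSpider(scrapy.Spider):
    """Sequential order-scraping loop; all URLs and XPaths below are
    hypothetical placeholders, not taken from the original question."""
    name = 'customers'
    start_urls = ['https://www.example.com/customers/1']  # hypothetical
    order_url = 'https://www.example.com/orders/'         # hypothetical

    def parse(self, response):
        customer = {
            'data': response.xpath('//div[@id="customer"]/text()').extract_first(),
            'orders': [],
        }
        order_ids = response.xpath('//a[@class="order"]/@data-id').extract()
        if not order_ids:
            yield customer  # a customer with no orders is complete as-is
            return
        # start the chain: one order request at a time, carrying the item along
        yield scrapy.Request(
            self.order_url + order_ids.pop(0),
            self.parse_order,
            meta={'order_ids': order_ids, 'item': customer},
        )

    def parse_order(self, response):
        item = response.meta['item']
        remaining = response.meta['order_ids']
        # attach this order's details to the customer item
        item['orders'].append({
            'id': response.url.rsplit('/', 1)[-1],
            'details': response.xpath('//div[@class="detail"]/text()').extract(),
        })
        if not remaining:  # last order scraped -> the item is complete
            yield item
            return
        # schedule the next order in the chain
        yield scrapy.Request(
            self.order_url + remaining.pop(0),
            self.parse_order,
            meta={'order_ids': remaining, 'item': item},
        )

Because only the callback that sees an empty order list yields the item, each customer is emitted exactly once, which is what fixes the duplicated results; the trade-off is that one customer's orders are fetched one at a time rather than in parallel.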