I need to scrape data from different parts of a website. On the first page I get the customer's data and their order IDs. With each ID I can access a second page and get the item details for that order. So I need to join the results into the "costumer" dict, with its "orders" list and "itens" list. Basically, my algorithm is:
def parse1(self, response):
    costumer['data'] = response.xpath("path to costumer data").extract()
    costumer_orders = response.xpath("path to costumer orders")
    for index, costumer_order in enumerate(costumer_orders):
        id = costumer_order.xpath('path to order id').extract_first()
        costumer['orders'].append({'id': id})
        yield scrapy.FormRequest(url="www.url.com/orders"+id, callback=self.parse2, method='GET', meta={'costumer': costumer})

def parse2(self, response):
    costumer = response.meta['costumer']
    costumer['orders']['items'] = []
    for index, order_item in response.xpath("path to order items"):
        costumer['orders']['items'].append({"items_details": order_item.xpath("path to items details").extract_first()})
    yield costumer
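The output I'm after is one item per customer, shaped roughly like this (the values are made up, just to illustrate the nesting):

```python
# Illustrative target structure only; values and ids are made up.
costumer = {
    "data": ["customer data"],
    "orders": [
        {"id": "1", "itens": [{"items_details": "detail A"}]},
        {"id": "2", "itens": [{"items_details": "detail B"}]},
    ],
}
print(len(costumer["orders"]))  # 2
```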
But I can't manage to write this logic with Scrapy's asynchronous architecture. The closest I got, the same customer item was yielded multiple times, once per order. Can anyone help?
Answer 0 (score: 2)
Since you have two types of chained requests, A and B, for one item, the requests have to be executed in order: first crawl A, then crawl B N times:
customer -> N order pages -> 1 item
So your scraping logic is to crawl the order pages sequentially, carrying the partially built item along, and only yield it after the last order. In Scrapy that looks something like:
def parse_customer(self, response):
    # find root customer data
    customer = {'orders': []}
    # find order ids
    orders = [1, 2, 3]
    # schedule first order request and start order scraping loop
    first_order = order_url + str(orders.pop(0))
    yield Request(
        first_order,
        self.parse_orders,
        meta={'orders': orders, 'item': customer},
    )

def parse_orders(self, response):
    item = response.meta['item']
    remaining_orders = response.meta['orders']
    # attach found order details to the root customer item we have
    found_orders = ...
    item['orders'].extend(found_orders)
    # on the first call remaining_orders is [2, 3]; on the last it's []
    if not remaining_orders:  # all orders are scraped -> save item
        yield item
        return
    # scrape next order
    next_order = order_url + str(remaining_orders.pop(0))
    yield Request(
        next_order,
        self.parse_orders,
        meta={'orders': remaining_orders, 'item': item},
    )
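The control flow above — pop one order id per response and yield only when the list is empty — can be checked without Scrapy. This is a plain-Python simulation (`fake_fetch` is a hypothetical stand-in for the HTTP layer, and the recursion stands in for the request/callback chain):

```python
# Simulation of the sequential-chaining pattern; no Scrapy involved.

def fake_fetch(order_id):
    # Hypothetical stand-in: pretend each order page yields one detail dict.
    return [{"items_details": f"details-for-order-{order_id}"}]

def parse_customer():
    customer = {"data": "root customer data", "orders": []}
    order_ids = [1, 2, 3]
    first = order_ids.pop(0)          # schedule the first "request"
    return parse_orders(first, order_ids, customer)

def parse_orders(order_id, remaining, item):
    # attach this order's details to the root item
    item["orders"].extend(fake_fetch(order_id))
    if not remaining:                 # all orders scraped -> "yield" the item
        return item
    next_id = remaining.pop(0)        # schedule the next "request"
    return parse_orders(next_id, remaining, item)

result = parse_customer()
print(len(result["orders"]))  # 3: one entry per order, collected one at a time
```

The customer item is yielded exactly once, which is the behaviour the question was missing.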