Scrapy nested parsing

Time: 2021-02-07 14:33:27

Tags: python web-scraping scrapy

I would like to implement a "sub-parse" in my spider that depends on the URL I receive for each item in the "main parse" method. In the end, all items, including the information obtained from the sub-requests, should be placed in a single "container item".

Suppose the items are defined as follows:

import scrapy

class UnitContainer(scrapy.Item):
    name = scrapy.Field()
    unit_list = scrapy.Field()

class Unit(scrapy.Item):
    name = scrapy.Field()
    foreign_info = scrapy.Field()

My parsing attempt looks like this:

def parse(self, response):
    container = UnitContainer()
    unit_list = []

    container["name"] = response.xpath("xpath_to_container_name_value").extract_first()

    for row in response.xpath("xpath_to_unit_lines"):
        unit = Unit()
        unit["name"] = row.xpath("xpath_to_unit_name_value").extract_first()
        foreign_url = row.xpath("xpath_to_foreign_url_with_more_info").extract_first()

        unit = yield scrapy.Request(url=foreign_url, callback=self.parse_foreign,
                                    meta={'unit': unit})
        unit_list.append(unit)

    container["unit_list"] = unit_list
    yield container

def parse_foreign(self, response):
    unit = response.meta.get('unit')
    unit['foreign_info'] = response.xpath("xpath_to_other_info").extract_first()
    yield unit

My problem with this code is that the output of the sub-parse does not come back to the main parse but goes straight to the engine. As a result, I get a container with null values, and the correctly parsed units then appear as separate items in the output. This example should make the output problem clear:

[
{"name": "MyContainer", "unit_list": [null, null, null]},
{"name": "Unit1", "foreign_info": "blablabla1"},
{"name": "Unit2", "foreign_info": "blablabla2"},
{"name": "Unit3", "foreign_info": "blablabla3"}
]
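
As far as I understand, the Scrapy engine consumes whatever a callback yields (Requests are scheduled, items go to the pipelines) and resumes the generator with None, so the assignment inside my loop never receives the sub-parsed unit:

# Loop body from the attempt above: yielding the Request schedules it,
# but nothing is sent back into the generator, so `unit` is rebound to None here.
unit = yield scrapy.Request(url=foreign_url, callback=self.parse_foreign,
                            meta={'unit': unit})
unit_list.append(unit)   # appends None
# The original Unit still travels through response.meta, which is why it later
# shows up as a separate item yielded by parse_foreign.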

I think this is because I am using yield incorrectly, but I have not found the right way to solve this.
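
Is something along these lines the intended way to handle it? A minimal sketch of what I have in mind (it assumes cb_kwargs, i.e. Scrapy >= 1.7, and the helper names request_next and parse_foreign_chain are made up): collect the foreign URLs first, chain the sub-requests one after another while passing the partially filled container along, and yield the container only after the last foreign page has been parsed.

def parse(self, response):
    container = UnitContainer()
    container["name"] = response.xpath("xpath_to_container_name_value").extract_first()
    container["unit_list"] = []

    # Collect (name, url) pairs for every unit row up front.
    pending = []
    for row in response.xpath("xpath_to_unit_lines"):
        name = row.xpath("xpath_to_unit_name_value").extract_first()
        url = row.xpath("xpath_to_foreign_url_with_more_info").extract_first()
        pending.append((name, url))

    yield from self.request_next(container, pending)

def request_next(self, container, pending):
    # Yield the finished container once nothing is left to fetch,
    # otherwise request the next foreign page.
    if not pending:
        yield container
        return
    name, url = pending[0]
    yield scrapy.Request(
        url,
        callback=self.parse_foreign_chain,
        cb_kwargs={"container": container, "pending": pending[1:], "name": name},
    )

def parse_foreign_chain(self, response, container, pending, name):
    unit = Unit()
    unit["name"] = name
    unit["foreign_info"] = response.xpath("xpath_to_other_info").extract_first()
    container["unit_list"].append(unit)
    # Chain to the next foreign URL (or emit the container if done).
    yield from self.request_next(container, pending)

I realize this makes the sub-requests sequential rather than concurrent, which I could live with if it is otherwise the right approach.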

0 Answers:

No answers yet.