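Scrapy: creating a complex structure (dict within a dict) in the parsed result

Time: 2016-07-31 11:50:49

Tags: parsing scrapy

I have several Item classes that describe the attributes of an object: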

import scrapy


class FullName(scrapy.Item):
    first = scrapy.Field()
    second = scrapy.Field()
    middle = scrapy.Field()

class Physical(scrapy.Item):
    growth = scrapy.Field()
    weight = scrapy.Field()
    hair = scrapy.Field()

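I also have a main Item that represents the subject itself. I want some of its fields to hold the attribute Items defined above: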

class Human(scrapy.Item):
    sex = scrapy.Field()
    age = scrapy.Field()
    physical = <...Physical Item>
    full_name = <...FullName Item>

When the data is exported, it should then have the following nested structure:

{
age: 23,
sex: male,
full_name: {first: test, second: test, middle: test},
physical: {growth: 90, height: 190, hair: blonde},
...
}

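The nesting can go to any depth.

Can I do this with Scrapy? How should the spider be structured? I could not find anything about it in the Scrapy documentation on extending item loaders.

Or have I chosen the wrong tool, and do I need to do this by hand?

UPD. About the spider.

What should the structure of the spider be? As I understand it, the "physical" field has to be tied to the PhysicalSpider spider, which would be passed the current URL. How can that be done? Please help.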

class PhysicalSpider(scrapy.Spider):
    name = "physical"

    def parse(self, response):
        item = Physical()
        item['weight'] = response.xpath('path').extract()
        yield item

class HumanSpider(scrapy.Spider):
    name = "human"
    start_urls = [
        "url1",
        "url2",
    ]

    def parse(self, response):
        item = Human()
        item['sex'] = response.xpath('path').extract()
        item['age'] = response.xpath('path')[1].extract()
        item['physical'] = PhysicalSpider(???)
        yield item

1 answer:

Answer 0 (score: 1)

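import scrapy


class Physical(scrapy.Item):
    height = scrapy.Field()


class Human(scrapy.Item):
    sex = scrapy.Field()
    age = scrapy.Field()
    physical = scrapy.Field()  # an ordinary Field can hold another Item
    full_name = scrapy.Field()


def build_human():
    # build_human is only an illustrative wrapper; in a spider this body
    # would live in a callback such as parse().
    p = Physical()
    p['height'] = 180
    h = Human()
    h['physical'] = p  # nest the Physical item inside the Human item
    h['sex'] = 'yes'
    return h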

Result:

{'physical': {'height': 180}, 'sex': 'yes'}
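The nested output above follows from the fact that a scrapy.Item behaves like a mapping, so an item stored in a field of another item is seen as a dict inside a dict, at any depth. The sketch below illustrates that idea without importing Scrapy. StandInItem is a hypothetical stand-in for scrapy.Item and to_plain is a hypothetical helper, both introduced here only so the snippet runs on its own.

```python
import json
from collections.abc import Mapping


def to_plain(value):
    """Recursively turn mapping-like objects into plain dicts."""
    if isinstance(value, Mapping):
        return {key: to_plain(val) for key, val in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_plain(val) for val in value]
    return value


class StandInItem(Mapping):
    """Hypothetical read-only stand-in for scrapy.Item (not Scrapy code)."""

    def __init__(self, **fields):
        self._fields = fields

    def __getitem__(self, key):
        return self._fields[key]

    def __iter__(self):
        return iter(self._fields)

    def __len__(self):
        return len(self._fields)


# Items nested two levels deep, mirroring the structure in the question.
human = StandInItem(
    age=23,
    sex="male",
    full_name=StandInItem(first="test", second="test", middle="test"),
    physical=StandInItem(height=190, hair="blonde"),
)

plain = to_plain(human)
print(json.dumps(plain, sort_keys=True))
```

With real items, recent Scrapy versions ship the itemadapter package, whose ItemAdapter(item).asdict() should perform a comparable recursive conversion. Treat that as a pointer to verify against your installed version rather than a guarantee.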

Based on your spider example:

class HumanSpider(scrapy.Spider):
    name = "human"
    start_urls = [
        "url1",
    ]

    def parse(self, response):
        # Human and Physical are the Item classes defined above
        # (Human must declare an `age` field for the assignment below).
        item = Human()
        item['sex'] = response.xpath('path').extract()
        item['age'] = response.xpath('path')[1].extract()
        # Build the nested item from the same response and attach it.
        physical_item = Physical()
        physical_item['height'] = response.xpath('path').extract()
        item['physical'] = physical_item
        yield item