Scrapy ItemLoaders: returning a list of items

Date: 2015-03-15 01:55:59

Tags: scrapy scrapy-spider

def parse(self, response):
    for link in LinkExtractor(restrict_xpaths="BLAH").extract_links(response)[:-1]:
        yield Request(link.url)
    # "some xpath" below is a placeholder for the real expressions
    l = MytemsLoader()
    l.add_value('main1', some xpath)
    l.add_value('main2', some xpath)
    l.add_value('main3', some xpath)

    rows = response.xpath("table[@id='BLAH']/tbody[contains(@id, 'BLOB')]")
    for row in rows:
        # Problem: the same loader l is reused for every row
        l.add_value('table1', some xpath based on rows)
        l.add_value('table2', some xpath based on rows)
        l.add_value('main3', some xpath based on rows)
        yield l.load_item()

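I am using an item loader because I want to pre-process the fields and handle any empty values easily. Each row of the table should become one entity that has the fields main1, main2, main3 and so on, plus its own fields. However, the code above overwrites the item loader l and returns only the last row of each main page.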

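Question: How can I use item loaders to combine the main-page data with each table-row entry? If I use two item loaders, one for each part, how can they be combined?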
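The overwriting described above can be illustrated without Scrapy. The following is a minimal sketch, not Scrapy's implementation: ToyLoader, shared_loader and loader_per_row are hypothetical names, and the toy loader simply collects values per field in lists. It assumes, as with a real loader, that load_item() fills in and returns the one item object the loader holds.

```python
from collections import defaultdict


class ToyLoader:
    """Hypothetical stand-in for an item loader: collects values per field."""

    def __init__(self, item=None):
        # The loader writes into the item it was given (or a new dict).
        self.item = {} if item is None else item
        self.values = defaultdict(list)

    def add_value(self, field, value):
        self.values[field].append(value)

    def load_item(self):
        # Populate and return the loader's own item object.
        for field, collected in self.values.items():
            self.item[field] = list(collected)
        return self.item


rows = ["row-a", "row-b"]


def shared_loader():
    # One loader for the page and all rows: values pile up in one item.
    loader = ToyLoader()
    loader.add_value("main1", "page")
    for row in rows:
        loader.add_value("table1", row)
        yield loader.load_item()


def loader_per_row():
    # Page-level loader first, then a fresh loader per row,
    # each seeded with a copy of the page-level item.
    main = ToyLoader()
    main.add_value("main1", "page")
    main_item = main.load_item()
    for row in rows:
        loader = ToyLoader(item=dict(main_item))
        loader.add_value("table1", row)
        yield loader.load_item()


shared = list(shared_loader())
separate = list(loader_per_row())
```

In this toy, both entries of shared are the same dictionary, which has absorbed the values of every row, while separate holds one independent dictionary per row.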

For future reference:

def newparse(self, response):
    for link in LinkExtractor(restrict_xpaths="BLAH").extract_links(response)[:-1]:
        yield Request(link.url)
    # Load the page-level fields once
    ml = MyitemLoader()
    ml.add_value('main1', some xpath)
    ml.add_value('main2', some xpath)
    ml.add_value('main3', some xpath)
    main_item = ml.load_item()
    rows = response.xpath("table[@id='BLAH']/tbody[contains(@id, 'BLOB')]")
    for row in rows:
        # New loader per row, seeded with the page-level item
        bl = MyitemLoader(item=main_item, selector=row)
        bl.add_value('table1', some xpath based on row)
        bl.add_value('table2', some xpath based on row)
        bl.add_value('main3', some xpath based on row)
        yield bl.load_item()

1 answer:

Answer 0 (score: 4)

You need to instantiate a new ItemLoader inside the loop and pass it the item argument:

# Load the page-level fields once
l = MytemsLoader()
l.add_value('main1', some xpath)
l.add_value('main2', some xpath)
l.add_value('main3', some xpath)
item = l.load_item()

rows = response.xpath("table[@id='BLAH']/tbody[contains(@id, 'BLOB')]")
for row in rows:
    # Fresh loader for each row, starting from the page-level item
    l = MytemsLoader(item=item)

    l.add_value('table1', some xpath based on rows)
    l.add_value('table2', some xpath based on rows)
    l.add_value('main3', some xpath based on rows)

    yield l.load_item()