How do I use an Item Pipeline when parse uses an Item Loader?

Time: 2017-01-10 13:51:31

Tags: python scrapy

When parse returns an item through the Item Loader's load_item method, the Item Pipeline does not take effect:

def parse(self,response):
    DIV_SELECTOR = '.Content'
    SET_SELECTOR = '.Meta'

    for div in response.css(DIV_SELECTOR):            
        rowSelector = div.css(SET_SELECTOR)
        ItemAAA= ItemLoader(item=ItemAAA(), selector=rowSelector)
        ItemAAA.add_css('name','a ::text')
        ItemAAA.add_css('url','a ::attr(href)')
        return ItemAAA.load_item()

Scrapy does recognize the pipeline:

2017-01-10 18:25:48     [scrapy.middleware] INFO: Enabled item pipelines:  ['pipeline.DuplicatesPipeline']

When the parse function yields a dict instead, the pipeline works:

def parse(self,response):
    for  tt in response.css(SET_SELECTOR):
            NAME_SELECTOR = 'a ::text'
            yield { 'name': tt.css(NAME_SELECTOR).extract_first(),
                   }

Pipeline.py

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

   def __init__(self):
       self.ids_seen = set()

   def process_item(self, item, spider):
       if item['name'] in self.ids_seen:
          raise DropItem("Duplicate item found: %s" % item)
       else:
          self.ids_seen.add(item['name'])
          return item

I am using Python 3.5.2 and Scrapy 1.3, installed through Anaconda, on Windows 7.

1 answer:

Answer 0: (score: 0)

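Because the return statement breaks out of the loop on its first iteration, your parse() method probably returns only one item. To fix this, turn the method into a generator by using yield instead of return: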

yield
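
To make the difference concrete, here is a minimal sketch in plain Python with no Scrapy involved. The `rows` list and the two function names are hypothetical stand-ins for the spider's loop over `response.css(DIV_SELECTOR)`; they are not part of the question's code. In the spider itself, the equivalent change would presumably be replacing `return ItemAAA.load_item()` with `yield ItemAAA.load_item()`.

```python
# Hypothetical stand-in for the rows the spider would select from the page.
rows = [{"name": "a"}, {"name": "b"}, {"name": "c"}]


def parse_with_return(rows):
    """Mimics the question's parse(): return sits inside the loop."""
    for row in rows:
        return row  # leaves the function during the first iteration


def parse_with_yield(rows):
    """Mimics the suggested fix: yield turns the function into a generator."""
    for row in rows:
        yield row  # hands over one item, then the loop continues


first_only = parse_with_return(rows)
all_items = list(parse_with_yield(rows))

print(first_only)  # only the first row
print(all_items)   # every row, in order
```

With `return`, only the first row ever leaves the function, so a pipeline would see at most one item per response. With `yield`, each row is handed over in turn, which is also how the dict-based parse in the question behaves.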