Scrapy middleware to skip some pages

Time: 2016-06-08 22:59:35

Tags: python scrapy

I want to get data only from pages whose "name" contains a certain pattern; I want to skip all other pages.

This is what I have now:

def parse_item(self, response):
    item = Item()
    # extract_first() returns the first match, or the default if there is none
    item['name'] = response.xpath('//title//text()').extract_first(default='')
    if "pattern" not in item['name']:
        return []
    return item

How can I do this as a middleware?

2 answers:

Answer 0: (score: 2)

You should use a Downloader Middleware specifically, because it provides process_response.

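from scrapy.exceptions import IgnoreRequest

class SkipMiddleware(object):
    def process_response(self, request, response, spider):
        if spider.name == 'myspider' and request.callback == spider.parse_item:
            # default='' avoids a TypeError when the page has no <title>
            title = response.xpath('//title//text()').extract_first(default='')
            if 'pattern' not in title:
                # Drop the response before it reaches the spider callback
                raise IgnoreRequest
        return response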

Remember to activate it in the DOWNLOADER_MIDDLEWARES setting.
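Activation might look like the following sketch. The module path myproject.middlewares is an assumption (use wherever SkipMiddleware actually lives), and 543 is only the order value used in Scrapy's documentation example, not a requirement.

```python
# settings.py
# "myproject.middlewares" is a hypothetical module path; adjust it to the
# module that actually defines SkipMiddleware in your project.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.SkipMiddleware": 543,
}
```

The number controls where the middleware sits relative to the built-in ones, so it may need tuning if the check depends on other middlewares having processed the response first.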

Answer 1: (score: 0)

Sadly, I am answering my own question, but what can I do...

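    def process_response(self, request, response, spider):
        # Assumes a CrawlSpider, which keeps its compiled rules in _rules
        if not spider._rules:
            return response
        # CrawlSpider stores the index of the matching rule in request.meta
        rule_index = request.meta.get('rule')

        response_callback = None
        if rule_index is not None:
            rule = spider._rules[rule_index]
            response_callback = rule.callback

        # PARSE_FUNCTION is a custom setting holding the callback's name;
        # self.settings has to be set by the middleware itself
        if response_callback and response_callback == getattr(spider, self.settings['PARSE_FUNCTION']):
            pass  # do something
        return response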
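The method above reads self.settings, which a middleware does not have unless it stores it. A minimal sketch of one way to wire that up is Scrapy's from_crawler hook, which receives the crawler and can pass crawler.settings to the constructor. The class name CallbackAwareMiddleware and the helper target_callback are hypothetical, and PARSE_FUNCTION is the custom setting assumed by this answer (for example PARSE_FUNCTION = 'parse_item' in settings.py).

```python
class CallbackAwareMiddleware:
    """Hypothetical middleware skeleton that keeps a reference to the settings."""

    def __init__(self, settings):
        self.settings = settings

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this factory, when it is defined, to build the middleware
        return cls(crawler.settings)

    def target_callback(self, spider):
        # Look up the spider method named by the custom PARSE_FUNCTION setting;
        # returns None if the spider has no such attribute
        return getattr(spider, self.settings["PARSE_FUNCTION"], None)
```

With this in place, process_response can compare rule.callback against self.target_callback(spider) instead of repeating the getattr call inline.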