I want to scrape data only from pages whose "name" contains a certain pattern, and skip all other pages.
This is what I have right now:
def parse_item(self, response):
    item = Item()
    # extract_first() returns None when nothing matches, so default to ''
    item['name'] = response.xpath('//title//text()').extract_first(default='')
    if "pattern" not in item['name']:
        return []
    return item
How can I do this as a middleware instead?
Answer 0: (score: 2)
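You should use a Downloader Middleware specifically, because it provides the process_response hook, which sees every response before it reaches the spider callback.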
from scrapy.exceptions import IgnoreRequest

class SkipMiddleware(object):
    def process_response(self, request, response, spider):
        if spider.name == 'myspider' and request.callback == spider.parse_item:
            # extract_first() returns None when there is no <title>, so default to ''
            title = response.xpath('//title//text()').extract_first(default='')
            if 'pattern' not in title:
                raise IgnoreRequest
        return response
Remember to activate it in the DOWNLOADER_MIDDLEWARES setting.
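For reference, activation goes in the project's settings.py. This is only a sketch: the module path myproject.middlewares is hypothetical (use wherever SkipMiddleware actually lives), and 543 is an arbitrary priority, not a required value.

```python
# settings.py
# Assumption: SkipMiddleware is defined in myproject/middlewares.py
# (hypothetical path). The number is the middleware's order; 543 is
# just an example value.
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.SkipMiddleware': 543,
}
```

Setting the value to None instead of a number disables a middleware.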
Answer 1: (score: 0)
Sadly I am answering my own question, but what can I do...
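def process_response(self, request, response, spider):
    # Only CrawlSpider subclasses have compiled rules in _rules
    if not getattr(spider, '_rules', None):
        return response
    # CrawlSpider stores the index of the matching rule in meta['rule']
    rule_index = request.meta.get('rule')
    response_callback = None
    if rule_index is not None:
        rule = spider._rules[rule_index]
        response_callback = rule.callback
    # PARSE_FUNCTION is a custom setting holding the callback's name
    if response_callback and response_callback == getattr(spider, self.settings['PARSE_FUNCTION']):
        pass  # do something here, e.g. raise IgnoreRequest to skip the page
    return response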
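The snippet above reads self.settings, which Scrapy does not set on a middleware by itself. One way to provide it, as a rough sketch, is the from_crawler class method that Scrapy calls when it builds a middleware. The class name RuleCallbackMiddleware and the stand-in crawler object are hypothetical and only there so the example runs outside Scrapy.

```python
from types import SimpleNamespace


class RuleCallbackMiddleware(object):  # hypothetical name
    def __init__(self, settings):
        # Keep the settings so process_response can read PARSE_FUNCTION
        self.settings = settings

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook, if it is defined, to build the middleware
        return cls(crawler.settings)


# Quick check outside Scrapy with a stand-in crawler (assumption: a plain
# dict is enough to mimic the settings lookup used in process_response)
fake_crawler = SimpleNamespace(settings={'PARSE_FUNCTION': 'parse_item'})
middleware = RuleCallbackMiddleware.from_crawler(fake_crawler)
```

In a real project PARSE_FUNCTION would be defined in settings.py and the process_response method from this answer would go on the same class.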