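I noticed that the `rule`s of a `CrawlSpider` extract URLs on every non-leaf page.
Can I enable a `rule` only when the current page meets some condition (for example, its URL matches a regular expression)?
I have two pages: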
-------------------Page A-------------------
Page URL: http://www.site.com/pattern-match.html
--------------------------------------------
- [link](http://should-extract-this)
- [link](http://should-extract-this)
- [link](http://should-extract-this)
--------------------------------------------
--------------------Page B--------------------
Page URL: http://www.site.com/pattern-not-match.html
-----------------------------------------------
- [link](http://should-not-extract-this)
- [link](http://should-not-extract-this)
- [link](http://should-not-extract-this)
-----------------------------------------------
So the rule should extract URLs only from Page A. How can I do that? Thanks!
Answer 0 (score: 1)
I just found a dirty way to inject the `response` into the `rule`.
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Written for Python 2 and the legacy scrapy.contrib import paths.
    import inspect

    from scrapy.http import Request, HtmlResponse
    from scrapy.contrib.spiders import CrawlSpider, Rule


    class MyCrawlSpider(CrawlSpider):

        def _requests_to_follow(self, response):
            if not isinstance(response, HtmlResponse):
                return
            seen = set()
            for n, rule in enumerate(self._rules):
                links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
                if links and rule.process_links:
                    links = rule.process_links(links)
                seen = seen.union(links)
                for link in links:
                    r = Request(url=link.url, callback=self._response_downloaded)
                    r.meta.update(rule=n, link_text=link.text)

                    # ***>>> HACK <<<***
                    # pass `response` as additional argument to `process_request`
                    fun = rule.process_request
                    if not hasattr(fun, 'nargs'):
                        fun.nargs = len(inspect.getargs(fun.func_code).args)
                    if fun.nargs == 1:
                        yield fun(r)
                    elif fun.nargs == 2:
                        yield fun(r, response)
                    else:
                        raise Exception('too many arguments')
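The argument-count dispatch at the heart of the hack can also be looked at in isolation. The following is only a sketch for Python 3, not part of the original answer: `call_with_optional_response`, `keep_all` and `keep_if_magick` are hypothetical names, and plain strings stand in for the Scrapy request and response objects. It uses `inspect.signature` in place of `func_code`, which is only available on Python 2.

```python
import inspect


def call_with_optional_response(fun, request, response):
    """Call fun(request) or fun(request, response), depending on its arity."""
    positional_kinds = (
        inspect.Parameter.POSITIONAL_ONLY,
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
    )
    # Count only the positional parameters of the callable.
    nargs = sum(
        1
        for p in inspect.signature(fun).parameters.values()
        if p.kind in positional_kinds
    )
    if nargs == 1:
        return fun(request)
    if nargs == 2:
        return fun(request, response)
    raise TypeError('process_request must take 1 or 2 positional arguments')


def keep_all(request):
    # One-argument style: never sees the originating page.
    return request


def keep_if_magick(request, response):
    # Two-argument style: `response` is a plain URL string in this sketch.
    return request if 'magick' in response else None


print(call_with_optional_response(keep_all, 'req', 'http://www.test.com/'))
print(call_with_optional_response(keep_if_magick, 'req', 'http://www.test.com/magick'))
```

Unlike the attribute caching used in the hack, this version does not store anything on the callable, so it should also accept bound methods, for which `inspect.signature` leaves out `self`.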
To try it out, pass a two-argument `process_request` to a `Rule`:
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


    def process_request(request, response):
        # Keep the request only if the page it was extracted from matches;
        # returning None drops it.
        if 'magick' in response.url:
            return request


    class TestSpider(MyCrawlSpider):
        name = 'test'
        allowed_domains = ['test.com']
        start_urls = ['http://www.test.com']
        rules = [
            Rule(SgmlLinkExtractor(restrict_xpaths='//a'), callback='parse_item', process_request=process_request),
        ]

        def parse_item(self, response):
            print(response.url)
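For the regex condition asked about in the question, the `'magick'` substring check can be swapped for a compiled pattern. This is a minimal sketch rather than a tested spider: `follow_only_from` is a hypothetical helper name, and `SimpleNamespace` objects stand in for Scrapy's request and response, since only the `.url` attribute is used.

```python
import re
from types import SimpleNamespace


def follow_only_from(pattern):
    """Build a process_request(request, response) callable that keeps a
    request only when the page it was extracted from matches `pattern`."""
    compiled = re.compile(pattern)

    def process_request(request, response):
        # Returning None drops the request, so links found on
        # non-matching pages are never followed.
        return request if compiled.search(response.url) else None

    return process_request


# Stand-ins for the two pages from the question and one extracted link.
page_a = SimpleNamespace(url='http://www.site.com/pattern-match.html')
page_b = SimpleNamespace(url='http://www.site.com/pattern-not-match.html')
link = SimpleNamespace(url='http://should-extract-this')

only_from_a = follow_only_from(r'/pattern-match\.html$')
print(only_from_a(link, page_a) is link)  # True
print(only_from_a(link, page_b))          # None
```

The returned callable has the same two-argument shape as `process_request` above, so it could be passed as `Rule(..., process_request=follow_only_from(...))`. As far as I know, Scrapy 2.0 and later pass the originating response to `process_request` as a second argument on their own, in which case the `_requests_to_follow` override would no longer be needed.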