I am using Scrapy to crawl an entire website, including images, CSS, JavaScript and external links. I have noticed that Scrapy's default CrawlSpider only processes HTML responses and ignores external links. I tried overriding the _requests_to_follow method and removing the check at the beginning, but that did not work. I also tried using a process_request method that allows every request, but that failed as well. Here is my code:
from scrapy import Request
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from myproject.items import Node, Link  # my Item classes (module path shortened here)


class MySpider(CrawlSpider):
    name = 'myspider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=False,
             process_request='process_request'),
    )

    def parse_item(self, response):
        # One Node per fetched resource: its URL and MIME type.
        node = Node()
        node['location'] = response.url
        node['content_type'] = response.headers['Content-Type']
        yield node

        # One Link per edge: the referring page -> this resource.
        # (.get() because responses for start_urls carry no Referer header.)
        link = Link()
        link['source'] = response.request.headers.get('Referer')
        link['destination'] = response.url
        yield link

    def process_request(self, request):
        # Allow everything
        return request

    def _requests_to_follow(self, response):
        # The stock implementation begins with
        #     if not isinstance(response, HtmlResponse):
        #         return
        # I removed that check so non-HTML responses are followed too.
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [l for l in rule.link_extractor.extract_links(response)
                     if l not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = Request(url=link.url, callback=self._response_downloaded)
                r.meta.update(rule=n, link_text=link.text)
                yield rule.process_request(r)
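In case it is relevant: I am aware that the default LinkExtractor only scans a and area tags and filters out common asset file extensions, so to actually pick up image, CSS and JavaScript URLs I would presumably also need something along these lines (a sketch only; the tags, attrs and deny_extensions values are my guesses, untested):

from scrapy.linkextractors import LinkExtractor

# Sketch: widen the extractor to asset-bearing tags and clear the
# extension blacklist so .css/.js/.png/... URLs are not dropped.
asset_extractor = LinkExtractor(
    tags=('a', 'area', 'img', 'script', 'link'),
    attrs=('href', 'src'),
    deny_extensions=[],
)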
The idea is to build a graph of the domain. That is why my parse_item yields a Node object holding each resource's location and content type, plus a Link object recording the relationship between nodes. External pages should have their Node and Link information captured, but they should not, of course, be crawled any further.
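For completeness, Node and Link are ordinary items, roughly like this (the field names come from parse_item above; assuming plain scrapy.Item subclasses):

import scrapy

class Node(scrapy.Item):
    # One graph node per fetched resource.
    location = scrapy.Field()
    content_type = scrapy.Field()

class Link(scrapy.Item):
    # One directed edge: referring page -> fetched resource.
    source = scrapy.Field()
    destination = scrapy.Field()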
Thanks in advance for your help.