Crawl a site with Scrapy, including non-HTML items

Posted: 2015-03-23 20:54:38

Tags: python web-crawler scrapy scrapy-spider

I am using Scrapy to crawl an entire website, including images, CSS, JavaScript, and external links. I have noticed that Scrapy's default CrawlSpider only processes HTML responses and ignores external links. I tried overriding the method _requests_to_follow and removing the check at its beginning, but that did not work. I also tried using a process_request method to allow every request, but that failed too. Here is my code:

from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# Node and Link are scrapy.Item subclasses defined in the project's items module.


class MySpider(CrawlSpider):
    name = 'myspider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']

    # follow=True so links are also extracted from the pages this rule fetches,
    # not just from the start URLs (the goal is the whole site)
    rules = (Rule(LinkExtractor(), callback='parse_item', follow=True,
                  process_request='process_request'),)

    def parse_item(self, response):
        node = Node()
        node['location'] = response.url
        node['content_type'] = response.headers['Content-Type']
        yield node

        link = Link()
        # responses for start_urls carry no Referer header, so use .get()
        # to avoid a KeyError
        link['source'] = response.request.headers.get('Referer')
        link['destination'] = response.url
        yield link

    def process_request(self, request):
        # Allow everything
        return request

    def _requests_to_follow(self, response):
        # The stock implementation starts with a check that the response is
        # an HtmlResponse; that check has been removed here
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [l for l in rule.link_extractor.extract_links(response)
                     if l not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = Request(url=link.url, callback=self._response_downloaded)
                r.meta.update(rule=n, link_text=link.text)
                yield rule.process_request(r)

The idea is to build a graph of the domain, which is why my parse_item yields a Node object with the resource's location and content type, plus a Link object to keep track of the relations between nodes. External pages should have their node and link information collected, but of course they should not be crawled further.

Thanks in advance for your help.

0 Answers:

There are no answers yet.