CrawlSpider: parsing and adding links from XML pages along the way

Asked: 2016-03-19 09:50:06

Tags: scrapy scrapy-spider

I have created a CrawlSpider tailored to my needs and it works flawlessly. However, some (not all) of the categories on the site I am crawling have XML sitemaps. I would like the spider to parse the .xml sitemaps for those categories and extract their links, then leave it to the CrawlSpider to dig deeper into those links.

I know there is a SitemapSpider and an XMLFeedSpider, but what I need is the CrawlSpider functionality combined with the XMLFeedSpider, or vice versa.

Any help would be greatly appreciated.

2 Answers:

Answer 0 (score: 3):

To get a CrawlSpider working with URLs from sitemaps, you can write a custom link extractor for XML responses, but it looks like CrawlSpider does not process XML responses. So you will also need to override _requests_to_follow to accept them.

Below is an example spider I tried, starting from a sitemap.gz URL (containing a sitemapindex):

from scrapy.spiders.crawl import CrawlSpider, Rule
from scrapy.link import Link
from scrapy.http import Request


class XmlLinkExtractor:
    """Minimal link extractor: pull URLs out of an XML response with an XPath."""

    def __init__(self, xpath, namespaces):
        self.xpath = xpath
        self.namespaces = namespaces

    def extract_links(self, response):
        selector = response.selector
        if self.namespaces:
            # register namespace prefixes so the XPath below can use them
            for prefix, ns in self.namespaces.items():
                selector.register_namespace(prefix, ns)
        for url in selector.xpath(self.xpath).extract():
            yield Link(url)


class ExampleSitemapCrawlSpider(CrawlSpider):
    name = "myspider"
    start_urls = (
        # link to a sitemap index file
        'http://www.example.com/sitemap.gz',

        # link to a sitemap file
        #'http://www.example.com/sitemaps/sitemap-general.xml',
        )
    rules = (

        # this handles sitemap indexes, following links to other sitemaps
        Rule(XmlLinkExtractor('/sm:sitemapindex/sm:sitemap/sm:loc/text()',
                {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}),),

        # this is for "leaf" pages in sitemaps
        Rule(XmlLinkExtractor('/sm:urlset/sm:url/sm:loc/text()',
                {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}),
            # here, defining a callback without follow=True
            # makes the crawler stop at the level of these pages,
            # not following deeper links;
            # unset the callback if you want those pages
            # to go through other rules once downloaded
            callback='parse_loc'),
        # ... other rules
    )

    def _requests_to_follow(self, response):
        # we need to override `_requests_to_follow`
        # and comment out these 2 lines, because they
        # filter out non-HTML responses such as our XML ones
        #if not isinstance(response, HtmlResponse):
        #    return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = Request(url=link.url, callback=self._response_downloaded)
                r.meta.update(rule=n, link_text=link.text)
                yield rule.process_request(r)

    def parse_loc(self, response):
        self.logger.debug("parsing %r" % response)

Depending on how you want to parse the pages from /urlset/url/loc, you may want to route different URLs to different callbacks: add different rules, and customize XmlLinkExtractor to allow filtering (or filter with XPath directly), as sketched below.
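For example, a filtering variant of the XmlLinkExtractor above could accept a regular expression and drop non-matching URLs before they reach the rules. This is a minimal sketch assuming Python 3; the allow parameter, the FilteringXmlLinkExtractor class, and the parse_category callback are illustrative names, not Scrapy API:

import re


class FilteringXmlLinkExtractor(XmlLinkExtractor):
    # illustrative subclass: keep only links whose URL matches `allow`

    def __init__(self, xpath, namespaces, allow=None):
        super().__init__(xpath, namespaces)
        self.allow = re.compile(allow) if allow is not None else None

    def extract_links(self, response):
        for link in super().extract_links(response):
            if self.allow is None or self.allow.search(link.url):
                yield link

A rule using it could then send, say, category URLs to their own (hypothetical) callback:

Rule(FilteringXmlLinkExtractor('/sm:urlset/sm:url/sm:loc/text()',
        {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"},
        allow=r'/category/'),
    callback='parse_category'),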

Answer 1 (score: 0):

You can just add a rule to your current CrawlSpider and parse the XML yourself: add a rule at the top of rules and adjust sitemap_xml_xpath in the callback.

import scrapy
import scrapy.linkextractors
import scrapy.spiders.crawl


class SmartlipoSpider(scrapy.spiders.crawl.CrawlSpider):
    name = "myspider"
    start_urls = ('http://example.com/',)
    rules = (
        scrapy.spiders.crawl.Rule(
            scrapy.linkextractors.LinkExtractor(
                allow=r'sitemap\.xml$',
            ),
            callback='parse_sitemap_xml', follow=True,
        ),
        # the other rules...
    )

    def parse_sitemap_xml(self, response):
        # sitemap files use a default XML namespace; register a prefix
        # for it so the XPath below actually matches
        response.selector.register_namespace(
            'sm', 'http://www.sitemaps.org/schemas/sitemap/0.9')
        sitemap_xml_xpath = '/sm:urlset/sm:url/sm:loc/text()'

        for url in response.xpath(sitemap_xml_xpath).extract():
            # request each URL; with no explicit callback, the response
            # goes through the CrawlSpider rules as usual
            yield scrapy.Request(url)

    # your other callbacks...
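For a quick test, the spider can be run straight from a single file with Scrapy's runspider command (assuming it is saved as, say, myspider.py):

scrapy runspider myspider.py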