I have created a CrawlSpider for my needs and it works perfectly. However, some (not all) of the categories on the site I am scraping have XML sitemaps. I would like the spider to parse those categories' .xml sitemaps to collect the links, and then leave it to the CrawlSpider to crawl deeper into them.
I know there are SitemapSpider and XMLFeedSpider, but what I need is the functionality of CrawlSpider combined with XMLFeedSpider, and vice versa.
Any help would be much appreciated.
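For context, a plain SitemapSpider on its own looks roughly like this (a minimal sketch; the spider name and sitemap URL are placeholders): it hands every sitemap entry to a callback, but offers no Rule-based deeper crawling afterwards.

    from scrapy.spiders import SitemapSpider

    class PlainSitemapSpider(SitemapSpider):
        # illustrative name and sitemap URL
        name = 'plain_sitemap'
        sitemap_urls = ['http://www.example.com/sitemap.xml']

        def parse(self, response):
            # every page listed in the sitemap lands here; there is no
            # CrawlSpider-style rule following from these pages
            self.logger.debug('sitemap page: %r', response)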
Answer 0 (score: 3)
To make CrawlSpider follow the URLs from sitemaps, you can write a custom link extractor for XML responses. However, it looks like CrawlSpider does not process XML responses, so you also need to override _requests_to_follow to accept them.
Below is an example spider I tried, starting from a sitemap.gz URL (containing a sitemapindex):
from scrapy.spiders.crawl import CrawlSpider, Rule
from scrapy.link import Link
from scrapy.http import Request


class XmlLinkExtractor(object):

    def __init__(self, xpath, namespaces):
        self.xpath = xpath
        self.namespaces = namespaces

    def extract_links(self, response):
        selector = response.selector
        if self.namespaces:
            for prefix, ns in self.namespaces.items():
                selector.register_namespace(prefix, ns)
        for link in selector.xpath(self.xpath).extract():
            yield Link(link)


class ExampleSitemapCrawlSpider(CrawlSpider):
    name = "myspider"
    start_urls = (
        # link to a sitemap index file
        'http://www.example.com/sitemap.gz',
        # link to a sitemap file
        #'http://www.example.com/sitemaps/sitemap-general.xml',
    )
    rules = (
        # this rule handles sitemap indexes, following links to other sitemaps
        Rule(XmlLinkExtractor('/sm:sitemapindex/sm:sitemap/sm:loc/text()',
                              {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"})),
        # this one is for "leaf" pages listed in sitemaps
        Rule(XmlLinkExtractor('/sm:urlset/sm:url/sm:loc/text()',
                              {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}),
             # defining a callback without follow=True makes the crawler
             # stop at these pages' level, not following deeper links;
             # unset the callback if you want those pages to go through
             # the other rules once downloaded
             callback='parse_loc'),
        # ... other rules
    )

    def _requests_to_follow(self, response):
        # we need to override _requests_to_follow and comment out these
        # two lines from the stock CrawlSpider, because they filter out
        # XML responses:
        #if not isinstance(response, HtmlResponse):
        #    return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = Request(url=link.url, callback=self._response_downloaded)
                r.meta.update(rule=n, link_text=link.text)
                yield rule.process_request(r)

    def parse_loc(self, response):
        self.logger.debug("parsing %r" % response)
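To try this out, assuming the snippet is saved as a standalone file (the filename below is illustrative), it can be run without a full Scrapy project:

    scrapy runspider sitemap_crawl_spider.py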
Depending on how you want to parse the pages listed under /urlset/url/loc, you may want to direct different URLs to different callbacks: add different rules, and customize XmlLinkExtractor to allow filtering (or do the filtering with XPath directly); one way to do that is sketched below.
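A sketch of that filtering idea (the allow parameter and the class name below are additions for illustration, not part of the original answer):

    import re

    from scrapy.link import Link


    class FilteringXmlLinkExtractor(object):
        """XmlLinkExtractor variant with an optional allow-pattern filter."""

        def __init__(self, xpath, namespaces, allow=None):
            self.xpath = xpath
            self.namespaces = namespaces
            # compile the optional regex once; None means "accept everything"
            self.allow_re = re.compile(allow) if allow else None

        def extract_links(self, response):
            selector = response.selector
            if self.namespaces:
                for prefix, ns in self.namespaces.items():
                    selector.register_namespace(prefix, ns)
            for url in selector.xpath(self.xpath).extract():
                if self.allow_re is None or self.allow_re.search(url):
                    yield Link(url)

Rules built on such an extractor can then send, say, product URLs to one callback and everything else to another.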
Answer 1 (score: 0)
You can simply add a rule to your current CrawlSpider and parse the XML yourself. You just need to add a rule at the top of rules and adjust sitemap_xml_xpath in the callback:
import scrapy
import scrapy.linkextractors
import scrapy.spiders.crawl


class SmartlipoSpider(scrapy.spiders.crawl.CrawlSpider):
    name = "myspider"
    start_urls = ('http://example.com/',)
    rules = (
        scrapy.spiders.crawl.Rule(
            scrapy.linkextractors.LinkExtractor(
                allow=r'sitemap\.xml$',
            ),
            callback='parse_sitemap_xml', follow=True,
        ),
        # the other rules...
    )

    def parse_sitemap_xml(self, response):
        # sitemap files live in the sitemaps.org namespace; stripping it
        # lets a plain XPath expression match
        response.selector.remove_namespaces()
        sitemap_xml_xpath = '/urlset/url/loc/text()'
        for url in response.xpath(sitemap_xml_xpath).extract():
            yield scrapy.Request(url)

    # your other callbacks...
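Note that requests yielded without an explicit callback go back through CrawlSpider's default parse and are matched against the other rules. If you would rather route the sitemap URLs to a dedicated handler, pass the callback explicitly; a small variation (the handler name here is made up):

    # inside the spider class:
    def parse_sitemap_xml(self, response):
        response.selector.remove_namespaces()
        for url in response.xpath('/urlset/url/loc/text()').extract():
            # send sitemap entries straight to a dedicated handler
            yield scrapy.Request(url, callback=self.parse_category_page)

    def parse_category_page(self, response):
        self.logger.debug('category page: %r', response)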