How can I use Scrapy to crawl URLs from a sitemap based on their last-modified date?

Time: 2017-12-04 07:45:08

Tags: python web-scraping scrapy sitemap

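I'm trying to implement an incremental crawler, but in this case, rather than matching URLs, I want to match a field in the sitemap XML to check whether a page has been modified. The problem is that I can't figure out where I should intercept the request that fetches the sitemap URLs, so that I can add logic to compare against the stored <lastmod> values and return only the URLs whose value has changed.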

Here is the XML:

<url>
<loc>https://www.example.com/hello?id=1</loc>
<lastmod>2017-12-03</lastmod>
<changefreq>Daily</changefreq>
<priority>1.0</priority>
</url>

Sitemap spider:

from scrapy.spiders import SitemapSpider

class ExampleSpider(SitemapSpider):
  name = "example"
  allowed_domains = []
  sitemap_urls = ["https://www.example.com/sitemaps.xml"]
  sitemap_rules = [
    ('/hello/', 'parse_data')
  ]

  def parse_data(self,response):
    pass

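My question: is it possible to override the sitemap spider's _parse_sitemap method? As far as I can tell, Scrapy's SitemapSpider only looks at the <loc> element. Can I override that with process_request, the way we do in a normal spider?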

1 Answer:

Answer 0: (score: 1)

If all you need is to read the lastmod value and then crawl each loc that meets some condition, this should work:

import re

import scrapy


class ExampleSpider(scrapy.Spider):
  # A plain Spider is used here: CrawlSpider relies on parse() internally,
  # so overriding parse() on a CrawlSpider breaks its rule handling.
  name = "example"
  start_urls = ["https://www.example.com/sitemaps.xml"]

  def parse(self, response):
    # Selector with type="xml" replaces the old XmlXPathSelector class.
    sitemap = scrapy.Selector(response, type="xml")
    sitemap.register_namespace(
      # 'ns' is just a prefix; the second param should be whatever the
      # xmlns of your sitemap is
      'ns', 'http://www.sitemaps.org/schemas/sitemap/0.9'
    )
    # Walk each <url> entry so that every loc stays paired with its own
    # lastmod, even when some entries have no <lastmod> at all.
    for entry in sitemap.xpath('//ns:url'):
      url = entry.xpath('ns:loc/text()').get()
      lastMod = entry.xpath('ns:lastmod/text()').get()
      # ... add the rest of your lastmod condition to this test
      if url and lastMod and re.search(r'/hello/', url):
        # crawl the url
        yield response.follow(url, callback=self.parse_data)

  def parse_data(self, response):
    pass
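
The "rest of your condition" is the comparison against the <lastmod> values stored from a previous run. Below is a minimal sketch of that check on its own, using only the standard library so it can be tried without Scrapy. The names changed_urls, stored_lastmod, and SAMPLE are hypothetical and only for illustration, and it assumes the sitemap uses the standard sitemaps.org namespace:

```python
import xml.etree.ElementTree as ET

# Assumed namespace (standard sitemaps.org); adjust if your sitemap declares another.
NS = {"ns": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample sitemap, modelled on the XML in the question.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/hello?id=1</loc>
    <lastmod>2017-12-03</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/hello?id=2</loc>
    <lastmod>2017-11-30</lastmod>
  </url>
</urlset>"""


def changed_urls(sitemap_xml, stored_lastmod):
    """Yield (url, lastmod) for entries that are new or whose lastmod changed.

    stored_lastmod is a dict {url: lastmod string} saved from the previous run.
    """
    root = ET.fromstring(sitemap_xml)
    for entry in root.findall("ns:url", NS):
        url = (entry.findtext("ns:loc", default="", namespaces=NS) or "").strip()
        lastmod = (entry.findtext("ns:lastmod", default="", namespaces=NS) or "").strip()
        if not url:
            continue
        # Entries without a lastmod are treated as changed, since we cannot tell.
        if not lastmod or stored_lastmod.get(url) != lastmod:
            yield url, lastmod


if __name__ == "__main__":
    stored = {
        "https://www.example.com/hello?id=1": "2017-12-01",
        "https://www.example.com/hello?id=2": "2017-11-30",
    }
    for url, lastmod in changed_urls(SAMPLE, stored):
        print(url, lastmod)
```

Inside the spider, the same comparison could replace the lastMod test in parse(). Newer Scrapy releases also document a sitemap_filter(entries) method on SitemapSpider, where each entry is a dict with keys such as 'loc' and 'lastmod'; if your version has it, that hook may be a more direct place for this check than overriding _parse_sitemap.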