Finding the number of duplicate URLs on a website with Scrapy

Asked: 2018-02-28 18:16:23

Tags: python web-scraping scrapy scrapy-spider scrape

How can I find the number of duplicate URLs on a website? The Scrapy framework does not crawl duplicate URLs by default, but I need to find the duplicate URLs and how many times each one occurs.

I tried to do this by counting the duplicate URLs in the spider's closed() function, but after some digging I realized that nothing yielded from that function is ever collected.
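For reference, a minimal sketch of the attempt described above (the spider name and URL are illustrative, not from the original post). closed() runs when the crawl finishes, but Scrapy ignores its return value, so nothing yielded there ever reaches the item pipeline:

import scrapy


class DupCountSpider(scrapy.Spider):
    name = 'dupcount'  # illustrative name
    start_urls = ['https://example.com']

    def parse(self, response):
        for link in response.css('a').xpath('@href').extract():
            yield response.follow(link, callback=self.parse)

    def closed(self, reason):
        # Scrapy calls closed(reason) when the spider finishes, but it
        # does not iterate the return value: yielding here turns the
        # method into a generator that is never consumed, so this item
        # is silently discarded.
        yield {'duplicate_urls': {}}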

2 answers:

Answer 0 (score: 1)

If you look at the source of RFPDupeFilter here, you can see that it keeps a count of the filtered requests.
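For context, the relevant part of log() in the Scrapy source at the time looked roughly like this (abridged from memory; check your installed version for the exact code):

def log(self, request, spider):
    if self.debug:
        msg = "Filtered duplicate request: %(request)s"
        self.logger.debug(msg, {'request': request}, extra={'spider': spider})
    ...
    # One aggregate counter for all filtered requests; the subclass
    # below adds a separate counter per URL.
    spider.crawler.stats.inc_value('dupefilter/filtered', spider=spider)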

If you override the log() method in a subclass, you can easily extend this to count per URL.

Something as simple as the following may work, though you may want to refine it further (make sure to set the DUPEFILTER_CLASS setting):

from scrapy.dupefilters import RFPDupeFilter


class URLStatsRFPDupeFilter(RFPDupeFilter):
    def log(self, request, spider):
        # Keep the default behaviour (debug logging and the aggregate
        # 'dupefilter/filtered' stat), then add a per-URL counter.
        super().log(request, spider)
        spider.crawler.stats.inc_value(
            'dupefilter/filtered/{}'.format(request.url),
            spider=spider
        )
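To enable it, point DUPEFILTER_CLASS at the subclass in your project settings; the module path myproject.dupefilters below is an assumed example:

# settings.py
# Assumes the subclass above was saved as myproject/dupefilters.py.
DUPEFILTER_CLASS = 'myproject.dupefilters.URLStatsRFPDupeFilter'

The per-URL counts then show up as dupefilter/filtered/<url> entries in the crawl stats that Scrapy dumps when the spider closes.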

Answer 1 (score: 0)

This Scrapy Documentation can help you get started. The code below should help too.

import scrapy


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        # Count how many times each href appears on this page.
        links_count = {}
        for link in response.css('a').xpath('@href').extract():
            links_count[link] = links_count.get(link, 0) + 1
        # Yield the per-URL counts as a single item.
        yield links_count

Run the command:

scrapy runspider yourfilename.py

Result:

{'https://wordpress.org/': 1, 'https://github.com/scrapinghub': 1, 'https://blog.scrapinghub.com/2016/10/27/an-introduction-to-xpath-with-examples/#comments': 1, 'https://www.instagram.com/scrapinghub/': 1, 'https://blog.scrapinghub.com/2016/11/10/how-you-can-use-web-data-to-accelerate-your-startup/#comments': 1, 'https://blog.scrapinghub.com/2017/07/07/scraping-the-steam-game-store-with-scrapy/': 4, 'https://scrapinghub.com/': 1, 'https://blog.scrapinghub.com/2017/11/05/a-faster-updated-scrapinghub/#comments': 1, 'https://www.youtube.com/channel/UCYb6YWTBfD0EB53shkN_6vA': 1, 'https://blog.scrapinghub.com/2017/11/05/a-faster-updated-scrapinghub/': 4, 'https://www.facebook.com/ScrapingHub/': 1, 'https://blog.scrapinghub.com/2016/11/10/how-you-can-use-web-data-to-accelerate-your-startup/': 3, 'https://blog.scrapinghub.com/author/andre-perunicic/': 1, 'http://blog.scrapinghub.com/rss': 1, 'https://blog.scrapinghub.com/2016/08/25/how-to-crawl-the-web-politely-with-scrapy/': 1, 'https://blog.scrapinghub.com/2016/11/24/how-to-build-your-own-price-monitoring-tool/': 4, 'https://blog.scrapinghub.com/page/2/': 1, 'https://scrapinghub.com/data-on-demand': 1, 'https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash/': 1, 'https://blog.scrapinghub.com/2016/04/20/scrapy-tips-from-the-pros-april-2016-edition/': 1, 'https://blog.scrapinghub.com/author/kmike84/': 1, 'https://blog.scrapinghub.com/author/cchaynessh/': 3, 'https://blog.scrapinghub.com/2016/11/24/how-to-build-your-own-price-monitoring-tool/#comments': 1, 'https://blog.scrapinghub.com/about/': 1, 'https://blog.scrapinghub.com/2016/06/22/scrapy-tips-from-the-pros-june-2016/': 1, 'https://www.linkedin.com/company/scrapinghub': 1, 'https://blog.scrapinghub.com/2017/06/19/do-androids-dream-of-electric-sheep/#respond': 1, 'https://blog.scrapinghub.com/author/valdir/': 3, 'https://plus.google.com/+Scrapinghub': 1, 'https://blog.scrapinghub.com/author/scott/': 2, 'https://scrapinghub.com/data-services/': 1, 'https://blog.scrapinghub.com/': 2, 'https://blog.scrapinghub.com/2017/04/19/deploy-your-scrapy-spiders-from-github/': 4, 'https://blog.scrapinghub.com/2017/01/01/looking-back-at-2016/': 3, 'https://blog.scrapinghub.com/2017/12/31/looking-back-at-2017/#comments': 1, 'https://blog.scrapinghub.com/2016/12/15/how-to-increase-sales-with-online-reputation-management/#comments': 1, 'https://twitter.com/scrapinghub': 1, 'https://blog.scrapinghub.com/2016/12/15/how-to-increase-sales-with-online-reputation-management/': 3, 'https://blog.scrapinghub.com/2017/06/19/do-androids-dream-of-electric-sheep/': 4, 'https://blog.scrapinghub.com/2017/01/01/looking-back-at-2016/#comments': 1, 'https://blog.scrapinghub.com/2017/12/31/looking-back-at-2017/': 4, 'https://wordpress.org/themes/nisarg/': 1, 'https://blog.scrapinghub.com/2016/10/27/an-introduction-to-xpath-with-examples/': 3}
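Note that the spider above only counts duplicate links within a single page. A sketch of one way to tally links across the whole site instead, using a collections.Counter and logging (rather than yielding) the totals from closed(); the spider name is illustrative:

import scrapy
from collections import Counter


class SiteDupSpider(scrapy.Spider):
    name = 'sitedupspider'  # illustrative name
    allowed_domains = ['blog.scrapinghub.com']
    start_urls = ['https://blog.scrapinghub.com']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.links_count = Counter()  # site-wide tally of hrefs

    def parse(self, response):
        links = response.css('a').xpath('@href').extract()
        self.links_count.update(links)
        # Follow links so every crawled page contributes to the tally;
        # the built-in dupefilter keeps pages from being parsed twice.
        for link in links:
            yield response.follow(link, callback=self.parse)

    def closed(self, reason):
        # Items cannot be yielded here, but logging works fine.
        duplicates = {url: n for url, n in self.links_count.items() if n > 1}
        self.logger.info('URLs seen more than once: %s', duplicates)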