How can I find the number of duplicate URLs on a website? The Scrapy framework does not scrape duplicate URLs by default. I just need to find the duplicate URLs and how many times each one occurs.
I tried to do this by counting the duplicate URLs in the function that runs when the spider closes, but after some digging I realized that we cannot yield anything from that function.
Answer 0 (score: 1)
If you look at the source of RFPDupeFilter here, you can see that it keeps track of the number of filtered requests. If you override the log() method in a subclass, you can easily get per-URL counts.
Something simple like this may work, or you may want to refine it further (make sure to set the DUPEFILTER_CLASS setting):
from scrapy.dupefilters import RFPDupeFilter

class URLStatsRFPDupeFilter(RFPDupeFilter):
    def log(self, request, spider):
        # Keep the default behaviour (debug logging, dupefilter/filtered counter)
        super().log(request, spider)
        # Record one stats entry per filtered (duplicate) URL
        spider.crawler.stats.inc_value(
            'dupefilter/filtered/{}'.format(request.url),
            spider=spider
        )
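To enable the subclass, point the DUPEFILTER_CLASS setting at it in your project's settings.py. This is a minimal sketch; the module path myproject.dupefilters is only an assumption about where you saved the class. The per-URL counts then show up in the crawl stats that Scrapy dumps at the end of the run.

# settings.py
# Assumed location of the subclass; adjust the path to your project layout.
DUPEFILTER_CLASS = 'myproject.dupefilters.URLStatsRFPDupeFilter'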
Answer 1 (score: 0)
This Scrapy Documentation can help you get started. The following code may help:
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        links_count = {}
        for link in response.css('a').xpath('@href').extract():
            if link in links_count:
                links_count[link] += 1
            else:
                links_count[link] = 1
        yield links_count
Run the spider with:
scrapy runspider yourfilename.py
Output:
{' https://wordpress.org/': 1, 'https://github.com/scrapinghub': 1, 'https://blog.scrapinghub.com/2016/10/27/an-introduction-to-xpath-with-examples/#comments': 1, 'https://www.instagram.com/scrapinghub/': 1, 'https://blog.scrapinghub.com/2016/11/10/how-you-can-use-web-data-to-accelerate-your-startup/#comments': 1, 'https://blog.scrapinghub.com/2017/07/07/scraping-the-steam-game-store-with-scrapy/': 4, 'https://scrapinghub.com/': 1, 'https://blog.scrapinghub.com/2017/11/05/a-faster-updated-scrapinghub/#comments': 1, 'https://www.youtube.com/channel/UCYb6YWTBfD0EB53shkN_6vA': 1, 'https://blog.scrapinghub.com/2017/11/05/a-faster-updated-scrapinghub/': 4, 'https://www.facebook.com/ScrapingHub/': 1, 'https://blog.scrapinghub.com/2016/11/10/how-you-can-use-web-data-to-accelerate-your-startup/': 3, 'https://blog.scrapinghub.com/author/andre-perunicic/': 1, 'http://blog.scrapinghub.com/rss': 1, 'https://blog.scrapinghub.com/2016/08/25/how-to-crawl-the-web-politely-with-scrapy/': 1, 'https://blog.scrapinghub.com/2016/11/24/how-to-build-your-own-price-monitoring-tool/': 4, 'https://blog.scrapinghub.com/page/2/': 1, 'https://scrapinghub.com/data-on-demand': 1, 'https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash/': 1, 'https://blog.scrapinghub.com/2016/04/20/scrapy-tips-from-the-pros-april-2016-edition/': 1, 'https://blog.scrapinghub.com/author/kmike84/': 1, 'https://blog.scrapinghub.com/author/cchaynessh/': 3, 'https://blog.scrapinghub.com/2016/11/24/how-to-build-your-own-price-monitoring-tool/#comments': 1, 'https://blog.scrapinghub.com/about/': 1, 'https://blog.scrapinghub.com/2016/06/22/scrapy-tips-from-the-pros-june-2016/': 1, 'https://www.linkedin.com/company/scrapinghub': 1, 'https://blog.scrapinghub.com/2017/06/19/do-androids-dream-of-electric-sheep/#respond': 1, 'https://blog.scrapinghub.com/author/valdir/': 3, 'https://plus.google.com/+Scrapinghub': 1, 'https://blog.scrapinghub.com/author/scott/': 2, 'https://scrapinghub.com/data-services/': 1, 'https://blog.scrapinghub.com/': 2, 'https://blog.scrapinghub.com/2017/04/19/deploy-your-scrapy-spiders-from-github/': 4, 'https://blog.scrapinghub.com/2017/01/01/looking-back-at-2016/': 3, 'https://blog.scrapinghub.com/2017/12/31/looking-back-at-2017/#comments': 1, 'https://blog.scrapinghub.com/2016/12/15/how-to-increase-sales-with-online-reputation-management/#comments': 1, 'https://twitter.com/scrapinghub': 1, 'https://blog.scrapinghub.com/2016/12/15/how-to-increase-sales-with-online-reputation-management/': 3, 'https://blog.scrapinghub.com/2017/06/19/do-androids-dream-of-electric-sheep/': 4, 'https://blog.scrapinghub.com/2017/01/01/looking-back-at-2016/#comments': 1, 'https://blog.scrapinghub.com/2017/12/31/looking-back-at-2017/': 4, 'https://wordpress.org/themes/nisarg/': 1, 'https://blog.scrapinghub.com/2016/10/27/an-introduction-to-xpath-with-examples/': 3}
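Note that the spider above only counts links within each individual response. If you want totals across the whole crawl, one option (a minimal sketch, not part of the original answer) is to accumulate the counts in a spider attribute and report them when the spider closes; closed() cannot yield items, so it only logs the result:

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']
    links_count = {}  # accumulated across all parsed responses

    def parse(self, response):
        for link in response.css('a').xpath('@href').extract():
            self.links_count[link] = self.links_count.get(link, 0) + 1

    def closed(self, reason):
        # closed() cannot yield items, but it can log the duplicates
        duplicates = {url: n for url, n in self.links_count.items() if n > 1}
        self.logger.info('Duplicate links: %s', duplicates)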