How can I scrape only domain names with Scrapy?
I am not interested in a deep crawl of any domain.tld. My idea is to go only one hop from the index page of every domain, so the direct links on the homepage would be sufficient for the link buffer.
I need the crawler to be as fast as possible.
I want to limit the set of domains to .cz.
Thank you.
Answer 0 (score: 0)
You can use the DEPTH_LIMIT parameter in the settings to limit the crawl to the desired depth.
https://docs.scrapy.org/en/latest/topics/settings.html?highlight=depth_limit
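As a rough sketch of where that setting goes, assuming a standard Scrapy project with a settings.py file (the value 1 is an assumption: Scrapy treats the start URLs as depth 0, so 1 permits a single hop from each homepage):

```python
# settings.py (sketch; adjust the value to the depth you need)
# Assumption: start URLs are depth 0, so links found on them are depth 1.
DEPTH_LIMIT = 1
```

The same key can also be set for a single spider through its custom_settings class attribute.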
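If you only want to follow links one hop from the start page, set DEPTH_LIMIT=1 (start URLs count as depth 0, so DEPTH_LIMIT=2 would allow a second hop), and use a selector or link_extractor to select the links.
For example: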
response.xpath('//a/@href').re(r'.*\.example\.com.*')
https://docs.scrapy.org/en/latest/topics/selectors.html https://docs.scrapy.org/en/latest/topics/spiders.html?highlight=link_extractor
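To get bare .cz domain names out of the extracted links rather than full URLs, something like the following might work. It is only a sketch using the standard library; the function name cz_domains and its parameters are made up for illustration and are not part of Scrapy.

```python
from urllib.parse import urljoin, urlsplit


def cz_domains(base_url, hrefs, suffix=".cz"):
    """Return the distinct hostnames ending in `suffix` among `hrefs`.

    Hypothetical helper: `base_url` is the page the links came from,
    `hrefs` is any iterable of raw href strings.
    """
    found = set()
    for href in hrefs:
        # Resolve relative links against the page URL, then take the host.
        host = urlsplit(urljoin(base_url, href)).hostname
        # Links such as mailto: or javascript: have no hostname and are skipped.
        if host and host.endswith(suffix):
            found.add(host)
    return sorted(found)


links = [
    "/about",
    "https://www.seznam.cz/x",
    "http://Example.COM/",
    "mailto:a@b.cz",
    "//idnes.cz/zpravy",
]
domains = cz_domains("https://example.cz/", links)
```

Inside a spider callback you could pass response.url and response.xpath('//a/@href').getall() to it and yield one request per returned hostname.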