Scrapy - only scraping domain names

Time: 2019-03-14 11:40:05

Tags: dns scrapy web-crawler

How can I use Scrapy to scrape only domain names?

I am not interested in a deep crawl of any domain.tld. My idea is to go only one jump deep from the index page of every domain, so the direct links on the homepage would be enough to fill the link buffer.

I need the crawler to be as fast as possible.

I want to limit the crawl to .cz domains.

Thank you.

1 answer:

Answer 0 (score: 0)

You can use the DEPTH_LIMIT setting to limit the crawl to the depth you want.

https://docs.scrapy.org/en/latest/topics/settings.html?highlight=depth_limit
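As a minimal sketch, the setting could go in the project's settings.py. Scrapy assigns depth 0 to the start URLs, so a limit of 1 lets the spider follow the links found on each start page and go no further. The value below is an assumption about what "one jump" should mean for this crawl.

```python
# settings.py (fragment)

# Start URLs are depth 0; links found on them are depth 1.
# Requests deeper than DEPTH_LIMIT are dropped by Scrapy's DepthMiddleware.
DEPTH_LIMIT = 1
```

The same value can also be set for a single spider through its `custom_settings` class attribute.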

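If you only want to go one jump deep, set DEPTH_LIMIT=1 (Scrapy treats the start URLs as depth 0, so DEPTH_LIMIT=2 would allow a second jump) and use a selector or a link extractor to select the links.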

For example: response.xpath('//a/@href').re(r'.*\.example\.com.*')

https://docs.scrapy.org/en/latest/topics/selectors.html

https://docs.scrapy.org/en/latest/topics/spiders.html?highlight=link_extractor
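Since the question asks for domain names rather than full URLs, the hrefs returned by the selector can be reduced to hostnames and filtered by TLD. The sketch below uses only the Python standard library. The function name `cz_domains`, its parameters, and the example URLs are hypothetical and chosen for illustration only.

```python
from urllib.parse import urljoin, urlparse


def cz_domains(base_url, hrefs, tld=".cz"):
    """Return the distinct hostnames ending in `tld` found in `hrefs`.

    `base_url` is the page the links came from and is used to resolve
    relative links. All names here are hypothetical, not Scrapy API.
    """
    found = set()
    for href in hrefs:
        # Resolve relative and protocol-relative links against the page URL.
        absolute = urljoin(base_url, href.strip())
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, tel: and similar
        host = (parsed.hostname or "").lower()
        if host.endswith(tld):
            found.add(host)
    return sorted(found)


if __name__ == "__main__":
    links = ["/about", "https://Shop.example.cz/page", "http://example.com/x"]
    print(cz_domains("https://www.example.cz/", links))
```

Inside a spider callback, this could be called as `cz_domains(response.url, response.xpath('//a/@href').getall())`, yielding one item per returned hostname.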