I need to use scrapy to crawl all internal links of a website, for example every link on www.stackoverflow.com. This code works:
    from scrapy.linkextractors import LinkExtractor

    extractor = LinkExtractor(allow_domains=self.getBase(self.startDomain))
    for link in extractor.extract_links(response):
        self.registerUrl(link.url)
There is one small problem, though: relative paths such as /meta or /questions/ask are not crawled, because they do not contain the base domain stackoverflow.com. Any idea how to solve this?
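For reference, a minimal sketch of one common workaround, assuming the callback has the Scrapy response object in scope: Response.urljoin resolves a relative href against the page's own URL, so the registered URL is always absolute. getBase, startDomain and registerUrl are the asker's own helpers and are assumed to work as in the snippet above.

    # Hypothetical callback: resolve relative hrefs before registering them.
    def parse(self, response):
        extractor = LinkExtractor(allow_domains=self.getBase(self.startDomain))
        for link in extractor.extract_links(response):
            self.registerUrl(link.url)
        # Relative paths such as /meta are made absolute against the
        # response's own base URL before being registered.
        for href in response.css("a::attr(href)").getall():
            self.registerUrl(response.urljoin(href))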
Answer 0 (score: 1)
If I understand the question correctly, you want to use scrapy.spidermiddlewares.offsite.OffsiteMiddleware (https://doc.scrapy.org/en/latest/topics/spider-middleware.html#scrapy.spidermiddlewares.offsite.OffsiteMiddleware) to filter out requests for URLs outside the domains covered by the spider.
This middleware filters out every request whose host names aren't in the spider's allowed_domains attribute. All subdomains of any domain in the list are also allowed. E.g. the rule www.example.org will also allow bob.www.example.org but not www2.example.com nor example.com.

When your spider returns a request for a domain not belonging to those covered by the spider, this middleware will log a debug message similar to this one:

DEBUG: Filtered offsite request to 'www.othersite.com': <GET http://www.othersite.com/some/page.html>

To avoid filling the log with too much noise, it will only print one of these messages for each new domain filtered. So, for example, if another request for www.othersite.com is filtered, no log message will be printed. But if a request for someothersite.com is filtered, a message will be printed (but only for the first request filtered).

If the spider doesn't define an allowed_domains attribute, or the attribute is empty, the offsite middleware will allow all requests. If the request has the dont_filter attribute set, the offsite middleware will allow the request even if its domain is not listed in allowed domains.
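A minimal sketch of how this looks in practice; the spider name, start URL, and parse logic here are illustrative, not the asker's actual code:

    import scrapy

    class InternalLinksSpider(scrapy.Spider):
        name = "internal_links"  # hypothetical spider name
        # OffsiteMiddleware filters requests whose host is not covered here;
        # subdomains such as meta.stackoverflow.com are allowed automatically.
        allowed_domains = ["stackoverflow.com"]
        start_urls = ["https://stackoverflow.com/"]

        def parse(self, response):
            for href in response.css("a::attr(href)").getall():
                # urljoin makes relative paths such as /questions/ask absolute;
                # yielding a Request lets the middleware do the filtering.
                yield scrapy.Request(response.urljoin(href), callback=self.parse)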
My understanding is that URLs are normalized before the filtering takes place.
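To check the domain matching by hand, Scrapy ships a utility, scrapy.utils.url.url_is_from_any_domain; whether the middleware uses exactly this helper internally is an assumption, but it illustrates the subdomain rule quoted above:

    from scrapy.utils.url import url_is_from_any_domain

    # Subdomains of an allowed domain match; unrelated hosts do not.
    print(url_is_from_any_domain(
        "https://meta.stackoverflow.com/questions", ["stackoverflow.com"]))  # True
    print(url_is_from_any_domain(
        "https://www.othersite.com/some/page.html", ["stackoverflow.com"]))  # False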