I need to use scrapy to crawl all internal links of a website, for example every link on www.stackoverflow.com. This code works:
    from scrapy.linkextractors import LinkExtractor

    extractor = LinkExtractor(allow_domains=self.getBase(self.startDomain))
    for link in extractor.extract_links(response):
        self.registerUrl(link.url)
There is one small problem, though: relative paths such as /meta or /questions/ask are not crawled, because they do not contain the base domain stackoverflow.com. Any idea how to solve this?
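For reference, a minimal sketch of one common workaround, assuming the callback has the Scrapy response object in scope: Response.urljoin resolves a relative href against the page's own URL, so the registered URL is always absolute. getBase, startDomain and registerUrl are the asker's own helpers and are assumed to work as in the snippet above.

    # Hypothetical callback: resolve relative hrefs before registering them.
    def parse(self, response):
        extractor = LinkExtractor(allow_domains=self.getBase(self.startDomain))
        for link in extractor.extract_links(response):
            self.registerUrl(link.url)
        # Relative paths such as /meta are made absolute against the
        # response's own base URL before being registered.
        for href in response.css("a::attr(href)").getall():
            self.registerUrl(response.urljoin(href))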
Answer 0 (score: 1)
If I understand the question correctly, you want to use scrapy.spidermiddlewares.offsite.OffsiteMiddleware (https://doc.scrapy.org/en/latest/topics/spider-middleware.html#scrapy.spidermiddlewares.offsite.OffsiteMiddleware) to filter out requests for URLs outside the domains covered by the spider.
This middleware filters out every request whose host names aren't in the spider's allowed_domains attribute. All subdomains of any domain in the list are also allowed. E.g. the rule www.example.org will also allow bob.www.example.org but not www2.example.com nor example.com.

When your spider returns a request for a domain not belonging to those covered by the spider, this middleware will log a debug message similar to this one:

DEBUG: Filtered offsite request to 'www.othersite.com': <GET http://www.othersite.com/some/page.html>

To avoid filling the log with too much noise, it will only print one of these messages for each new domain filtered. So, for example, if another request for www.othersite.com is filtered, no log message will be printed. But if a request for someothersite.com is filtered, a message will be printed (but only for the first request filtered).

If the spider doesn't define an allowed_domains attribute, or the attribute is empty, the offsite middleware will allow all requests. If the request has the dont_filter attribute set, the offsite middleware will allow the request even if its domain is not listed in allowed domains.
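A minimal sketch of how this looks in practice; the spider name, start URL, and parse logic here are illustrative, not the asker's actual code:

    import scrapy

    class InternalLinksSpider(scrapy.Spider):
        name = "internal_links"  # hypothetical spider name
        # OffsiteMiddleware filters requests whose host is not covered here;
        # subdomains such as meta.stackoverflow.com are allowed automatically.
        allowed_domains = ["stackoverflow.com"]
        start_urls = ["https://stackoverflow.com/"]

        def parse(self, response):
            for href in response.css("a::attr(href)").getall():
                # urljoin makes relative paths such as /questions/ask absolute;
                # yielding a Request lets the middleware do the filtering.
                yield scrapy.Request(response.urljoin(href), callback=self.parse)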
My understanding is that URLs are normalized before the filtering takes place.
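To check the domain matching by hand, Scrapy ships a utility, scrapy.utils.url.url_is_from_any_domain; whether the middleware uses exactly this helper internally is an assumption, but it illustrates the subdomain rule quoted above:

    from scrapy.utils.url import url_is_from_any_domain

    # Subdomains of an allowed domain match; unrelated hosts do not.
    print(url_is_from_any_domain(
        "https://meta.stackoverflow.com/questions", ["stackoverflow.com"]))  # True
    print(url_is_from_any_domain(
        "https://www.othersite.com/some/page.html", ["stackoverflow.com"]))  # False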