Question

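Sometimes the site I'm scraping returns URLs without the hostname (for example /search/en or search/en). How can I get the hostname so that I can prepend it before making the request? At the moment I'm hardcoding it.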

def parse_table(self, response):
    for links in self._parse_xpath(response, 'table'):
        for link in links:
            # Annoying part: the host is hardcoded rather than dynamic, and
            # other functions have to do the same because of incomplete URLs.
            yield Request(url='https://domain.io' + link,
                          callback=self.parse_document_tab)

Answer 1

You can use the response.urljoin method to join a relative URL to the base URL:

def parse_table(self, response):
    for links in self._parse_xpath(response, 'table'):
        for link in links:
            yield Request(url=response.urljoin(link),
                          callback=self.parse_document_tab)

Alternatively, use the newer response.follow method (Scrapy 1.4.0+), which builds the correct absolute URL and returns a Request object:

def parse_table(self, response):
    for links in self._parse_xpath(response, 'table'):
        for link in links:
            yield response.follow(link, callback=self.parse_document_tab)
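
If the hostname itself is needed, for logging or for building URLs elsewhere, it can be read from the URL of the page being parsed. The sketch below uses only the standard library. The sample URL and the site_root helper are illustrative assumptions; in a spider the input would be response.url.

```python
from urllib.parse import urlsplit


def site_root(url):
    """Return scheme://host[:port] for an absolute URL (hypothetical helper)."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


# Hypothetical URL, standing in for response.url.
page_url = "https://domain.io/search/en?page=2"

host = urlsplit(page_url).hostname  # host name only, without scheme or port
root = site_root(page_url)          # scheme plus host, usable as a prefix

print(host)  # domain.io
print(root)  # https://domain.io
```

For building requests, response.urljoin or response.follow remains the simpler choice, since both handle the root-relative and path-relative cases without manual prefixing.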
