How do I get the hostname of a request?

Time: 2018-09-10 12:41:13

Tags: python scrapy

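Sometimes when I scrape a site, the links it returns don't include the hostname (for example, /search/en or search/en). How can I get the hostname so that I can prepend it before making the request? At the moment I'm hardcoding it.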

def parse_table(self, response):
    for links in self._parse_xpath(response, 'table'):
        for link in links:
            # Annoying part: the host is hardcoded rather than derived from
            # the response, and other callbacks repeat it for incomplete URLs.
            yield Request(url='https://domain.io' + link,
                          callback=self.parse_document_tab)

1 answer:

Answer 0: (score: 0)

You can use the response.urljoin method to join a relative URL to the response's base URL:

def parse_table(self, response):
    for links in self._parse_xpath(response, 'table'):
        for link in links:
            yield Request(url=response.urljoin(link),
                          callback=self.parse_document_tab)
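
To see why joining is safer than string concatenation, here is a minimal sketch using the standard library's urllib.parse.urljoin, which follows the same resolution rules. The base URL and paths are made-up examples, not taken from a real site.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page being parsed (stands in for response.url).
base = "https://domain.io/docs/list?page=2"

# A root-relative link replaces the whole path of the base URL.
root_relative = urljoin(base, "/search/en")

# A link without a leading slash resolves against the base URL's directory.
dir_relative = urljoin(base, "search/en")

# An already absolute link is returned unchanged.
absolute = urljoin(base, "https://other.example/search/en")

# Plain concatenation breaks for links without a leading slash.
concatenated = "https://domain.io" + "search/en"

print(root_relative)  # https://domain.io/search/en
print(dir_relative)   # https://domain.io/docs/search/en
print(absolute)       # https://other.example/search/en
print(concatenated)   # https://domain.iosearch/en
```

Note that a link such as search/en is resolved relative to the current page's directory, so it does not necessarily end up directly under the site root.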

Alternatively, use the response.follow method (Scrapy 1.4.0+), which builds the correct absolute URL and returns a Request object:

def parse_table(self, response):
    for links in self._parse_xpath(response, 'table'):
        for link in links:
            yield response.follow(link, callback=self.parse_document_tab)
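
If you do need the hostname itself, as the question title asks, one possible approach is to parse the URL of the current response with the standard library. The sketch below assumes the page URL is available as a string (in Scrapy that would be response.url); the helper name origin_of and the sample URL are hypothetical.

```python
from urllib.parse import urlsplit


def origin_of(url):
    """Return the scheme and host part of a URL, e.g. 'https://domain.io'."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


# Hypothetical page URL, standing in for response.url inside a callback.
page_url = "https://domain.io/search/en?page=2"

origin = origin_of(page_url)
hostname = urlsplit(page_url).hostname

print(origin)    # https://domain.io
print(hostname)  # domain.io
```

For building request URLs, response.urljoin or response.follow is usually still the simpler choice, since both also handle links without a leading slash.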