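Sometimes the site I am scraping returns URLs without a hostname (for example /search/en or search/en). How can I get the hostname so I can prepend it before making the request? At the moment I am hardcoding it.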
def parse_table(self, response):
    for links in self._parse_xpath(response, 'table'):
        for link in links:
            # Annoying part, it's not dynamic and hardcoded, other
            # functions also need to do this because of incomplete urls.
            yield Request(url='https://domain.io' + link,
                          callback=self.parse_document_tab)
Answer 0 (score: 0)
You can use the response.urljoin method to join a relative URL to the base URL:
def parse_table(self, response):
    for links in self._parse_xpath(response, 'table'):
        for link in links:
            yield Request(url=response.urljoin(link),
                          callback=self.parse_document_tab)
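To see what the join does with the two link shapes from the question, here is a small sketch using the standard library's urllib.parse.urljoin, which response.urljoin wraps. The page URL is made up for the example; in a spider the base comes from the response itself, so nothing needs to be hardcoded.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page the links were scraped from (an assumption
# for this example; in Scrapy the base is taken from the response).
page_url = "https://domain.io/docs/index.html"

# A link starting with "/" is resolved against the site root.
root_relative = urljoin(page_url, "/search/en")

# A link without a leading "/" is resolved against the page's directory.
path_relative = urljoin(page_url, "search/en")

# A link that is already absolute is returned unchanged.
already_absolute = urljoin(page_url, "https://other.example/search/en")

print(root_relative)     # https://domain.io/search/en
print(path_relative)     # https://domain.io/docs/search/en
print(already_absolute)  # https://other.example/search/en
```

Note that /search/en and search/en can resolve to different URLs depending on the path of the page they appear on, which plain string concatenation with a fixed host would not handle.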
Alternatively, use the newer response.follow method (Scrapy 1.4.0+), which builds the correct absolute URL and returns a Request object:
def parse_table(self, response):
    for links in self._parse_xpath(response, 'table'):
        for link in links:
            yield response.follow(link, callback=self.parse_document_tab)