Question

我正在尝试在脚本中添加抓取页面的当前URL。但是由于某些原因，我无法选择此选项：

<link rel="canonical" href="https://www.cdiscount.com/sante-mieux-vivre/hygiene-beaute-parapharmacie-bio/v-16516-16516.html" />

它嵌套在head中。

我尝试了response.xpath("//head/link[@rel='canonical']@href").extract()

我在做什么错了？

Answer 1

如果只需要当前响应的网址。您可以只使用response.url

Answer 2

如果您确实需要规范的URL，这应该可以：

response.xpath("//link[@rel='canonical']/@href").get()

您的表达式在/之前缺少@href。

您还可以使用CSS：

response.css("link[rel='canonical']::attr(href)").get()

如果您不关心规范的URL，则可以遵循上面@Yall的建议。