Question

Setup

I have the next-page button element from this page:

<li class="Pagination-item Pagination-item--next  Pagination-item--nextSolo ">
                        <button type="button" class="Pagination-link js-veza-stranica kist-FauxAnchor" data-page="2" data-href="https://www.njuskalo.hr/prodaja-kuca?page=2" role="link">Sljedeća&nbsp;<span aria-hidden="true" role="presentation">»</span></button>
                    </li>

I need to get the URL stored in the data-href attribute.

Code

Using the following simple XPath to select the button element in the Scrapy shell,

response.xpath('//*[@id="form_browse_detailed_search"]/div/div[1]/div[5]/div[1]/nav/ul/li[8]/button').extract_first()

I get back

'<button type="button" class="Pagination-link js-veza-stranica" data-page="2">Sljedeća\xa0<span aria-hidden="true" role="presentation">»</span></button>'

Problem

Where did the data-href attribute go?

How can I get the URL?

Answer 1

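The data-href attribute is most likely computed by JavaScript code running in the browser. If you look at the raw source of this page (the "View Source" option in your browser), you will not find the attribute on that element.

What you see in the developer tools is the DOM as rendered by the browser, so differences between the browser view and what Scrapy actually fetches (the raw HTML source) are to be expected. Keep in mind that Scrapy does not execute any JavaScript.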
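One way to confirm this without a browser is to list the attributes that are actually present in the raw markup. The following is a minimal sketch using only the standard library; it parses the button string returned by the Scrapy shell in the question. The class name ButtonAttrs and the variable names are hypothetical, chosen for illustration.

```python
from html.parser import HTMLParser


class ButtonAttrs(HTMLParser):
    """Collects the attributes of the last <button> start tag seen."""

    def __init__(self):
        super().__init__()
        self.attrs = {}

    def handle_starttag(self, tag, attrs):
        if tag == "button":
            self.attrs = dict(attrs)


# The raw HTML that Scrapy returned for the button (copied from the question).
raw_button = (
    '<button type="button" class="Pagination-link js-veza-stranica" '
    'data-page="2">Sljedeća\xa0<span aria-hidden="true" '
    'role="presentation">»</span></button>'
)

parser = ButtonAttrs()
parser.feed(raw_button)
button_attrs = parser.attrs

print(sorted(button_attrs))         # attribute names present in the raw HTML
print("data-href" in button_attrs)  # False: the attribute is added later by JavaScript
print(button_attrs.get("data-page"))
```

Only type, class, and data-page are present in the raw response, so data-page is the value available to work with.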

In any case, one way to solve this is to build the pagination URL from the data-page attribute:

from w3lib.url import add_or_replace_parameter
...
next_page = response.css('.Pagination-item--nextSolo button::attr(data-page)').get()
next_page_url = add_or_replace_parameter(response.url, 'page', next_page)

w3lib is an open source library: https://github.com/scrapy/w3lib
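If w3lib is not available, a similar result can be sketched with the standard library alone. The helper set_query_param below is hypothetical and is not part of w3lib or Scrapy. It drops any existing value of the parameter and appends the new one at the end, so parameter ordering may differ from what w3lib produces.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def set_query_param(url, name, value):
    """Return url with query parameter `name` set to `value` (hypothetical helper)."""
    parts = urlsplit(url)
    # Keep every existing parameter except the one being replaced.
    params = [
        (key, val)
        for key, val in parse_qsl(parts.query, keep_blank_values=True)
        if key != name
    ]
    params.append((name, value))
    return urlunsplit(parts._replace(query=urlencode(params)))


# First page has no query string, so the parameter is added.
page_two = set_query_param("https://www.njuskalo.hr/prodaja-kuca", "page", "2")
# Later pages already carry ?page=N, so the value is replaced.
page_three = set_query_param(page_two, "page", "3")

print(page_two)
print(page_three)
```

In a spider, the first argument would be response.url and the value would be the data-page string extracted from the button.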

Href not visible in Scrapy result, but visible in HTML
