Question

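I'm using HTTPClient and Jsoup to fetch URLs and navigate pages. I have a case where a single link leads to 3 or 4 pages controlled by pagination. Clicking a page number runs an onclick handler that submits the form as a POST, which changes the URL and completes the navigation. How can I get these URLs from the main page?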

This is how the pagination appears in the UI:

<div class="pagination">
<div class="label">Page: </div>         
<div class="button selected" onclick="$('.page-position', $(this).closest('form')).attr('value', $(this).html()); $(this).closest('form').submit();">1</div>        
<div class="button " onclick="$('.page-position', $(this).closest('form')).attr('value', $(this).html()); $(this).closest('form').submit();">2</div>
<div class="button " onclick="$('.page-position', $(this).closest('form')).attr('value', $(this).html()); $(this).closest('form').submit();">3</div>            
<div class="button" onclick="$('.page-position', $(this).closest('form')).attr('value', 2);$(this).closest('form').submit();">Next</div>
</div>

Answer 1

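Jsoup parses static HTML. Here the URL is produced by JavaScript/jQuery, so you can't get it with Jsoup. You could try HtmlUnit to load the page and execute the JavaScript; after that, selecting the div is simple.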
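As an alternative to rendering the JavaScript: the onclick handlers in the question only write the page number into the `.page-position` field and submit the enclosing form, so it may be possible to reproduce that POST directly. Below is a minimal sketch using only the JDK (`java.net.http`, Java 11+), not HtmlUnit. The action URL `https://example.com/search`, the field name `pagePosition`, and the extra `query` field are all hypothetical; the question shows only the field's class, so the real `action` and input `name` attributes would have to be read from the form in the page source.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class PaginationForm {

    // Encode form fields as an application/x-www-form-urlencoded body.
    static String encodeForm(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // Build the POST that clicking the button for `page` would presumably trigger.
    // formAction and pageFieldName are assumptions: take them from the real form.
    static HttpRequest pageRequest(String formAction, String pageFieldName,
                                   Map<String, String> otherFields, int page) {
        // Copy the other form fields so the caller's map is not modified.
        Map<String, String> fields = new LinkedHashMap<>(otherFields);
        // Mirrors $('.page-position', form).attr('value', <page number>).
        fields.put(pageFieldName, String.valueOf(page));
        return HttpRequest.newBuilder(URI.create(formAction))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(encodeForm(fields)))
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical extra form field; a real form may carry others.
        Map<String, String> other = new LinkedHashMap<>();
        other.put("query", "foo bar");

        HttpRequest request =
                pageRequest("https://example.com/search", "pagePosition", other, 2);
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Sending the request with `HttpClient.send(...)` and passing the response body to `Jsoup.parse(...)` would then give the HTML of the requested page, assuming the server needs nothing beyond these form fields (for example, no session cookies or tokens).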

Answer 2

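It depends on the page you're trying to fetch. Most sites today have well-structured URLs, so it really comes down to how easy the URL is to interpret. You can use Firebug in Firefox to get the CSS path / XPath and then use jsoup: http://jsoup.org/cookbook/extracting-data/dom-navigation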

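On the other hand, if the site has unstructured URLs, just navigate it the way you would with a browser, i.e. going back and forth. Use the first page with the links as an anchor, then go back and forward from there. In Python, you can use mechanize for this.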

How to get the pagination URL
