Question

I have the following HTML:

    <p>                         
     <a href="http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959" rel="nofollow"> Jill Martin rescues Savannah Guthrie from her guest room mess </a>   
    <a href="http://www.today.com/video/4-simple-ways-to-clear-your-clutter-this-year-596741699678" rel="nofollow"> 4 simple ways to clear your clutter this year </a>   
    <a href="http://www.today.com/video/staying-home-on-new-years-eve-great-ideas-to-celebrate-at-home-594027587814" rel="nofollow"> Staying home on New Year's Eve? Great ideas to celebrate at home </a>   
    <a href="http://www.today.com/video/heres-how-to-set-a-functional-christmas-table-591622211749" rel="nofollow"> Here's how to set a functional Christmas table </a>    
    </p>

This fragment comes from the web page http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861

And this code:

    Document document = Jsoup.connect("http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861").get(); 
    String tag = null;
    for (Element element : document.select("*") ) { 
        tag = element.tagName();

        if ( "a".equalsIgnoreCase( tag ) ) {
            LOGGER.info("element : {}; nextElementSibling: {}", element.ownText(), element.nextElementSibling() );
        }


        if ( StringUtils.containsIgnoreCase(element.ownText(), "Jill Martin rescues Savannah") ) {
            LOGGER.info("element : {}; nextElementSibling: {}", element.ownText(), element.nextElementSibling() );
            LOGGER.info("tag : {}; nextNodeSibling: {}", tag, element.nextSibling() );
            LOGGER.info("element : {}; previousElementSibling: {}", element.ownText(), element.previousElementSibling() );
        }

    }

The output I get:

    element : Jill Martin rescues Savannah Guthrie from her guest room mess; nextElementSibling: null
    tag : h2; nextNodeSibling:  
    element : Jill Martin rescues Savannah Guthrie from her guest room mess; previousElementSibling: null

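There are several problems:

- Many elements with tag a are logged from the main HTML source, but none of them come from the small HTML fragment I am examining.
- The <a> seems to be captured as <h2>.
- element.nextElementSibling() is null in most cases.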

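However, if I test against the small fragment alone, the problems disappear. So it seems that Jsoup fails to identify tags correctly when the fragment sits inside a larger HTML source.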

Any idea why?

Thanks.

Edit 2

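The intent behind this exercise is to clean up the web page. That is why I traverse the entire HTML rather than a specific part, as @Stephan suggested. I only picked out one particular part that looked problematic.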

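But after reading @luksch's reply, I went back to the original HTML and found the anomaly. The cleanup code processes all tags, with a as the exception. In the main source we have article followed by a, figure (which contains i, img, img, small, small), and h2. The problem seems to be that every tag except a is removed (which works as intended), but the text is left behind. That is why I ended up with <a href="http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959" rel="nofollow"> Jill Martin rescues Savannah Guthrie from her guest room mess </a>, left over from the original HTML source.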

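"Jill Martin rescues Savannah Guthrie from her guest room mess" is the text from the <h2>, but the <h2> tag was removed and its text left behind. Interestingly, Jsoup still considers the text to come from h2, even though the final output has no h2.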

Answer 1

The URL you provided contains the following element:

    <a class="player-tease-link" href="http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959">
    <figure class="player-tease">
      <i class="player-tease-icon icon-video-play"></i>
      <img class="tease-icon-play" src="http://nodeassets.today.com/img/svg/641a740d.video-play-white.svg" alt="Play">
      <img class="tease-image" src="http://media1.s-nbcnews.com/j/MSNBC/Components/Video/__NEW/tdy_guth_clutter_160120.today-vid-post-small-desktop.jpg" title="Jill Martin rescues Savannah Guthrie from her guest room mess" alt="Jill Martin rescues Savannah Guthrie from her guest room mess">
      <small class="tease-sponsored">Sponsored Content</small>
      <small class="tease-playing">Now Playing</small>
    </figure>
    <h2 class="player-tease-headline">Jill Martin rescues Savannah Guthrie from her guest room mess</h2>
    </a>

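So it seems you are comparing apples with oranges: the HTML fragment you posted is not part of the original HTML. My guess is that you used some extraction tool that had already altered the HTML. Note that the a element does not contain any text of its own!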
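To make that last point concrete, here is a minimal sketch, assuming a trimmed-down copy of the markup above (the OwnTextDemo class and its helper names are made up for illustration). It suggests why a scan over every element by ownText() reports h2 rather than a: the headline text belongs to the nested h2, while the a itself owns no text.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class OwnTextDemo {
    // Assumption: simplified version of the markup quoted above (icons and images trimmed).
    static final String HTML =
        "<article>"
      + "<a class=\"player-tease-link\" href=\"http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959\">"
      + "<figure class=\"player-tease\"><small class=\"tease-playing\">Now Playing</small></figure>"
      + "<h2 class=\"player-tease-headline\">Jill Martin rescues Savannah Guthrie from her guest room mess</h2>"
      + "</a>"
      + "</article>";

    /** Tag name of the first element (in document order) whose own text contains the needle, or null. */
    static String tagOwning(String html, String needle) {
        Document doc = Jsoup.parse(html);
        for (Element element : doc.select("*")) {
            if (element.ownText().contains(needle)) {
                return element.tagName();
            }
        }
        return null;
    }

    /** Own text (direct text nodes only) of the first element matching the CSS query. */
    static String ownTextOfFirst(String html, String cssQuery) {
        return Jsoup.parse(html).select(cssQuery).first().ownText();
    }

    /** Combined text (including all descendants) of the first element matching the CSS query. */
    static String textOfFirst(String html, String cssQuery) {
        return Jsoup.parse(html).select(cssQuery).first().text();
    }

    public static void main(String[] args) {
        System.out.println("a.ownText(): '" + ownTextOfFirst(HTML, "a") + "'");
        System.out.println("a.text():    '" + textOfFirst(HTML, "a") + "'");
        System.out.println("owner tag:   " + tagOwning(HTML, "Jill Martin rescues Savannah"));
    }
}
```

With this input the helper reports h2 as the owner of the headline, and the own text of the a is empty. If the goal is to work with the link as a whole, text() on the a is probably the better fit than ownText().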

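A good idea is to follow @Stephan's advice and learn how to use CSS selectors properly. That should be more efficient than selecting everything and then filtering manually in program code. Here is an example of what you can do: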

    Elements interestingAs = document.select("a:matches(^Jill Martin)");

This selects all a elements whose text starts with "Jill Martin".
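One caveat, as far as I can tell: :matches tests the combined text of the element, so on the nested markup quoted earlier the a may start with the figure's text rather than the headline. The sketch below, which assumes a trimmed-down copy of that markup and a made-up SelectorDemo class, compares the anchored selector with one that targets the h2 inside the link through :has and :matchesOwn.

```java
import org.jsoup.Jsoup;

public class SelectorDemo {
    // Assumption: simplified version of the nested markup from the live page.
    static final String HTML =
        "<article>"
      + "<a class=\"player-tease-link\" href=\"http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959\">"
      + "<figure class=\"player-tease\"><small class=\"tease-playing\">Now Playing</small></figure>"
      + "<h2 class=\"player-tease-headline\">Jill Martin rescues Savannah Guthrie from her guest room mess</h2>"
      + "</a>"
      + "</article>";

    /** Number of elements in the parsed document that match the CSS query. */
    static int count(String html, String cssQuery) {
        return Jsoup.parse(html).select(cssQuery).size();
    }

    public static void main(String[] args) {
        // Anchored match against the combined text of the a element.
        System.out.println(count(HTML, "a:matches(^Jill Martin)"));
        // Unanchored match against the combined text of the a element.
        System.out.println(count(HTML, "a:matches(Jill Martin)"));
        // Links that contain an h2 whose own text starts with the headline.
        System.out.println(count(HTML, "a:has(h2:matchesOwn(^Jill Martin))"));
    }
}
```

On the flat fragment from the question, where each a holds its text directly, the original a:matches(^Jill Martin) selector should be enough.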

Answer 2

I think the selector needs to be more specific.

Instead of document.select("*"), try document.select("a").

Answer 3

I cannot reproduce this. The following program prints exactly what you expect:

    element : Jill Martin rescues Savannah Guthrie from her guest room mess; nextElementSibling: <a href="http://www.today.com/video/4-simple-ways-to-clear-your-clutter-this-year-596741699678" rel="nofollow"> 4 simple ways to clear your clutter this year </a>
    tag : a; nextNodeSibling:  
    element : Jill Martin rescues Savannah Guthrie from her guest room mess; previousElementSibling: null
    element : 4 simple ways to clear your clutter this year; nextElementSibling: <a href="http://www.today.com/video/staying-home-on-new-years-eve-great-ideas-to-celebrate-at-home-594027587814" rel="nofollow"> Staying home on New Year's Eve? Great ideas to celebrate at home </a>
    element : Staying home on New Year's Eve? Great ideas to celebrate at home; nextElementSibling: <a href="http://www.today.com/video/heres-how-to-set-a-functional-christmas-table-591622211749" rel="nofollow"> Here's how to set a functional Christmas table </a>
    element : Here's how to set a functional Christmas table; nextElementSibling: null

Maybe you are using the wrong JSoup version? The above was run with version 1.8.3.
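The test program itself is not shown here, so the following is only a sketch of how the fragment from the question might be checked in isolation. It is not the original program. The SiblingDemo class, its helper, and the shortened two-link fragment are assumptions for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SiblingDemo {
    // Assumption: the first two links of the fragment from the question, kept inside one <p>.
    static final String FRAGMENT =
        "<p>\n"
      + " <a href=\"http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959\" rel=\"nofollow\"> Jill Martin rescues Savannah Guthrie from her guest room mess </a>\n"
      + " <a href=\"http://www.today.com/video/4-simple-ways-to-clear-your-clutter-this-year-596741699678\" rel=\"nofollow\"> 4 simple ways to clear your clutter this year </a>\n"
      + "</p>";

    /**
     * Finds the first a element whose own text contains the needle and returns
     * the own text of its next element sibling, or null if there is none.
     */
    static String nextLinkText(String html, String needle) {
        Document doc = Jsoup.parseBodyFragment(html);
        for (Element link : doc.select("a")) {
            if (link.ownText().contains(needle)) {
                Element next = link.nextElementSibling();
                return next == null ? null : next.ownText();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(nextLinkText(FRAGMENT, "Jill Martin"));
        System.out.println(nextLinkText(FRAGMENT, "4 simple ways"));
    }
}
```

Here the links are direct siblings that own their text, so nextElementSibling() returns the following a for the first link and null for the last one. On the live page the headline sits in an h2 nested inside the link, which would explain the different result.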

Jsoup tagName() gives the wrong tag
