Jsoup tagName() returns the wrong tag

Time: 2016-02-05 05:56:45

Tags: java html-parsing jsoup

I have the following HTML:

    <p>                         
     <a href="http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959" rel="nofollow"> Jill Martin rescues Savannah Guthrie from her guest room mess </a>   
    <a href="http://www.today.com/video/4-simple-ways-to-clear-your-clutter-this-year-596741699678" rel="nofollow"> 4 simple ways to clear your clutter this year </a>   
    <a href="http://www.today.com/video/staying-home-on-new-years-eve-great-ideas-to-celebrate-at-home-594027587814" rel="nofollow"> Staying home on New Year's Eve? Great ideas to celebrate at home </a>   
    <a href="http://www.today.com/video/heres-how-to-set-a-functional-christmas-table-591622211749" rel="nofollow"> Here's how to set a functional Christmas table </a>    
    </p>                        

This fragment comes from the web page http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861

The code:

    Document document = Jsoup.connect("http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861").get();
    for (Element element : document.select("*")) {
        String tag = element.tagName();

        // Log every <a> element together with its next element sibling.
        if ("a".equalsIgnoreCase(tag)) {
            LOGGER.info("element : {}; nextElementSibling: {}", element.ownText(), element.nextElementSibling());
        }

        // Log details for whichever element directly owns the headline text.
        if (StringUtils.containsIgnoreCase(element.ownText(), "Jill Martin rescues Savannah")) {
            LOGGER.info("element : {}; nextElementSibling: {}", element.ownText(), element.nextElementSibling());
            LOGGER.info("tag : {}; nextNodeSibling: {}", tag, element.nextSibling());
            LOGGER.info("element : {}; previousElementSibling: {}", element.ownText(), element.previousElementSibling());
        }
    }

The output I get:

    element : Jill Martin rescues Savannah Guthrie from her guest room mess; nextElementSibling: null
    tag : h2; nextNodeSibling:  
    element : Jill Martin rescues Savannah Guthrie from her guest room mess; previousElementSibling: null

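There are several problems:

  1. Many elements tagged <a> from the main HTML source are logged, but none of them come from the small HTML fragment I am examining.
  2. The <a> seems to be captured as an <h2>.
  3. element.nextElementSibling() is null in most cases.
  4. However, when I test against the small fragment alone, the problems go away. So Jsoup seems unable to identify the tags correctly when the fragment sits inside a larger HTML source.

Any idea why?

Thanks.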

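Edit 2

The intent behind this exercise is to clean up the web page. That is why I traverse the whole HTML rather than a specific part, as @Stephan suggested. I only singled out one particular part that looked problematic.

But after examining @luksch's reply, I looked at the original HTML again and found an anomaly in the extraction. The cleaning step handles every tag, with <a> as the exception. In the main source we have an <a>, followed by a <figure> (which contains <i>, <img>, <img>, <small>, <small>) and an <h2>. The issue seems to be that all tags (except <a>) were removed, which works as required, but their text was left behind. That is why I ended up with <a href="http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959" rel="nofollow"> Jill Martin rescues Savannah Guthrie from her guest room mess </a> left over from the original HTML source.

"Jill Martin rescues Savannah Guthrie from her guest room mess" is the text from the <h2>, but the <h2> was removed and its text was left behind. Interestingly, Jsoup still thinks the text comes from the h2, even though the final output has no h2.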
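The effect described in this edit (a tag disappears while its text stays behind in the parent) can be reproduced in isolation. The following is only a minimal sketch: it assumes the cleaning step behaves like Element.unwrap(), which may not be what the actual cleaning code does, and the class name UnwrapDemo and the shortened markup are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class UnwrapDemo {
    // Hypothetical, simplified stand-in for the teaser markup on the page.
    static final String HTML =
        "<a href=\"/video/example\">"
        + "<h2>Jill Martin rescues Savannah Guthrie from her guest room mess</h2>"
        + "</a>";

    // Returns the own text of the first <a>, optionally after unwrapping
    // every <h2> inside it (the tag is removed, its children are kept).
    static String linkOwnText(String html, boolean unwrapHeadlines) {
        Document doc = Jsoup.parse(html);
        Element link = doc.select("a").first();
        if (unwrapHeadlines) {
            for (Element h2 : link.select("h2")) {
                h2.unwrap();
            }
        }
        return link.ownText();
    }

    public static void main(String[] args) {
        System.out.println("before unwrap: '" + linkOwnText(HTML, false) + "'");
        System.out.println("after unwrap:  '" + linkOwnText(HTML, true) + "'");
    }
}
```

Before the unwrap the text is owned by the <h2>, so a loop that logs tagName() for the element owning that text reports h2; only after the unwrap does the <a> own it directly.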

3 Answers:

Answer 0: (score: 1)

The URL you provided contains the following element:

    <a class="player-tease-link" href="http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959">
      <figure class="player-tease">
        <i class="player-tease-icon icon-video-play"></i>
        <img class="tease-icon-play" src="http://nodeassets.today.com/img/svg/641a740d.video-play-white.svg" alt="Play">
        <img class="tease-image" src="http://media1.s-nbcnews.com/j/MSNBC/Components/Video/__NEW/tdy_guth_clutter_160120.today-vid-post-small-desktop.jpg" title="Jill Martin rescues Savannah Guthrie from her guest room mess" alt="Jill Martin rescues Savannah Guthrie from her guest room mess">
        <small class="tease-sponsored">Sponsored Content</small>
        <small class="tease-playing">Now Playing</small>
      </figure>
      <h2 class="player-tease-headline">Jill Martin rescues Savannah Guthrie from her guest room mess</h2>
    </a>

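So it seems you really are comparing apples with oranges: the HTML snippet you provided is not part of the original HTML. I guess you used some extraction tool that had already altered the HTML. Note that the a element does not contain any text of its own!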
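To make the point about own text concrete, here is a small sketch that runs the same kind of loop as the question over a trimmed-down copy of the element above. The class name OwnTextDemo, the helper tagsOwningText and the shortened markup are illustrative assumptions, not code from the question.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class OwnTextDemo {
    // Trimmed-down copy of the teaser element; the href is a placeholder.
    static final String HTML =
        "<a class=\"player-tease-link\" href=\"/video/example\">"
        + "<h2 class=\"player-tease-headline\">"
        + "Jill Martin rescues Savannah Guthrie from her guest room mess"
        + "</h2></a>";

    // Collects the tag names of all elements whose OWN text contains the needle.
    static List<String> tagsOwningText(String html, String needle) {
        Document doc = Jsoup.parse(html);
        List<String> tags = new ArrayList<String>();
        for (Element element : doc.select("*")) {
            if (element.ownText().contains(needle)) {
                tags.add(element.tagName());
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        // ownText() covers only direct text nodes; text() would also include descendants.
        System.out.println(tagsOwningText(HTML, "Jill Martin rescues Savannah"));
    }
}
```

Because ownText() only looks at an element's direct text nodes, the headline is attributed to the h2 and never to the surrounding a, which would explain the "tag : h2" line in the logged output.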

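A good idea would be to follow @Stephan's advice and learn how to use CSS selectors properly. That should be more efficient than selecting everything and then filtering manually in program code. Here is an example of what you could do:

    Elements interestingAs = document.select("a:matches(^Jill Martin)");

This selects all a elements whose text starts with "Jill Martin".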
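As a rough, self-contained illustration of that selector, the sketch below runs it against a two-link fragment. The class name MatchesSelectorDemo, the helper matchingHrefs and the placeholder hrefs are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class MatchesSelectorDemo {
    // Hypothetical two-link fragment modelled on the question's snippet.
    static final String HTML =
        "<p>"
        + "<a href=\"/video/1\">Jill Martin rescues Savannah Guthrie from her guest room mess</a> "
        + "<a href=\"/video/2\">4 simple ways to clear your clutter this year</a>"
        + "</p>";

    // Returns the href of every element matched by the given CSS query.
    static List<String> matchingHrefs(String html, String cssQuery) {
        List<String> hrefs = new ArrayList<String>();
        for (Element link : Jsoup.parse(html).select(cssQuery)) {
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        System.out.println(matchingHrefs(HTML, "a:matches(^Jill Martin)"));
    }
}
```

Note that :matches(regex) tests the element's full text, descendants included, so it would also match an a whose text sits inside a nested h2; :matchesOwn(regex) restricts the test to the element's own text.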

Answer 1: (score: 0)

I think the selector needs to be more specific.

Instead of document.select("*"), try document.select("a")
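A minimal sketch of that suggestion, run against a shortened two-link version of the question's snippet rather than the live page; the class name SelectLinksDemo, the helper describeLinks and the placeholder hrefs are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class SelectLinksDemo {
    // Hypothetical two-link fragment modelled on the question's snippet.
    static final String HTML =
        "<p>"
        + "<a href=\"/video/1\">Jill Martin rescues Savannah Guthrie from her guest room mess</a> "
        + "<a href=\"/video/2\">4 simple ways to clear your clutter this year</a>"
        + "</p>";

    // For each <a>, records its href and the href of its next element sibling
    // (or "null" when there is none).
    static List<String> describeLinks(String html) {
        List<String> lines = new ArrayList<String>();
        for (Element link : Jsoup.parse(html).select("a")) {
            Element next = link.nextElementSibling();
            lines.add(link.attr("href") + " -> " + (next == null ? "null" : next.attr("href")));
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describeLinks(HTML)) {
            System.out.println(line);
        }
    }
}
```

Selecting only the a elements keeps the loop from visiting unrelated tags, and nextElementSibling() then walks between sibling links while skipping the whitespace text nodes between them.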

Answer 2: (score: 0)

I cannot reproduce this. The following program prints exactly what you expect:

element : Jill Martin rescues Savannah Guthrie from her guest room mess; nextElementSibling: <a href="http://www.today.com/video/4-simple-ways-to-clear-your-clutter-this-year-596741699678" rel="nofollow"> 4 simple ways to clear your clutter this year </a>
tag : a; nextNodeSibling:  
element : Jill Martin rescues Savannah Guthrie from her guest room mess; previousElementSibling: null
element : 4 simple ways to clear your clutter this year; nextElementSibling: <a href="http://www.today.com/video/staying-home-on-new-years-eve-great-ideas-to-celebrate-at-home-594027587814" rel="nofollow"> Staying home on New Year's Eve? Great ideas to celebrate at home </a>
element : Staying home on New Year's Eve? Great ideas to celebrate at home; nextElementSibling: <a href="http://www.today.com/video/heres-how-to-set-a-functional-christmas-table-591622211749" rel="nofollow"> Here's how to set a functional Christmas table </a>
element : Here's how to set a functional Christmas table; nextElementSibling: null

Maybe you are using the wrong JSoup version? The above was run with version 1.8.3.