I have the following HTML:
<p>
<a href="http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959" rel="nofollow"> Jill Martin rescues Savannah Guthrie from her guest room mess </a>
<a href="http://www.today.com/video/4-simple-ways-to-clear-your-clutter-this-year-596741699678" rel="nofollow"> 4 simple ways to clear your clutter this year </a>
<a href="http://www.today.com/video/staying-home-on-new-years-eve-great-ideas-to-celebrate-at-home-594027587814" rel="nofollow"> Staying home on New Year's Eve? Great ideas to celebrate at home </a>
<a href="http://www.today.com/video/heres-how-to-set-a-functional-christmas-table-591622211749" rel="nofollow"> Here's how to set a functional Christmas table </a>
</p>
This fragment comes from the web page http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861
And this code:
Document document = Jsoup.connect("http://www.today.com/home/decorating-ideas-david-bromstad-shares-tips-living-luxury-less-t70861").get();
String tag = null;
for (Element element : document.select("*")) {
    tag = element.tagName();
    if ("a".equalsIgnoreCase(tag)) {
        LOGGER.info("element : {}; nextElementSibling: {}", element.ownText(), element.nextElementSibling());
    }
    if (StringUtils.containsIgnoreCase(element.ownText(), "Jill Martin rescues Savannah")) {
        LOGGER.info("element : {}; nextElementSibling: {}", element.ownText(), element.nextElementSibling());
        LOGGER.info("tag : {}; nextNodeSibling: {}", tag, element.nextSibling());
        LOGGER.info("element : {}; previousElementSibling: {}", element.ownText(), element.previousElementSibling());
    }
}
The output I get:
element : Jill Martin rescues Savannah Guthrie from her guest room mess; nextElementSibling: null
tag : h2; nextNodeSibling:
element : Jill Martin rescues Savannah Guthrie from her guest room mess; previousElementSibling: null
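There are several problems:
The loop logs a elements, but none of them come from the small HTML fragment I am checking.
The <a> is captured as <h2>.
element.nextElementSibling() is null in most cases.
However, if I test against the small fragment alone, the problems disappear. So it seems that Jsoup fails to identify tags correctly when the fragment sits inside a larger HTML source.
Any idea why?
Thanks.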
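Edit 2
The intent behind this exercise is to clean up a web page. That is why I traverse the entire HTML rather than a specific part, as @Stephan suggested. I only picked out one specific part that looked problematic.
But after checking @luksch's reply, I went back to the original HTML and found the anomaly in the scrape. The cleanup code processes all tags, with a as the exception. In the main source we have article followed by a, figure (which contains i, img, img, small, small), h2. The problem seems to be that all tags (except a) are removed (which works as intended), but the text is left behind. That is why I end up with <a href="http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959" rel="nofollow"> Jill Martin rescues Savannah Guthrie from her guest room mess </a>, left over from the original HTML source.
"Jill Martin rescues Savannah Guthrie from her guest room mess" is the text from the <h2>, but the <h2> was removed and its text was left behind. Interestingly, Jsoup still thinks the text comes from h2, even though the final output has no h2.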
Answer 0 (score: 1)
The URL you provided contains the following element:
<a class="player-tease-link" href="http://www.today.com/video/jill-martin-rescues-savannah-guthrie-from-her-guest-room-mess-604921923959">
<figure class="player-tease">
<i class="player-tease-icon icon-video-play"></i>
<img class="tease-icon-play" src="http://nodeassets.today.com/img/svg/641a740d.video-play-white.svg" alt="Play">
<img class="tease-image" src="http://media1.s-nbcnews.com/j/MSNBC/Components/Video/__NEW/tdy_guth_clutter_160120.today-vid-post-small-desktop.jpg" title="Jill Martin rescues Savannah Guthrie from her guest room mess" alt="Jill Martin rescues Savannah Guthrie from her guest room mess">
<small class="tease-sponsored">Sponsored Content</small>
<small class="tease-playing">Now Playing</small>
</figure>
<h2 class="player-tease-headline">Jill Martin rescues Savannah Guthrie from her guest room mess</h2>
</a>
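So it seems you are comparing apples with oranges: the HTML snippet you provided is not part of the original HTML. My guess is that you used some extraction tool that had already altered the HTML. Note that the a element does not contain any text of its own!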
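To make the point about own text concrete, here is a minimal sketch, assuming Jsoup is on the classpath. The class name OwnTextDemo and the shortened markup (abbreviated href and figure contents) are made up for illustration; only the a / figure / h2 nesting is taken from the element quoted above. It suggests why the loop in the question reports the headline under the h2 tag rather than under a.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class OwnTextDemo {
    // Shortened copy of the teaser markup quoted above; the href and the
    // contents of <figure> are abbreviated for this illustration.
    static final String HTML =
          "<a class=\"player-tease-link\" href=\"#\">"
        + "<figure class=\"player-tease\"><small class=\"tease-playing\">Now Playing</small></figure>"
        + "<h2 class=\"player-tease-headline\">Jill Martin rescues Savannah Guthrie from her guest room mess</h2>"
        + "</a>";

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        Element link = doc.select("a.player-tease-link").first();
        Element headline = doc.select("h2.player-tease-headline").first();

        // ownText() only returns text nodes that are direct children of the element.
        System.out.println("a ownText is empty: " + link.ownText().isEmpty());
        System.out.println("h2 ownText: " + headline.ownText());
        // text() also includes the text of descendant elements.
        System.out.println("a text: " + link.text());
        // The h2 is the last child of the a, so no element follows it.
        System.out.println("h2 nextElementSibling: " + headline.nextElementSibling());
    }
}
```

Because the headline is a direct child of the h2 and not of the a, a filter based on ownText() can only ever hit the h2 in this markup.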
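A good idea would be to follow @Stephan's advice and learn how to use CSS selectors properly. That should be more efficient than selecting everything and then filtering manually in program code. Here is an example of what you could do: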
Elements interestingAs = document.select("a:matches(^Jill Martin)");
This selects every a element whose text starts with "Jill Martin".
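As a rough check of that selector, the sketch below runs it against the four-link fragment from the question. It assumes Jsoup is available; the class name MatchesSelectorDemo and the placeholder href values are illustrative and not part of the original page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class MatchesSelectorDemo {
    // The fragment from the question, with the hrefs replaced by placeholders.
    static final String SNIPPET =
          "<p>"
        + "<a href=\"#1\" rel=\"nofollow\"> Jill Martin rescues Savannah Guthrie from her guest room mess </a>"
        + "<a href=\"#2\" rel=\"nofollow\"> 4 simple ways to clear your clutter this year </a>"
        + "<a href=\"#3\" rel=\"nofollow\"> Staying home on New Year's Eve? Great ideas to celebrate at home </a>"
        + "<a href=\"#4\" rel=\"nofollow\"> Here's how to set a functional Christmas table </a>"
        + "</p>";

    // Runs a CSS query against the given HTML and returns the matching elements.
    static Elements select(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        return doc.select(cssQuery);
    }

    public static void main(String[] args) {
        // :matches(regex) tests the regex against the element's trimmed text.
        Elements interestingAs = select(SNIPPET, "a:matches(^Jill Martin)");
        System.out.println("matches: " + interestingAs.size());
        System.out.println("href: " + interestingAs.attr("href"));
    }
}
```

Note that this only checks the fragment. In the teaser markup from the live page the a element's text would begin with the small labels inside the figure, so the anchored pattern may need adjusting there.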
Answer 1 (score: 0)
I think the selector needs to be more specific.
Instead of document.select("*"), try document.select("a").
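A small sketch of that suggestion, assuming Jsoup is on the classpath: it selects only the a elements of the fragment from the question and reports each link's next element sibling. The class name SiblingDemo, the helper describe, and the placeholder hrefs are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SiblingDemo {
    // The fragment from the question, with the hrefs replaced by placeholders.
    static final String SNIPPET =
          "<p>\n"
        + "<a href=\"#1\"> Jill Martin rescues Savannah Guthrie from her guest room mess </a>\n"
        + "<a href=\"#2\"> 4 simple ways to clear your clutter this year </a>\n"
        + "<a href=\"#3\"> Staying home on New Year's Eve? Great ideas to celebrate at home </a>\n"
        + "<a href=\"#4\"> Here's how to set a functional Christmas table </a>\n"
        + "</p>";

    // Returns one "link text -> next sibling text" line per a element.
    static List<String> describe(String html) {
        Document doc = Jsoup.parse(html);
        List<String> lines = new ArrayList<>();
        for (Element link : doc.select("a")) {
            // nextElementSibling() skips whitespace text nodes between the links.
            Element next = link.nextElementSibling();
            lines.add(link.text() + " -> " + (next == null ? "null" : next.text()));
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describe(SNIPPET)) {
            System.out.println(line);
        }
    }
}
```

On this fragment only the last link has no next element sibling, which fits the idea that the null values in the question come from the page's real structure and not from the fragment.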
Answer 2 (score: 0)
I cannot reproduce this. The following program prints exactly what you expect:
element : Jill Martin rescues Savannah Guthrie from her guest room mess; nextElementSibling: <a href="http://www.today.com/video/4-simple-ways-to-clear-your-clutter-this-year-596741699678" rel="nofollow"> 4 simple ways to clear your clutter this year </a>
tag : a; nextNodeSibling:
element : Jill Martin rescues Savannah Guthrie from her guest room mess; previousElementSibling: null
element : 4 simple ways to clear your clutter this year; nextElementSibling: <a href="http://www.today.com/video/staying-home-on-new-years-eve-great-ideas-to-celebrate-at-home-594027587814" rel="nofollow"> Staying home on New Year's Eve? Great ideas to celebrate at home </a>
element : Staying home on New Year's Eve? Great ideas to celebrate at home; nextElementSibling: <a href="http://www.today.com/video/heres-how-to-set-a-functional-christmas-table-591622211749" rel="nofollow"> Here's how to set a functional Christmas table </a>
element : Here's how to set a functional Christmas table; nextElementSibling: null
Perhaps you are using the wrong Jsoup version? The output above was produced with version 1.8.3.