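Retrieving all HTML between p tags using JSoup

Asked: 2012-11-11 10:33:11

Tags: java web-scraping jsoup

Here is the scenario. I have a large HTML file that I want to scrape with JSoup. I am new to it and have been reading some tutorials and the API reference. I have the following block of HTML.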

<p><a name="bob"></a>
<table class='schedules'>
<tr><td  align='center' colspan="5"><b>Bob the Builder</b><br>
<a href="blah blah" class='tiny'>Blah Blah Blah</a></td></tr>
<tr><td class='bk'><a href="random/randomUrl.htm">Blah</a></td><td class='bm'><a href="random/randomUrl.htm">Blah</a></td><td class='nm'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">Blah</a></td></tr>
<tr><td class='bk'><a href="random/randomUrl.htm">Blah</a></td><td class='bk'><a href="random/randomUrl.htm">Blah</a></td><td class='nm'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">Blah</a></td></tr>
<tr><td class='bk'><a href="random/randomUrl.htm">Blah</a></td><td class='bm'><a href="random/randomUrl.htm">Blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">Blah</a></td></tr>
<tr><td class='bm'><a href="random/randomUrl.htm">Blah</a></td><!--<td class='whoohaa'><a href="random/randomUrl.htm">Blah</a></td>--><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='cc'><a href="random/randomUrl.htm">blah</a></td><td class='cc'><a href="random/randomUrl.htm">Blah</a></td><td class='sk'><a href="random/randomUrl.htm">Blah</a></td></tr>
<tr></td><td class='sk'><a href="random/randomUrl.htm">Blah</a></td><td class='nm'><a href="random/randomUrl.htm">Blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">Blah</a></td></tr>
<tr><td class='sk'><a href="random/randomUrl.htm">Blah</a></td><td class='nm'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td></tr>
</table>
</p>

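There are more of these blocks, all following a similar pattern, in which the name attribute (on the first line) changes from "bob" to something else. What I want to do is first select the "bob" p block, and then retrieve all the HTML up to the closing p tag on the last line.

I have tried the following: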

Elements innerStuff = doc.select("a:contains(bob) ~ *");

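But it only gave me the links with href attributes, which I suppose is expected. However, I am struggling to see how else I could approach this.

Any help with this would be greatly appreciated.

1 Answer:

Answer 0: (score: 1)

A simpler way to select a tag based on its name attribute is: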

doc.select("a[name=bob]")

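From there, you should be able to use parent() to navigate to the element you want (for example, to get the p tag that contains the link). You need to call first() beforehand to get the first (and only) element matching the selector: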

doc.select("a[name=bob]").first().parent()
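
As a minimal sketch of the lookup above: first() returns null when nothing matches the selector, so it can help to guard against that before calling parent(). The class name AnchorLookup, the helper parentOfAnchor and the sample markup are made up for illustration, and the helper assumes the anchor name is a simple token that needs no quoting inside the selector.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AnchorLookup {
    // Hypothetical helper: returns the parent of <a name="...">,
    // or null when the document has no such anchor.
    static Element parentOfAnchor(Document doc, String name) {
        // Attribute selector: matches on the name attribute, not on link text.
        Element anchor = doc.select("a[name=" + name + "]").first();
        return anchor == null ? null : anchor.parent();
    }

    public static void main(String[] args) {
        // Illustrative markup only; the real page is much larger.
        String html = "<!DOCTYPE html><html><body>"
                + "<p><a name=\"bob\"></a>intro</p>"
                + "</body></html>";
        Document doc = Jsoup.parse(html);

        Element parent = parentOfAnchor(doc, "bob");
        System.out.println(parent.tagName());               // the enclosing p element
        System.out.println(parentOfAnchor(doc, "alice"));   // no such anchor, so null
    }
}
```

Note that a:contains(bob), as tried in the question, matches on an element's text rather than on its attributes, which is why the attribute selector is the better fit here.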

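But there is a catch: the parsed HTML document differs from the original HTML, because a table is not allowed inside a p element and the parser typically closes the paragraph before the table starts. This is the original HTML structure: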

p
    a[name=bob]
    table
        ...

And this is what the parsed HTML looks like:

p
    a[name=bob]
table
    ...
p

So, starting from the link tag, to get the table element you need to go up one level (to the paragraph) and grab the next element:

doc.select("a[name=bob]").first().parent().nextElementSibling()
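
Putting the pieces together, here is a hedged sketch that retrieves the table for a given anchor and prints its HTML. Whether the table ends up inside the paragraph or right after it can depend on the document (for example, whether it has a doctype) and on the JSoup version, so this sketch checks both places. The names ScheduleTable and tableFor and the shortened markup are assumptions for illustration, and the helper assumes the anchor sits directly inside its p element.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ScheduleTable {
    // Hypothetical helper: finds the table that belongs to <a name="...">,
    // or returns null if the anchor or the table cannot be found.
    static Element tableFor(Document doc, String name) {
        Element anchor = doc.select("a[name=" + name + "]").first();
        if (anchor == null) return null;
        Element p = anchor.parent();

        // Case 1: the parser kept the table nested inside the paragraph.
        Element nested = p.select("table").first();
        if (nested != null) return nested;

        // Case 2: the parser closed the paragraph early, so the table follows it.
        Element next = p.nextElementSibling();
        return (next != null && next.tagName().equals("table")) ? next : null;
    }

    public static void main(String[] args) {
        // Shortened stand-in for one block of the page in the question.
        String html = "<!DOCTYPE html><html><body>"
                + "<p><a name=\"bob\"></a>"
                + "<table class='schedules'>"
                + "<tr><td><b>Bob the Builder</b></td></tr>"
                + "</table></p>"
                + "</body></html>";
        Document doc = Jsoup.parse(html);

        Element table = tableFor(doc, "bob");
        // outerHtml() includes the table tag itself; html() would return only its contents.
        System.out.println(table.outerHtml());
    }
}
```

To process every block on the page, the same helper could be called once per anchor name.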