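Here is the scenario. I have a large HTML file that I want to scrape with JSoup. I'm new to this and have been reading some tutorials and the API reference. I have the following block of HTML.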
<p><a name="bob"></a>
<table class='schedules'>
<tr><td align='center' colspan="5"><b>Bob the Builder</b><br>
<a href="blah blah" class='tiny'>Blah Blah Blah</a></td></tr>
<tr><td class='bk'><a href="random/randomUrl.htm">Blah</a></td><td class='bm'><a href="random/randomUrl.htm">Blah</a></td><td class='nm'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">Blah</a></td></tr>
<tr><td class='bk'><a href="random/randomUrl.htm">Blah</a></td><td class='bk'><a href="random/randomUrl.htm">Blah</a></td><td class='nm'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">Blah</a></td></tr>
<tr><td class='bk'><a href="random/randomUrl.htm">Blah</a></td><td class='bm'><a href="random/randomUrl.htm">Blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">Blah</a></td></tr>
<tr><td class='bm'><a href="random/randomUrl.htm">Blah</a></td><!--<td class='whoohaa'><a href="random/randomUrl.htm">Blah</a></td>--><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='cc'><a href="random/randomUrl.htm">blah</a></td><td class='cc'><a href="random/randomUrl.htm">Blah</a></td><td class='sk'><a href="random/randomUrl.htm">Blah</a></td></tr>
<tr></td><td class='sk'><a href="random/randomUrl.htm">Blah</a></td><td class='nm'><a href="random/randomUrl.htm">Blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">Blah</a></td></tr>
<tr><td class='sk'><a href="random/randomUrl.htm">Blah</a></td><td class='nm'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td><td class='sk'><a href="random/randomUrl.htm">blah</a></td></tr>
</table>
</p>
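There are more of these blocks that follow a similar pattern, in which the name attribute (on the first line) changes from "bob" to something else. What I want to do is first select the "bob" p block, and then retrieve all of the HTML up to the closing p tag on the last line.
I have tried the following: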
Elements innerStuff = doc.select("a:contains(bob) ~ *");
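But it only gave me the links with href attributes, which I suppose is expected. I'm struggling to see how else I could approach this.
Any help with this would be much appreciated.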
Answer 0 (score: 1)
A simpler way to select a tag based on its name attribute is:
doc.select("a[name=bob]")
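To make the difference between the two selectors concrete, here is a minimal sketch. The HTML string, the class name AnchorSelectorSketch and the helper names are made up for illustration and are not part of the original page. It relies on `[name=bob]` matching on the attribute value, while `:contains(bob)` matches on the text inside the element.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AnchorSelectorSketch {
    // Hypothetical, trimmed-down stand-in for the page in the question.
    static final String HTML =
        "<!DOCTYPE html><html><body>"
        + "<p><a name=\"bob\"></a></p>"
        + "<p><a name=\"alice\"></a></p>"
        + "<p><a href=\"x.htm\">bob's page</a></p>"
        + "</body></html>";

    // Counts anchors whose name attribute equals the given value.
    static int countByName(String html, String name) {
        Document doc = Jsoup.parse(html);
        return doc.select("a[name=" + name + "]").size();
    }

    // Returns the href of the first anchor whose *text* contains the word,
    // or null when no anchor text matches.
    static String hrefByText(String html, String word) {
        Document doc = Jsoup.parse(html);
        Element hit = doc.select("a:contains(" + word + ")").first();
        return hit == null ? null : hit.attr("href");
    }

    public static void main(String[] args) {
        System.out.println(countByName(HTML, "bob"));
        System.out.println(hrefByText(HTML, "bob"));
    }
}
```

In this sketch the named anchor has no text, so the `:contains` selector skips it and lands on the unrelated link instead, which is why the attribute selector is the better fit here.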
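From there, you should be able to use parent() to navigate to the element you want (for example, the p tag that contains the link). You need to call first() beforehand to get the first (and only) element that matches the selector: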
doc.select("a[name=bob]").first().parent()
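One thing to watch for in the chained call: first() returns null when the selector matches nothing, so calling parent() on it directly would throw a NullPointerException. A small defensive sketch follows; the class name ParentLookupSketch and the helper parentTagOf are hypothetical names chosen for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParentLookupSketch {
    // Returns the tag name of the element that encloses <a name="...">,
    // or null if the document has no such anchor.
    static String parentTagOf(String html, String anchorName) {
        Document doc = Jsoup.parse(html);
        Element anchor = doc.select("a[name=" + anchorName + "]").first();
        if (anchor == null) {
            return null; // selector matched nothing
        }
        Element parent = anchor.parent();
        return parent == null ? null : parent.tagName();
    }

    public static void main(String[] args) {
        // Made-up input for illustration.
        String html = "<!DOCTYPE html><p><a name=\"bob\"></a></p>";
        System.out.println(parentTagOf(html, "bob"));
        System.out.println(parentTagOf(html, "alice"));
    }
}
```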
But there is a catch: the parsed HTML document is not structured the same way as the original HTML. This is the original HTML structure:
p
  a[name=bob]
  table
    ...
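And this is what the parsed HTML looks like. A p element cannot contain a table, so the parser closes the paragraph before the table, and the leftover closing p tag becomes an empty paragraph of its own: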
p
  a[name=bob]
table
  ...
p
So, starting from the link tag, to get to the table element you need to go up one level (to the paragraph) and take the next sibling element:
doc.select("a[name=bob]").first().parent().nextElementSibling()
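Putting the steps together, here is a hedged sketch that locates the table for a given anchor and returns its HTML via outerHtml(). The sample HTML is a shortened stand-in for the block in the question, and the class and method names are invented for this example. Because the exact tree may depend on the jsoup version and on whether the input has a doctype (in some setups the table may stay inside the p), the sketch checks both possible positions rather than assuming one.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ScheduleTableSketch {
    // Shortened, hypothetical version of the block from the question.
    static final String HTML =
        "<!DOCTYPE html><html><body>"
        + "<p><a name=\"bob\"></a>\n"
        + "<table class='schedules'>"
        + "<tr><td class='bk'><a href=\"random/randomUrl.htm\">Blah</a></td></tr>"
        + "</table>\n</p>"
        + "</body></html>";

    // Finds the table that belongs to <a name="...">, or null if none is found.
    static Element tableFor(Document doc, String anchorName) {
        Element anchor = doc.select("a[name=" + anchorName + "]").first();
        if (anchor == null) {
            return null;
        }
        // Case 1: the table was kept inside the p, right after the anchor.
        Element sibling = anchor.nextElementSibling();
        if (sibling != null && sibling.tagName().equals("table")) {
            return sibling;
        }
        // Case 2: the parser closed the p early, so the table follows the p.
        Element parent = anchor.parent();
        Element next = (parent == null) ? null : parent.nextElementSibling();
        return (next != null && next.tagName().equals("table")) ? next : null;
    }

    // Returns the table's full HTML (including the table tag), or null.
    static String tableHtml(String html, String anchorName) {
        Element table = tableFor(Jsoup.parse(html), anchorName);
        return table == null ? null : table.outerHtml();
    }

    public static void main(String[] args) {
        System.out.println(tableHtml(HTML, "bob"));
    }
}
```

Calling tableHtml with a different anchor name (for example one of the other blocks on the page) would follow the same path, and a name that does not exist yields null instead of an exception.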