I'll start from the beginning. I have HTML that follows this pattern:
<div id="post_message_(some numeric id)">
<div style="some style things">
<div class="smallfont" style="some style">useless text</div>
<table cellpading="6" cellspaceing=.......> a lot of text inside i dont need</table>
</div>
Text i need
</div>
The styled divs and the table are optional; sometimes there is only:
<div id="post">
Text i need
</div>
I want to extract that text as a String. Here is the code I'm using:
Elements divsInside = element.getElementById("post_message_" + id).getElementsByTag("div");
for (Element div : divsInside) {
    if (div != null && div.attr("style").equals("margin:20px; margin-top:5px; ")) {
        System.out.println(div.html());
        div.remove();
        System.out.println("div removed");
    }
}
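I added those print statements to check whether it finds the divs, and yes, it prints the right content. But later, when I parse the result into a String: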
String message = Jsoup.parse(divsInside.html().replaceAll("(?i)<br[^>]*>", "br2n")).text()
.replaceAll("br2n", "\n");
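For some reason, the String still contains all the removed content.
I also tried removing them through an iterator, and removing the elements entirely by index, but the result is the same.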
Answer 0 (score: 1)
So you want to get "Text i need". Use the ownText() method of Element, which is documented as: "Gets the text owned by this element only; does not get the combined text of all children."
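import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Entities.EscapeMode;

private static void test(String htmlFile) {
    File input = null;
    Document doc = null;
    Element specificIdDiv = null;
    try {
        // Parse the file as ASCII with an empty base URI
        input = new File(htmlFile);
        doc = Jsoup.parse(input, "ASCII", "");
        doc.outputSettings().charset("ASCII");
        doc.outputSettings().escapeMode(EscapeMode.base);
        // Get the element with id = post_message_1
        specificIdDiv = doc.getElementById("post_message_1");
        if (specificIdDiv != null) {
            // ownText() skips the text of the nested divs and the table
            System.out.println("content: " + specificIdDiv.ownText());
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}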
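As for why the removed content kept showing up: as far as I can tell, remove() detaches the div from the document tree, but the Elements list returned earlier by getElementsByTag still holds references to the detached elements, so calling html() on that list prints them anyway. Below is a minimal sketch that checks both shapes of the HTML from the question using an inline string instead of a file. The class name OwnTextDemo, the helper names viaOwnText and viaRemove, and the sample markup are made up for illustration; only the Jsoup calls themselves come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical demo class; the names and the inline HTML are illustrative only.
final class OwnTextDemo {

    // Shape 1 from the question: a styled wrapper div with useless content.
    static final String FULL =
        "<div id=\"post_message_1\">"
      + "<div style=\"margin:20px; margin-top:5px; \">"
      + "<div class=\"smallfont\">useless text</div>"
      + "<table><tr><td>quoted text</td></tr></table>"
      + "</div>"
      + "Text i need"
      + "</div>";

    // Shape 2 from the question: only the wanted text.
    static final String BARE = "<div id=\"post_message_1\">Text i need</div>";

    // Approach from the answer: read only the text owned by the post div.
    static String viaOwnText(String html) {
        Document doc = Jsoup.parse(html);
        Element post = doc.getElementById("post_message_1");
        return post == null ? "" : post.ownText();
    }

    // Alternative: detach the child divs, then read the text from the
    // parent element itself (not from a list collected before the removal).
    static String viaRemove(String html) {
        Document doc = Jsoup.parse(html);
        Element post = doc.getElementById("post_message_1");
        if (post == null) {
            return "";
        }
        for (Element child : post.children()) {
            if ("div".equals(child.tagName())) {
                child.remove();
            }
        }
        return post.text();
    }

    public static void main(String[] args) {
        System.out.println(viaOwnText(FULL));
        System.out.println(viaOwnText(BARE));
        System.out.println(viaRemove(FULL));
        System.out.println(viaRemove(BARE));
    }
}
```

All four calls should print "Text i need". If the line breaks from br tags matter, the br2n replacement from the question can still be applied, as long as it runs on the html() of the parent element after the removal rather than on the old divsInside list.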