I have a document to parse that contains HTML, and I want to convert that HTML to plain text while keeping some formatting.
Example excerpt:
<p>My simple paragraph</p>
<p>My paragraph with <a>Link</a></p>
<p>My paragraph with an <img/></p>
I can handle this simple example easily (though probably not very efficiently) with:
StringBuilder sb = new StringBuilder();
// Walk every element and keep only the text of <p> tags
for (Element element : doc.getAllElements()) {
    if ("p".equals(element.tagName())) {
        sb.append(element.text());
        sb.append("\n\n"); // blank line between paragraphs
    }
}
Is it possible (and if so, how) to insert output for inline elements at the correct position? An example:
<p>My paragraph with <a>Link</a> in the middle</p>
would become:
My paragraph with (Location: http://mylink.com) in the middle
Answer 0 (score: 1)
You can replace each link tag with a TextNode:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

final String html = "<p>My simple paragraph</p>\n"
        + "<p>My paragraph with <a>Link</a></p>\n"
        + "<p>My paragraph with an <img/></p>";

Document doc = Jsoup.parse(html, "");

// Select all link tags and replace each one with a TextNode
for( Element element : doc.select("a") )
{
    // The second argument is the base URI; recent jsoup releases drop it,
    // so use new TextNode("(Location: http://mylink.com)") there instead
    element.replaceWith(new TextNode("(Location: http://mylink.com)", ""));
}

StringBuilder sb = new StringBuilder();

// Format as needed
for( Element element : doc.select("*") )
{
    // An alternative to the 'if' statement
    switch(element.tagName())
    {
        case "p":
            sb.append(element.text()).append("\n\n");
            break;
        // Maybe you have to format some other tags here too ...
    }
}

System.out.println(sb);
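
The snippet above hard-codes the replacement text. If the links carry an href attribute, the same idea can be sketched as a walk over a paragraph's child nodes in document order, emitting text nodes as they are and substituting each link where it occurs. The sketch below is not jsoup code: so that it runs with nothing but the JDK, it uses the built-in XML parser, which only works because the sample markup is well-formed XML (real-world HTML, or entities such as &nbsp;, would need an HTML parser such as jsoup). The class name InlineLinkText, the helper names, and the href value on the sample link are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; not part of jsoup or of the original answer.
public class InlineLinkText {

    /** Converts well-formed markup to plain text, with a blank line after each <p>. */
    public static String toPlainText(String markup) {
        Element root;
        try {
            // Wrap in a single root so several sibling <p> elements form one XML document.
            root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader("<root>" + markup + "</root>")))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Markup is not well-formed XML", e);
        }

        StringBuilder sb = new StringBuilder();
        NodeList paragraphs = root.getElementsByTagName("p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            appendInline(paragraphs.item(i), sb);
            sb.append("\n\n");
        }
        return sb.toString();
    }

    /** Appends the children of node in document order, substituting links in place. */
    private static void appendInline(Node node, StringBuilder sb) {
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                sb.append(child.getNodeValue());
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                Element el = (Element) child;
                if ("a".equals(el.getTagName())) {
                    // The link text is dropped and replaced by its target.
                    sb.append("(Location: ").append(el.getAttribute("href")).append(")");
                } else {
                    // Other inline elements: keep whatever text they contain.
                    appendInline(el, sb);
                }
            }
        }
    }

    public static void main(String[] args) {
        // Assumed sample: the question's paragraph with an href added to the link.
        String html = "<p>My paragraph with <a href=\"http://mylink.com\">Link</a> in the middle</p>";
        System.out.print(toPlainText(html));
    }
}
```

Because the text before and after the link is emitted in the order it appears, the substitution lands in the middle of the sentence without any string searching. With jsoup, a comparable walk could presumably be written over an element's childNodes(), reading the target with attr("href").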