Formatting text output from HTML using jsoup

Date: 2013-10-28 20:36:18

Tags: java html-parsing jsoup text-parsing

I have a document to parse that contains HTML, and I want to convert the HTML to plain text while keeping some formatting.

Sample excerpt:

<p>My simple paragragh</p>
<p>My paragragh with <a>Link</a></p>
<p>My paragragh with an <img/></p>

I can handle this simple example easily (though probably not very efficiently) with:

StringBuilder sb = new StringBuilder();

for(Element element : doc.getAllElements()){
    if(element.tag().getName().equals("p")){
        sb.append(element.text());
        sb.append("\n\n");
    }
}

Is it possible (and if so, how) to insert output for inline elements at the correct position? For example:

<p>My paragragh with <a>Link</a> in the middle</p> 

would become:

My paragragh with (Location: http://mylink.com) in the middle

1 Answer:

Answer 0 (score: 1):

You can replace each link tag with a TextNode:

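import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

final String html = "<p>My simple paragragh</p>\n"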
        + "<p>My paragragh with <a>Link</a></p>\n"
        + "<p>My paragragh with an <img/></p>";

Document doc = Jsoup.parse(html, "");

// Select all link tags and replace them with TextNodes.
// The URL is hardcoded here because the sample links carry no href attribute.
for( Element element : doc.select("a") )
{
    element.replaceWith(new TextNode("(Location: http://mylink.com)", ""));
}


StringBuilder sb = new StringBuilder();

// Format as needed
for( Element element : doc.select("*") )
{
    // An alternative to the 'if'-statement
    switch(element.tagName())
    {
        case "p":
            sb.append(element.text()).append("\n\n");
            break;
        // Maybe you have to format some other tags here too ...
    }
}

System.out.println(sb);
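
The answer above hardcodes the replacement text. As a minimal sketch of the same idea, the version below builds the replacement from each link's own href attribute, so the text lands where the link was. It assumes the links actually have an href (the sample excerpt's links do not) and a jsoup version that provides the single-argument TextNode(String) constructor; older versions use the two-argument form shown above. The class name LinkFormatter and the method toPlainText are hypothetical names chosen for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

public class LinkFormatter {

    // Hypothetical helper: renders each <p> as a plain-text paragraph,
    // replacing every <a href="..."> with "(Location: <href>)" in place.
    static String toPlainText(String html) {
        Document doc = Jsoup.parse(html);

        // select() returns a snapshot list, so replacing nodes while iterating is safe.
        for (Element link : doc.select("a[href]")) {
            String replacement = "(Location: " + link.attr("href") + ")";
            // Assumption: single-argument TextNode constructor (newer jsoup versions).
            link.replaceWith(new TextNode(replacement));
        }

        StringBuilder sb = new StringBuilder();
        // Selecting "p" directly avoids walking every element in the document.
        for (Element p : doc.select("p")) {
            sb.append(p.text()).append("\n\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<p>My paragraph with <a href=\"http://mylink.com\">Link</a> in the middle</p>";
        System.out.print(toPlainText(html));
    }
}
```

Links without an href are left untouched by the a[href] selector, so their visible text still appears in the output. Other inline elements such as img could be handled with a similar select-and-replace loop before the paragraphs are collected.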