My HTML document contains this structure:
<p>
"<em>You</em> began the evening well, Charlotte," said Mrs. Bennet with civil self–command to Miss Lucas. "<em>You</em> were Mr. Bingley's first choice."
</p>
However, I need to wrap my "plain text" in tags so that I can process it :)
<p>
<text>"</text>
<em>You</em>
<text> began the evening well, Charlotte," said Mrs. Bennet with civil self–command to Miss Lucas. "</text>
<em>You</em>
<text> were Mr. Bingley's first choice."</text>
</p>
Any ideas on how to achieve this? I have looked at tagsoup and jsoup, but neither seems to make this easy. Maybe some fancy regular expression would do it.
Thanks
Answer 0 (score: 5)
Here is a suggestion:
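import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Tag;

// Builds a <text> element containing the given string.
public static Node toTextElement(String str) {
    Element e = new Element(Tag.valueOf("text"), "");
    e.appendText(str);
    return e;
}

// Recursively replaces every TextNode under root with a <text> element.
public static void replaceTextNodes(Node root) {
    if (root instanceof TextNode) {
        root.replaceWith(toTextElement(((TextNode) root).text()));
    } else {
        for (Node child : root.childNodes())
            replaceTextNodes(child);
    }
}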
Test code:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

String html = "<p>\"<em>You</em> began the evening well, Charlotte,\" " +
        "said Mrs. Bennet with civil self–command to Miss Lucas." +
        " \"<em>You</em> were Mr. Bingley's first choice.\"</p>";
Document doc = Jsoup.parse(html);
for (Node n : doc.body().children())
    replaceTextNodes(n);
System.out.println(doc);
Output:
<html>
<head></head>
<body>
<p>
<text>
"
</text><em>
<text>
You
</text></em>
<text>
began the evening well, Charlotte," said Mrs. Bennet with civil self–command to Miss Lucas. "
</text><em>
<text>
You
</text></em>
<text>
were Mr. Bingley's first choice."
</text></p>
</body>
</html>
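Note that this output also wraps the text inside each <em>, whereas the target structure in the question leaves <em>You</em> untouched and only wraps text that sits directly under <p>. The following is a minimal sketch of that non-recursive variant. It is not the jsoup code from the answer: it uses only the JDK's built-in XML DOM so that it runs without extra dependencies, and it therefore assumes the fragment is well-formed XML, which real-world HTML often is not. The class name TopLevelTextWrapper and the method wrapTopLevelText are hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: wraps only the text nodes that are direct children
// of the root element, leaving nested elements such as <em> unchanged.
public class TopLevelTextWrapper {

    // Assumption: the input is a single well-formed XML element, e.g. one <p>.
    public static String wrapTopLevelText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Element root = doc.getDocumentElement();

        // Copy the children first, because the DOM child list is live
        // and would change underneath us while nodes are being replaced.
        List<Node> children = new ArrayList<>();
        for (Node c = root.getFirstChild(); c != null; c = c.getNextSibling()) {
            children.add(c);
        }

        for (Node child : children) {
            if (child.getNodeType() == Node.TEXT_NODE) {
                Element wrapper = doc.createElement("text");
                root.replaceChild(wrapper, child); // put <text> where the text was
                wrapper.appendChild(child);        // move the text inside <text>
            }
        }

        // Serialize the root element back to a string without an XML declaration.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(root), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String html = "<p>\"<em>You</em> began the evening well, Charlotte.\" "
                + "\"<em>You</em> were Mr. Bingley's first choice.\"</p>";
        System.out.println(wrapTopLevelText(html));
    }
}
```

The same idea should carry over to the jsoup version: rather than recursing into every node, loop over a copy of the <p> element's direct child nodes and replace only those that are text nodes. Whitespace-only text nodes are wrapped as well in this sketch, so filter them out first if that is not wanted.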