I have a String that contains part of an email, and I want to remove all HTML encoding from this String.
This is my current code:
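import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities.EscapeMode;
import org.jsoup.safety.Cleaner;
import org.jsoup.safety.Whitelist;

public static String html2text(String html) {
    Document document = Jsoup.parse(html);
    // Strip everything except the basic whitelisted tags
    document = new Cleaner(Whitelist.basic()).clean(document);
    document.outputSettings().escapeMode(EscapeMode.xhtml);
    document.outputSettings().charset("UTF-8");
    html = document.body().html();
    html = html.replaceAll("<br />", "");
    // Keep only the text from the salutation onward
    String[] splittedStr = html.split("Geachte heer/mevrouw,");
    html = splittedStr[1];
    html = "Geachte heer/mevrouw," + html;
    return html;
}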
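This method removes all the HTML and keeps the line breaks and most of the layout. But it also returns some & and nbsp; fragments that are not completely removed. See the output below: as you can see, the String still contains some tags and even partial tags. How can I get rid of these?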
Loonheffingen &n= bsp; Naam
nr in administratie &nbs= p; meldingen
nummer
1 &n= bsp; = ; 0 &= nbsp; &nbs= p; 1
123456789L01
Edit
<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">De afgekeurde meldingen zijn opgenomen in de bijlage: Afgekeurde meldingen.</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">
<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">Wilt u zo spoedig mogelijk zorgdragen dat deze</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">
<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">meldingen gecorrigeerd worden aangeleverd?</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">
<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">mer</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">
<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">Volg Aantal verwerkt Aantal afgekeurde</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">
<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif"> Loonheffingen Naam</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">
<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">nr in administratie meldingen</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">
<span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif"> nummer</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">
<br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif"><span style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">1 0 1</span><br style="color:rgb(34,34,34);font-size:13px;font-family:arial,sans-serif">
This is part of the HTML I want to parse. I want to remove all the HTML but keep the layout of the original email.
Any help is appreciated,
Thanks!
Solved
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Parser;
import org.jsoup.select.Elements;

// 'file' holds the HTML to parse
Document xmlDoc = Jsoup.parse(file, "", Parser.xmlParser());
Elements spans = xmlDoc.select("span");
for (Element span : spans) {
    String html = textPlus(span);
    System.out.println(html);
}

public static String textPlus(Element elem) {
    List<TextNode> textNodes = elem.textNodes();
    if (textNodes.isEmpty()) {
        return "";
    }
    StringBuilder result = new StringBuilder();
    // start at the first text node
    Node currentNode = textNodes.get(0);
    while (currentNode != null) {
        // append deep text of all subsequent nodes
        if (currentNode instanceof TextNode) {
            TextNode currentText = (TextNode) currentNode;
            result.append(currentText.text());
        } else if (currentNode instanceof Element) {
            Element currentElement = (Element) currentNode;
            result.append(currentElement.text());
        }
        currentNode = currentNode.nextSibling();
    }
    return result.toString();
}
The code is from an answer to this question.
Answer 0 (score: 1)
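Rather than doing this, you need to walk the HTML structure returned by JSoup and collect the text nodes. That way you let JSoup determine what the real text is and handle the entity decoding for you (e.g. &amp; -> &, etc.).
For details, see this SO question.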
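As a rough illustration of walking the node tree, here is a minimal sketch. The class name HtmlToText and the method toPlainText are made up for this example, and two choices are assumptions about the layout you want rather than anything JSoup requires: each br becomes a newline, and each non-breaking space becomes a plain space. It reads text nodes with getWholeText(), so entities are already decoded and whitespace is left as it appears in the source.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

// Hypothetical helper class, named for this example only
public class HtmlToText {

    public static String toPlainText(String html) {
        Document document = Jsoup.parse(html);
        final StringBuilder out = new StringBuilder();

        // Visit every node under <body> in document order
        document.body().traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    // Entities are already decoded by the parser;
                    // assumption: turn non-breaking spaces into plain spaces
                    String text = ((TextNode) node).getWholeText();
                    out.append(text.replace('\u00a0', ' '));
                } else if ("br".equals(node.nodeName())) {
                    // Assumption: a <br> marks a line break in the email
                    out.append('\n');
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node
            }
        });
        return out.toString();
    }
}
```

Because getWholeText() keeps whitespace untouched, any literal newlines between tags in the source HTML also end up in the result, so you may want to trim or collapse blank lines afterwards.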