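I can use JEditorPane to parse RTF text and convert it to HTML, but the HTML output is missing some of the formatting, in this case the strike-through markup. As you can see in the output, the underlined text is correctly wrapped in <u>, but the strike-through text has no wrapper at all. Any ideas?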
public void testRtfToHtml()
{
    JEditorPane pane = new JEditorPane();
    pane.setContentType("text/rtf");
    StyledEditorKit kitRtf = (StyledEditorKit) pane.getEditorKitForContentType("text/rtf");
    try
    {
        kitRtf.read(
            new StringReader(
                "{\\rtf1\\ansi \\deflang1033\\deff0{\\fonttbl {\\f0\\froman \\fcharset0 \\fprq2 Times New Roman;}}{\\colortbl;\\red0\\green0\\blue0;} {\\stylesheet{\\fs20 \\snext0 Normal;}} {\\plain \\fs26 \\strike\\fs26 This is supposed to be strike-through.}{\\plain \\fs26 \\fs26 } {\\plain \\fs26 \\ul\\fs26 Underline text here} {\\plain \\fs26 \\fs26 .{\\u698\\'20}}"),
            pane.getDocument(), 0);
        kitRtf = null;
        StyledEditorKit kitHtml =
            (StyledEditorKit) pane.getEditorKitForContentType("text/html");
        Writer writer = new StringWriter();
        kitHtml.write(writer, pane.getDocument(), 0, pane.getDocument().getLength());
        System.out.println(writer.toString());
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
}
Output:
<html>
<head>
<style>
<!--
p.Normal {
RightIndent:0.0;
FirstLineIndent:0.0;
LeftIndent:0.0;
}
-->
</style>
</head>
<body>
<p class=default>
<span style="color: #000000; font-size: 13pt; font-family: Times New Roman">
This is supposed to be strike-through.
</span>
<span style="color: #000000; font-size: 13pt; font-family: Times New Roman">
</span>
<span style="color: #000000; font-size: 13pt; font-family: Times New Roman">
<u>Underline text here</u>
</span>
<span style="color: #000000; font-size: 13pt; font-family: Times New Roman">
.?
</span>
</p>
</body>
</html>
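One way to narrow this down is to walk the parsed document yourself and see which attribute, if any, the strike-through run carries. The following is only a minimal sketch, not the Swing HTML writer: the class name `StrikeAwareHtml` and its methods are hypothetical, and the plain string key `"strike"` is an assumption about how the RTF reader might store the flag, so verify it against your JDK by dumping the run's attribute names. It emits `<strike>` for any run flagged under either key.

```java
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.Element;
import javax.swing.text.SimpleAttributeSet;
import javax.swing.text.StyleConstants;
import javax.swing.text.StyledDocument;

// Hypothetical helper: a tiny HTML emitter that knows about strike-through.
class StrikeAwareHtml {

    // True if a run is struck through under either attribute key.
    static boolean isStruck(AttributeSet attrs) {
        if (StyleConstants.isStrikeThrough(attrs)) {
            return true; // the standard Swing key
        }
        // ASSUMPTION: the RTF reader may file the flag under the string key
        // "strike" instead; this key is a guess to check on your JDK.
        return Boolean.TRUE.equals(attrs.getAttribute("strike"));
    }

    // Escapes the characters that are significant in HTML text.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Emits one <p> per paragraph element and wraps styled runs in tags.
    static String toHtml(StyledDocument doc) throws BadLocationException {
        StringBuilder out = new StringBuilder();
        Element root = doc.getDefaultRootElement();
        for (int p = 0; p < root.getElementCount(); p++) {
            Element para = root.getElement(p);
            out.append("<p>");
            for (int r = 0; r < para.getElementCount(); r++) {
                Element run = para.getElement(r);
                int start = run.getStartOffset();
                // The model ends with an implicit newline past getLength().
                int end = Math.min(run.getEndOffset(), doc.getLength());
                if (end <= start) {
                    continue;
                }
                String text = escape(doc.getText(start, end - start).replace("\n", ""));
                if (text.isEmpty()) {
                    continue;
                }
                AttributeSet attrs = run.getAttributes();
                boolean struck = isStruck(attrs);
                boolean underlined = StyleConstants.isUnderline(attrs);
                if (underlined) out.append("<u>");
                if (struck) out.append("<strike>");
                out.append(text);
                if (struck) out.append("</strike>");
                if (underlined) out.append("</u>");
            }
            out.append("</p>\n");
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Build a small document by hand instead of parsing RTF.
        DefaultStyledDocument doc = new DefaultStyledDocument();
        SimpleAttributeSet strike = new SimpleAttributeSet();
        StyleConstants.setStrikeThrough(strike, true);
        doc.insertString(0, "gone", strike);
        doc.insertString(doc.getLength(), " kept", new SimpleAttributeSet());
        System.out.print(toHtml(doc));
    }
}
```

With the code from the question, the same method could be called as `StrikeAwareHtml.toHtml((StyledDocument) pane.getDocument())` after `kitRtf.read(...)`, assuming the pane's document is a `StyledDocument`, which is what `StyledEditorKit` creates by default.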
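Answer 0 (score: 2)
You can try doing the conversion with OpenOffice or LibreOffice, using this converter library as described in this blog post.
Answer 1 (score: 0)
This is the function I use to convert the RTF body of a .msg file to HTML. See yamp, my Outlook message parser repository on GitHub.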
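public static String rtfToHtml(String rtfText) {
    if (rtfText != null) {
        // Unwrap the HTML tags encapsulated in the RTF and drop the RTF-only parts.
        rtfText = rtfText.replaceAll("\\{\\\\\\*\\\\[m]?htmltag[\\d]*(.*)}", "$1")
                .replaceAll("\\\\htmlrtf[1]?(.*)\\\\htmlrtf0", "")
                .replaceAll("\\\\htmlrtf[01]?", "")
                .replaceAll("\\\\htmlbase", "")
                // Paragraph, tab, and break control words become plain whitespace.
                .replaceAll("\\\\par", "\n")
                .replaceAll("\\\\tab", "\t")
                .replaceAll("\\\\line", "\n")
                .replaceAll("\\\\page", "\n\n")
                .replaceAll("\\\\sect", "\n\n")
                // Special characters become hexadecimal numeric character references.
                .replaceAll("\\\\emdash", "&#x2014;")
                .replaceAll("\\\\endash", "&#x2013;")
                .replaceAll("\\\\emspace", "&#x2003;")
                .replaceAll("\\\\enspace", "&#x2002;")
                .replaceAll("\\\\qmspace", "&#x2005;")
                .replaceAll("\\\\bullet", "&#x2022;")
                .replaceAll("\\\\lquote", "&#x2018;")
                .replaceAll("\\\\rquote", "&#x2019;")
                .replaceAll("\\\\ldblquote", "&#x201C;")
                .replaceAll("\\\\rdblquote", "&#x201D;")
                // Table rows and cells are flattened to newlines and pipes.
                .replaceAll("\\\\row", "\n")
                .replaceAll("\\\\cell", "|")
                .replaceAll("\\\\nestcell", "|")
                // Drop unescaped group braces, then unescape the escaped ones.
                .replaceAll("([^\\\\])\\{", "$1")
                .replaceAll("([^\\\\])}", "$1")
                .replaceAll("[\\\\](\\{)", "$1")
                .replaceAll("[\\\\](})", "$1")
                // Unicode control words carry a decimal code point; quote escapes carry a hex byte.
                .replaceAll("\\\\u([0-9]{2,5})", "&#$1;")
                .replaceAll("\\\\'([0-9A-Fa-f]{2})", "&#x$1;")
                // Reduce "cid:name@host" references to the bare name.
                .replaceAll("\"cid:(.*)@.*\"", "\"$1\"");
        // Return everything from the opening <html tag onward.
        int index = rtfText.indexOf("<html");
        if (index != -1) {
            return rtfText.substring(index);
        }
    }
    // No input, or no encapsulated HTML was found.
    return null;
}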
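The special-character control words handled in the function above correspond to fixed Unicode code points, and HTML numeric character references can spell a code point in either decimal (`&#8212;`) or hexadecimal (`&#x2014;`), which are easy to mix up. A minimal sketch that builds both forms from the code point itself follows; the class name `RtfSymbolEntities` and its methods are hypothetical and not part of yamp.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: RTF special-character control words -> HTML references.
class RtfSymbolEntities {

    // Control word (without the backslash) mapped to its Unicode code point.
    static final Map<String, Integer> CODE_POINTS = new LinkedHashMap<>();
    static {
        CODE_POINTS.put("emdash", 0x2014);
        CODE_POINTS.put("endash", 0x2013);
        CODE_POINTS.put("emspace", 0x2003);
        CODE_POINTS.put("enspace", 0x2002);
        CODE_POINTS.put("qmspace", 0x2005);
        CODE_POINTS.put("bullet", 0x2022);
        CODE_POINTS.put("lquote", 0x2018);
        CODE_POINTS.put("rquote", 0x2019);
        CODE_POINTS.put("ldblquote", 0x201C);
        CODE_POINTS.put("rdblquote", 0x201D);
    }

    // Hexadecimal form, e.g. &#x2014; (returns null for unknown words).
    static String hexEntity(String controlWord) {
        Integer cp = CODE_POINTS.get(controlWord);
        return cp == null ? null : "&#x" + Integer.toHexString(cp).toUpperCase() + ";";
    }

    // Decimal form, e.g. &#8212; (returns null for unknown words).
    static String decimalEntity(String controlWord) {
        Integer cp = CODE_POINTS.get(controlWord);
        return cp == null ? null : "&#" + cp + ";";
    }

    public static void main(String[] args) {
        // Print both spellings for every known control word.
        for (String word : CODE_POINTS.keySet()) {
            System.out.println(word + " " + hexEntity(word) + " " + decimalEntity(word));
        }
    }
}
```

Deriving the reference from the integer keeps the two notations consistent: the hex digits of a code point written after `&#` without the `x` would be read as a decimal number and name a different character.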