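I have an application that pulls some data from a database and saves it to a docx file. Some of that data is HTML code, and with docx4j I was able to convert the HTML into docx format. The related post is here.
Now I want to use docx4j to convert this part of the text (inside a table cell in the docx file) back into HTML and save the HTML code to the database.
I adapted some code from the docx4j samples; it looks like this: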
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

// Newer docx4j releases may need jakarta.xml.bind.JAXBElement instead.
import javax.xml.bind.JAXBElement;

import org.docx4j.Docx4J;
import org.docx4j.convert.out.HTMLSettings;
import org.docx4j.jaxb.Context;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.openpackaging.parts.WordprocessingML.AltChunkType;
import org.docx4j.openpackaging.parts.WordprocessingML.MainDocumentPart;
import org.docx4j.wml.ContentAccessor;
import org.docx4j.wml.ObjectFactory;
import org.docx4j.wml.P;
import org.docx4j.wml.Tbl;
import org.docx4j.wml.Tc;
import org.docx4j.wml.Tr;

public class AltChunkAddOfTypeHtml {

    private static ObjectFactory factory;

    private final static String inputfilepath = System.getProperty("user.dir")
            + "/test.docx";

    public static void main(String[] args) throws Exception {
        WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage
                .createPackage();
        MainDocumentPart mdp = wordMLPackage.getMainDocumentPart();
        factory = Context.getWmlObjectFactory();

        // Build a one-cell table and attach it to the document body.
        Tbl table = factory.createTbl();
        Tr tableRow = factory.createTr();
        Tc tableCell = factory.createTc();
        mdp.addObject(table);

        String xhtml = "<html><head><title>Import me</title></head><body><p>Hello World!This is the html code converted into docx!!!</p><b>tested by david</b></body></html>";
        // Attach the XHTML to the table cell as an altChunk.
        mdp.addAltChunk(AltChunkType.Xhtml, xhtml.getBytes(), tableCell);
        tableRow.getContent().add(tableCell);
        table.getContent().add(tableRow);

        // Round trip: convert the altChunk into real WordprocessingML content.
        wordMLPackage = mdp.convertAltChunks();
        wordMLPackage.save(new java.io.File(inputfilepath));

        List<Object> tableCells = getAllElementFromObject(
                wordMLPackage.getMainDocumentPart(), Tc.class);
        System.out.println(tableCells.size());
        /* only one tc in wordMLPackage */
        List<Object> paragraphsInTc = getAllElementFromObject(
                tableCells.get(0), P.class);
        System.out.println(paragraphsInTc.size());

        // Copy the cell's paragraphs into a fresh package and export that as HTML.
        System.out.println("Ready to create html.");
        WordprocessingMLPackage wordMLPackage2 = WordprocessingMLPackage
                .createPackage();
        for (Object o : paragraphsInTc) {
            wordMLPackage2.getMainDocumentPart().addObject(o);
        }
        HTMLSettings htmlSettings = Docx4J.createHTMLSettings();
        htmlSettings.setWmlPackage(wordMLPackage2);
        System.out.println("Creating html.");
        // try-with-resources closes the stream even if the export fails.
        try (OutputStream os = new FileOutputStream(new java.io.File(
                System.getProperty("user.dir") + "/sample.html"))) {
            Docx4J.toHTML(htmlSettings, os, Docx4J.FLAG_EXPORT_PREFER_XSL);
        }
    }

    /** Recursively collect every element of type toSearch found under obj. */
    private static List<Object> getAllElementFromObject(Object obj,
            Class<?> toSearch) {
        List<Object> result = new ArrayList<Object>();
        if (obj instanceof JAXBElement)
            obj = ((JAXBElement<?>) obj).getValue();
        if (obj.getClass().equals(toSearch))
            result.add(obj);
        else if (obj instanceof ContentAccessor) {
            List<?> children = ((ContentAccessor) obj).getContent();
            for (Object child : children) {
                result.addAll(getAllElementFromObject(child, toSearch));
            }
        }
        return result;
    }
}
It works for me, and this is the HTML I get:
<?xml version="1.0" encoding="utf-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"><head><meta content="text/html; charset=utf-8" http-equiv="Content-Type" /><style><!--/*paged media */ div.header {display: none }div.footer {display: none } /*@media print { */@page { size: A4; margin: 10%; @top-center {content: element(header) } @bottom-center {content: element(footer) } }/*element styles*/ .del {text-decoration:line-through;color:red;} .ins {text-decoration:none;background:#c0ffc0;padding:1px;}
/* TABLE STYLES */
/* PARAGRAPH STYLES */
.DocDefaults {display:block;margin-bottom: 4mm;line-height: 115%;font-size: 11.0pt;}
.Normal {display:block;}
/* CHARACTER STYLES */ span.DefaultParagraphFont {display:inline;}
--></style><script type="text/javascript"><!--function toggleDiv(divid){if(document.getElementById(divid).style.display == 'none'){document.getElementById(divid).style.display = 'block';}else{document.getElementById(divid).style.display = 'none';}}
--></script></head><body>
<!-- userBodyTop goes here -->
<div style="color:red">TO HIDE THESE MESSAGES, TURN OFF debug level logging for org.docx4j.convert.out.common.writer.AbstractMessageWriter </div>
<div class="document">
<p class="Normal DocDefaults " style="text-align: left;position: relative; margin-left: 0in;margin-bottom: 0in;"><span class="DefaultParagraphFont " style="font-weight: normal;color: #000000;font-style: normal;font-size: 11.0pt;;font-family: Calibri;">Hello World!This is the html code converted into docx!!!</span></p>
<p class="Normal DocDefaults " style="text-align: left;position: relative; margin-left: 0in;margin-bottom: 0in;"><span class="DefaultParagraphFont " style="font-weight: bold;color: #000000;font-style: normal;font-size: 11.0pt;;font-family: Calibri;">tested by david</span></p></div>
<!-- userBodyTail goes here -->
</body></html>
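Since I need to save this HTML to the database, is there any way to clean up the converted HTML so that it looks the way it did before it was imported into the docx? Like this: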
<html><head><title>Import me</title></head><body><p>Hello World!This is the html code converted into docx!!!</p><b>tested by david</b></body></html>
Or is there a better way to convert from docx to HTML? I hope I've made myself clear. Any hints are appreciated. Thanks in advance.
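One possible direction for the cleanup asked about here is to post-process the exported markup as a plain string. The sketch below is not a docx4j feature: it uses only java.util.regex, and the class name ExportedHtmlCleaner and its clean method are hypothetical. It assumes the export has the same shape as the sample output in this question (a single div with class "document", no nested div elements inside it, and bold text marked with "font-weight: bold" in a span's style attribute). Regex-based HTML handling is fragile, so treat it as a starting point only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of docx4j.
final class ExportedHtmlCleaner {

    // Assumes the content sits in one <div class="document"> without nested divs.
    private static final Pattern DOCUMENT_DIV =
            Pattern.compile("<div class=\"document\">(.*?)</div>", Pattern.DOTALL);

    // Captures a span's attribute text (group 1) and its inner content (group 2).
    private static final Pattern SPAN =
            Pattern.compile("<span\\b([^>]*)>(.*?)</span>", Pattern.DOTALL);

    /** Returns simplified HTML, or null if the expected wrapper div is missing. */
    static String clean(String exportedHtml) {
        Matcher doc = DOCUMENT_DIV.matcher(exportedHtml);
        if (!doc.find()) {
            return null;
        }
        String body = doc.group(1).trim();

        // Turn bold spans into <b>...</b> and unwrap every other span.
        Matcher span = SPAN.matcher(body);
        StringBuffer sb = new StringBuffer();
        while (span.find()) {
            boolean bold = span.group(1).contains("font-weight: bold");
            String inner = span.group(2);
            String replacement = bold ? "<b>" + inner + "</b>" : inner;
            span.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        span.appendTail(sb);

        // Drop the class and style attributes from paragraph tags.
        String simplified = sb.toString().replaceAll("<p\\b[^>]*>", "<p>");
        return "<html><body>" + simplified + "</body></html>";
    }

    public static void main(String[] args) {
        String exported = "<div class=\"document\">"
                + "<p class=\"Normal\"><span style=\"font-weight: bold;\">tested by david</span></p>"
                + "</div>";
        System.out.println(clean(exported));
    }
}
```

Note that the result keeps one p element per Word paragraph, so it will not be byte-for-byte identical to the HTML that was originally imported.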
Answer 0 (score: 1)
Solved by reading the paragraphs and runs from the Word table cell and then adding the HTML tags myself.
/**
 * Convert the description in a table cell back into HTML code to be saved into the database.
 *
 * @param tc the table cell holding the description
 * @return the HTML string, or null if the cell contains no paragraphs
 */
private String convertTcToHtml(Tc tc) {
    List<Object> paragraphs = getAllElementFromObject(tc, P.class);
    if (paragraphs == null || paragraphs.isEmpty()) {
        return null;
    }
    StringBuilder sb = new StringBuilder();
    sb.append("<html><body>");
    /* A description exported from ALM has only one paragraph in Word. */
    List<Object> runs = getAllElementFromObject(paragraphs.get(0), R.class);
    addRunsToHtmlStringBuffer(sb, runs);
    /* If the user modifies the description in Word, it may contain more paragraphs. */
    if (paragraphs.size() > 1) {
        sb.append("<br />");
        for (int i = 1; i < paragraphs.size(); i++) {
            List<Object> moreRuns = getAllElementFromObject(paragraphs.get(i), R.class);
            addRunsToHtmlStringBuffer(sb, moreRuns);
            /* Every paragraph should start on a new line. */
            sb.append("<br />");
        }
    }
    sb.append("</body></html>");
    return sb.toString();
}
/**
 * Append the text of a list of runs to the HTML string builder.
 * LOGGER is assumed to be a logger field declared elsewhere in the enclosing class.
 *
 * @param sb the builder that receives the generated HTML
 * @param runs the runs (org.docx4j.wml.R) of one paragraph
 */
private void addRunsToHtmlStringBuffer(StringBuilder sb, List<Object> runs) {
    if (runs != null && !runs.isEmpty()) {
        for (Object r : runs) {
            R run = (R) r;
            /* A line break inside the run becomes a <br/> before the run's text. */
            List<Object> brs = getAllElementFromObject(run, Br.class);
            if (brs != null && !brs.isEmpty()) {
                LOGGER.info("BR:");
                sb.append("<br/>");
            }
            /* One run usually has one text */
            List<Object> texts = getAllElementFromObject(run, Text.class);
            if (texts != null && !texts.isEmpty()) {
                StringBuilder text_sb = new StringBuilder();
                for (Object t : texts) {
                    Text text = (Text) t;
                    text_sb.append(text.getValue());
                }
                String htmlText = replaceWithHtmlCharacters(text_sb.toString());
                /* Only bold is mapped back to a tag; other run formatting is dropped. */
                if (run.getRPr() != null && run.getRPr().getB() != null && (run.getRPr().getB().isVal())) {
                    LOGGER.info("Bold Text:");
                    sb.append("<b>");
                    sb.append(htmlText);
                    sb.append("</b>");
                } else {
                    LOGGER.info("Normal Text:");
                    sb.append(htmlText);
                }
            }
        }
    }
}
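/**
 * Replace &, ", < and > with their HTML entities.
 *
 * @param text the plain text taken from the Word runs
 * @return the text with the special characters escaped
 */
private String replaceWithHtmlCharacters(String text) {
    /* Escape '&' first so the entities added below are not escaped again. */
    text = text.replace("&", "&amp;");
    text = text.replace("\"", "&quot;");
    text = text.replace("<", "&lt;");
    text = text.replace(">", "&gt;");
    return text;
}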
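To try the paragraph-and-run approach without a docx file at hand, the same idea can be sketched over a simplified model. Everything here is an assumption for illustration: SimpleRun stands in for a Word run (its text plus a bold flag), and RunsToHtmlSketch, escape and toHtml are hypothetical names that do not come from docx4j. Unlike the code above, this variation wraps each paragraph in a p element instead of separating paragraphs with br tags.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, docx4j-free model of the paragraph/run to HTML conversion.
final class RunsToHtmlSketch {

    /** Stand-in for a Word run: its text plus a bold flag. */
    static final class SimpleRun {
        final String text;
        final boolean bold;

        SimpleRun(String text, boolean bold) {
            this.text = text;
            this.bold = bold;
        }
    }

    /** Escapes HTML special characters; '&' goes first to avoid double escaping. */
    static String escape(String text) {
        return text.replace("&", "&amp;")
                .replace("\"", "&quot;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    /** Emits one <p> per paragraph and wraps bold runs in <b>. */
    static String toHtml(List<List<SimpleRun>> paragraphs) {
        StringBuilder sb = new StringBuilder("<html><body>");
        for (List<SimpleRun> paragraph : paragraphs) {
            sb.append("<p>");
            for (SimpleRun run : paragraph) {
                String html = escape(run.text);
                sb.append(run.bold ? "<b>" + html + "</b>" : html);
            }
            sb.append("</p>");
        }
        return sb.append("</body></html>").toString();
    }

    public static void main(String[] args) {
        List<List<SimpleRun>> paragraphs = Arrays.asList(
                Arrays.asList(new SimpleRun("Hello World!", false)),
                Arrays.asList(new SimpleRun("tested by david", true)));
        System.out.println(toHtml(paragraphs));
    }
}
```

In a real conversion the list of SimpleRun objects would be filled from the R and Text elements found in each P, as the methods above do.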