I am using Apache POI 4.0.0 to convert .doc files to .html.
    private static String ProcessingDoc(File doc, String imagedir) throws IOException, ParserConfigurationException, TransformerConfigurationException, TransformerFactoryConfigurationError {
        // Load the binary .doc file.
        FileInputStream in = new FileInputStream(doc);
        HWPFDocument doc_file = new HWPFDocument(in);

        // Empty DOM document that the converter fills with HTML nodes.
        Document html_file = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        WordToHtmlConverter converter = new WordToHtmlConverter(html_file);

        // Save embedded pictures to disk and reference them by file name.
        converter.setPicturesManager(new PicturesManager() {
            @Override
            public String savePicture(byte[] content, PictureType pictureType, String suggestedName,
                    float widthInches, float heightInches) {
                // getParentDirectory(...) is a helper that is not shown here.
                File imgFile = new File(getParentDirectory(doc));
                if (!imgFile.exists()) {
                    imgFile.mkdirs();
                }
                // try-with-resources closes the stream even if write() fails.
                try (FileOutputStream out = new FileOutputStream(imagedir + "/" + suggestedName)) {
                    out.write(content);
                } catch (Exception e) {
                    e.printStackTrace();
                }
                return suggestedName;
            }
        });
        converter.processDocument(doc_file);

        // Serialize the DOM to an HTML string.
        StringWriter stringWriter = new StringWriter();
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.ENCODING, "utf-8");
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        try {
            transformer.transform(
                    new DOMSource(converter.getDocument()),
                    new StreamResult(stringWriter));
        } catch (TransformerException e) {
            e.printStackTrace();
        }
        return stringWriter.toString();
    }
}
However, POI produces incomplete HTML files that are cut off at a different position from file to file. The output looks like this:
<some text of html document>
<tr class="r1">
<td class="td49">
<p class="p17"></p>
</td><td class="td50">
<p class="p17"></p>
</td><td class="td51">
and that is where the HTML file ends. No error is reported during the conversion.
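For context, the method above only returns the HTML as a String; the code that writes that String to disk is not shown. A writer that is never flushed or closed can leave a file cut off at an arbitrary point without throwing anything, so that step may be worth ruling out. Below is a minimal sketch of such a write step, assuming the caller saves the result itself. The class `HtmlWriterSketch` and the method `writeHtml` are hypothetical names used for illustration, not part of POI or of my project.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class HtmlWriterSketch {

    // Hypothetical helper: writes the converted HTML string to a file.
    public static void writeHtml(String html, Path target) throws IOException {
        // try-with-resources closes the writer, which also flushes its buffer,
        // even if write() throws.
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(html);
        }
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempFile("converted", ".html");
        String html = "<html><body><p>test</p></body></html>";
        writeHtml(html, target);

        // Read the file back and compare it with the in-memory string
        // to see whether anything was lost on the way to disk.
        String onDisk = new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
        System.out.println("file matches string: " + onDisk.equals(html));
    }
}
```

Comparing the length of the returned String with the size of the file on disk in the same way would show whether the cut happens inside the conversion or only when the result is saved.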
Why do I get no error even though the file is incomplete?
Thanks for your answers!