I have a DOC file that contains some images. How can I convert it to HTML with the images included?
I tried using this example: Convert Word doc to HTML programmatically in Java
    public class Converter {
        ...
        private File docFile, htmlFile;

        // try-with-resources closes the input stream even if conversion fails
        try (FileInputStream in = new FileInputStream(docFile)) {
            // Parse the binary .doc file and convert it into an HTML DOM tree
            HWPFDocument doc = new HWPFDocument(in);
            Document newDoc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(newDoc);
            wordToHtmlConverter.processDocument(doc);

            // Serialize the DOM tree to an HTML string
            StringWriter stringWriter = new StringWriter();
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            transformer.setOutputProperty(OutputKeys.ENCODING, "utf-8");
            transformer.setOutputProperty(OutputKeys.METHOD, "html");
            transformer.transform(
                new DOMSource(wordToHtmlConverter.getDocument()),
                new StreamResult(stringWriter)
            );
            String html = stringWriter.toString();

            // Write the HTML to disk as UTF-8
            try (BufferedWriter out = new BufferedWriter(
                    new OutputStreamWriter(new FileOutputStream(htmlFile), "UTF-8"))) {
                out.write(html);
            } catch (IOException e) {
                e.printStackTrace();
            }

            // Show the generated file in a simple Swing viewer
            JEditorPane jEditorPane = new JEditorPane();
            jEditorPane.setContentType("text/html");
            jEditorPane.setEditable(false);
            jEditorPane.setPage(htmlFile.toURI().toURL());
            JScrollPane jScrollPane = new JScrollPane(jEditorPane);
            JFrame jFrame = new JFrame("display html file");
            jFrame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
            jFrame.getContentPane().add(jScrollPane);
            jFrame.setSize(512, 342);
            jFrame.setVisible(true);
        } catch (Exception e) {
            e.printStackTrace();
        }
        ...
    }
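But the images are missing.
The documentation for the WordToHtmlConverter class says:
...This implementation doesn't create images or links to them. This can be changed by overriding the AbstractWordConverter.processImage(Element, boolean, Picture) method.
How can I convert a DOC to HTML with images?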
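Answer 0: (score: 3)
Your best bet in this case is to use Apache Tika and let it wrap Apache POI for you. Apache Tika will generate HTML for your document (or plain text, but you want HTML in your case). Along with that, it will put in placeholders for embedded resources, add img tags for embedded images, and give you a way to get at the contents of the embedded resources and images.
There is a very good example of this in the Alfresco HTMLRenderingEngine. You will probably want to look at the code there and then write your own to do something very similar. The code there includes a custom ContentHandler that lets you edit the img tags and rewrite the src attribute. You may or may not need that, depending on where you write the images out to.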
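The src-rewriting idea can be sketched with only the JDK's SAX classes. This is not the Alfresco code: the class name ImgSrcRewritingFilter, the nameToUrl function, and the assumption that embedded images arrive as src="embedded:<name>" are all illustrative, so check them against the SAX events your Tika version actually emits.

```java
import java.util.function.UnaryOperator;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

/** Rewrites the src attribute of img elements as SAX events pass through. */
final class ImgSrcRewritingFilter extends XMLFilterImpl {

    // Assumption: embedded images are referenced as "embedded:<name>".
    static final String EMBEDDED_PREFIX = "embedded:";

    // Hypothetical mapping from an embedded image name to the URL it was saved under.
    private final UnaryOperator<String> nameToUrl;

    ImgSrcRewritingFilter(UnaryOperator<String> nameToUrl) {
        this.nameToUrl = nameToUrl;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        boolean isImg = "img".equalsIgnoreCase(localName) || "img".equalsIgnoreCase(qName);
        if (isImg) {
            // Copy the attributes, because the incoming object may be read-only or reused.
            AttributesImpl copy = new AttributesImpl(atts);
            int i = copy.getIndex("src");
            if (i >= 0 && copy.getValue(i).startsWith(EMBEDDED_PREFIX)) {
                String name = copy.getValue(i).substring(EMBEDDED_PREFIX.length());
                copy.setValue(i, nameToUrl.apply(name));
            }
            atts = copy;
        }
        // Forward the (possibly modified) event to the downstream handler.
        super.startElement(uri, localName, qName, atts);
    }
}
```

To use it, call setContentHandler(...) with the handler that serializes the HTML, then hand the filter to the parser as its ContentHandler. Images whose src does not start with the prefix pass through unchanged.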
Answer 1: (score: 3)
Extend WordToHtmlConverter and override processImageWithoutPicturesManager.
    import java.util.Base64;

    import org.apache.poi.hwpf.converter.WordToHtmlConverter;
    import org.apache.poi.hwpf.usermodel.Picture;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;

    public class InlineImageWordToHtmlConverter extends WordToHtmlConverter {

        public InlineImageWordToHtmlConverter(Document document) {
            super(document);
        }

        // Called for each picture when no PicturesManager is set:
        // embed the image bytes directly in the HTML as a base64 data URI.
        @Override
        protected void processImageWithoutPicturesManager(Element currentBlock,
                boolean inlined, Picture picture) {
            Element imgNode = currentBlock.getOwnerDocument().createElement("img");
            // Use the basic encoder: getMimeEncoder() inserts line breaks into the output.
            String base64 = Base64.getEncoder().encodeToString(picture.getRawContent());
            imgNode.setAttribute("src", "data:" + picture.getMimeType() + ";base64," + base64);
            currentBlock.appendChild(imgNode);
        }
    }
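The data URI built in the override can be isolated into a small JDK-only helper, which makes the encoding choice easy to check on its own. This is a minimal sketch; the class and method names (DataUris, toDataUri) are hypothetical. It uses Base64.getEncoder() because Base64.getMimeEncoder() breaks its output into 76-character lines separated by CRLF, which is not wanted inside a src attribute.

```java
import java.util.Base64;

/** Hypothetical helper that builds a base64 data URI from raw image bytes. */
final class DataUris {

    private DataUris() {
    }

    static String toDataUri(String mimeType, byte[] content) {
        // The basic encoder produces a single line with no CR/LF characters.
        String base64 = Base64.getEncoder().encodeToString(content);
        return "data:" + mimeType + ";base64," + base64;
    }
}
```

Inside processImageWithoutPicturesManager this would be called as DataUris.toDataUri(picture.getMimeType(), picture.getRawContent()).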
Use the new class when parsing the document, as shown below:
    HWPFDocumentCore wordDocument = WordToHtmlUtils.loadDoc(new FileInputStream("D:/temp/Temp.doc"));
    WordToHtmlConverter wordToHtmlConverter = new InlineImageWordToHtmlConverter(
            DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
    wordToHtmlConverter.processDocument(wordDocument);
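After processDocument returns, the HTML is still an in-memory DOM tree available from wordToHtmlConverter.getDocument(). A minimal sketch of writing it to a file, using the same Transformer settings as the question and only JDK classes; the class name HtmlFileWriter is hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

/** Hypothetical helper that serializes an HTML DOM tree to a UTF-8 file. */
final class HtmlFileWriter {

    private HtmlFileWriter() {
    }

    static void write(Document htmlDocument, File target)
            throws IOException, TransformerException {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        transformer.setOutputProperty(OutputKeys.METHOD, "html");
        // Stream straight to the file; try-with-resources closes it on any outcome.
        try (OutputStream out = Files.newOutputStream(target.toPath())) {
            transformer.transform(new DOMSource(htmlDocument), new StreamResult(out));
        }
    }
}
```

With the converter above, the call would be HtmlFileWriter.write(wordToHtmlConverter.getDocument(), htmlFile). Because the images are inlined as data URIs, the resulting file needs no separate image files next to it.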