I have an MS Word .doc file that I am converting to an HTML document with Apache POI.
This is the code I am running:
    InputStream input = new FileInputStream(path);
    HWPFDocument wordDocument = new HWPFDocument(input);
    WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
            DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
    List<Picture> pics = wordDocument.getPicturesTable().getAllPictures();
    if (pics != null)
    {
        for (int i = 0; i < pics.size(); i++)
        {
            Picture pic = (Picture) pics.get(i);
            try
            {
                pic.writeImageContent(new FileOutputStream(path + pic.hashCode() + '.' + pic.suggestFileExtension()));
            }
            catch (FileNotFoundException e)
            {
                e.printStackTrace();
            }
        }
    }
    wordToHtmlConverter.setPicturesManager(new PicturesManager()
    {
        public String savePicture(byte[] content, PictureType pictureType, String suggestedName, float widthInches, float heightInches)
        {
            for (Picture picName : pics)
            {
                return Integer.toString(picName.hashCode()) + '.' + picName.suggestFileExtension();
            }
            return null;
        }
    });
    wordToHtmlConverter.processDocument(wordDocument);
    Document htmlDocument = wordToHtmlConverter.getDocument();
    ByteArrayOutputStream outStream = new ByteArrayOutputStream();
    DOMSource domSource = new DOMSource(htmlDocument);
    StreamResult streamResult = new StreamResult(outStream);
    TransformerFactory tf = TransformerFactory.newInstance();
    Transformer serializer = tf.newTransformer();
    serializer.setOutputProperty(OutputKeys.ENCODING, "gbk");
    serializer.setOutputProperty(OutputKeys.INDENT, "yes");
    serializer.setOutputProperty(OutputKeys.METHOD, "html");
    serializer.transform(domSource, streamResult);
    outStream.close();
    String html = new String(outStream.toByteArray());
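The code works fine and keeps the images and styles. However, some characters in the HTML come out wrong, as if the encoding were incorrect. For example, some bullet styles from the original .doc file are not output correctly. I have tried several charsets (ASCII, UTF-8, gbk, ...) and none of them produces the bullets correctly.
I am 99% sure the bullets show up garbled because of the encoding. Has anyone run into this problem with Apache POI?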
Answer 0 (score: 1)
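This is not an encoding problem but a font problem. Word uses ANSI codes and special fonts for its default bullet lists. For example, the first bullet point is a bullet from the font "Symbol". The second bullet point is a circle (the letter "o") from the font "Courier New", and the third bullet point is a square from the font "Wingdings".
So the simplest option is to replace the ANSI codes of the bullet text with Unicode characters. That way we can use UTF-8 for the HTML.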
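To see which codes a particular document actually uses for its bullets, it can help to print the code points of the bullet text first. The following is only a minimal sketch: `BulletInspector` and `describe` are hypothetical names, not part of Apache POI, and the sample inputs are the three characters that the replacement code in this answer handles.

```java
final class BulletInspector {

    // Formats every code point of the text as U+XXXX and marks the ones
    // that Unicode classifies as private use (no standard glyph).
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            if (Character.getType(cp) == Character.PRIVATE_USE) {
                sb.append("(private use)");
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample bullet characters taken from the replacement code in this answer.
        System.out.println(describe("\uF0B7"));
        System.out.println(describe("o"));
        System.out.println(describe("\uF0A7"));
    }
}
```

A code point flagged as private use only renders as intended in the font it was written for, which is consistent with no choice of output charset fixing the bullets.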
Example:
Word document: WordBulletList.doc
Java:
    import java.io.StringWriter;
    import java.io.FileInputStream;
    import java.io.File;
    import java.io.PrintWriter;
    import javax.xml.transform.OutputKeys;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.HWPFDocumentCore;
    import org.apache.poi.hwpf.usermodel.Paragraph;
    import org.apache.poi.hwpf.converter.WordToHtmlConverter;
    import org.w3c.dom.Document;
    import java.awt.Desktop;

    public class TestWordToHtmlConverter {

        public static void main(String[] args) throws Exception {
            Document newDocument = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(newDocument) {
                @Override
                protected void processParagraph(HWPFDocumentCore hwpfDocument,
                                                org.w3c.dom.Element parentElement,
                                                int currentTableLevel,
                                                Paragraph paragraph,
                                                java.lang.String bulletText) {
                    // Only touch paragraphs that actually carry bullet text.
                    if (bulletText != null && !bulletText.isEmpty()) {
                        //System.out.println((int)bulletText.charAt(0));
                        // Symbol bullet -> U+2022 BULLET
                        bulletText = bulletText.replace("\uF0B7", "\u2022");
                        // Courier New "o" -> U+26AA; this replaces every "o" in the bullet text
                        bulletText = bulletText.replace("\u006F", "\u00A0\u00A0\u26AA");
                        // Wingdings square -> U+25AA BLACK SMALL SQUARE
                        bulletText = bulletText.replace("\uF0A7", "\u00A0\u00A0\u00A0\u00A0\u25AA");
                    }
                    super.processParagraph(hwpfDocument, parentElement, currentTableLevel, paragraph, bulletText);
                }
            };
            wordToHtmlConverter.processDocument(new HWPFDocument(new FileInputStream("WordBulletList.doc")));

            StringWriter stringWriter = new StringWriter();
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            transformer.setOutputProperty(OutputKeys.ENCODING, "utf-8");
            transformer.setOutputProperty(OutputKeys.METHOD, "html");
            transformer.transform(new DOMSource(wordToHtmlConverter.getDocument()), new StreamResult(stringWriter));
            String html = stringWriter.toString();

            // Write the file as UTF-8 so that it matches the encoding declared above.
            try (PrintWriter out = new PrintWriter("WordBulletList.html", "UTF-8")) {
                out.println(html);
            }

            File htmlFile = new File("WordBulletList.html");
            Desktop.getDesktop().browse(htmlFile.toURI());
        }
    }
HTML:
...
    <body class="b1 b2">
        <p class="p1">
            <span>Word bullet list:</span>
        </p>
        <p class="p2">
            <span class="s1">• </span><span>Bullet1</span>
        </p>
        <p class="p2">
            <span class="s1"> ⚪ </span><span>Bullet2</span>
        </p>
        <p class="p2">
            <span class="s1"> ▪ </span><span>Bullet3</span>
        </p>
        <p class="p2">
            <span class="s1"> ⚪ </span><span>Bullet2</span>
        </p>
        <p class="p2">
            <span class="s1">• </span><span>Bullet1</span>
        </p>
        <p class="p1">
            <span>End</span>
        </p>
    </body>
...
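One caveat with the replacements in the Java code above: replacing "\u006F" changes every letter "o", so a list label that merely contains an "o" would be altered too. A possible variant is to replace only when the whole bullet text is one of the known bullet characters. This is a sketch, not part of the original answer: `BulletMapper` and `toUnicode` are hypothetical names, it assumes the bullet text may carry surrounding whitespace such as a trailing tab, and it leaves out the non-breaking-space indentation used above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class BulletMapper {

    // Known font-specific bullet characters and their Unicode stand-ins.
    private static final Map<String, String> REPLACEMENTS = new LinkedHashMap<>();
    static {
        REPLACEMENTS.put("\uF0B7", "\u2022"); // Symbol bullet -> BULLET
        REPLACEMENTS.put("o", "\u26AA");      // Courier New "o" -> MEDIUM WHITE CIRCLE
        REPLACEMENTS.put("\uF0A7", "\u25AA"); // Wingdings square -> BLACK SMALL SQUARE
    }

    // Replaces the bullet only if the trimmed text is exactly one known bullet
    // character; any other text (for example "two.") is returned unchanged.
    static String toUnicode(String bulletText) {
        if (bulletText == null || bulletText.isEmpty()) {
            return bulletText;
        }
        String key = bulletText.trim();
        String replacement = REPLACEMENTS.get(key);
        return replacement == null ? bulletText : bulletText.replace(key, replacement);
    }

    public static void main(String[] args) {
        System.out.println(toUnicode("\uF0B7"));
        System.out.println(toUnicode("two."));
    }
}
```

Inside the overridden processParagraph, the three replace calls could then be swapped for a single `bulletText = BulletMapper.toUnicode(bulletText);`.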
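Answer 1 (score: 0)
Problem solved
I finally found a fix for this particular problem. The answer was inspired by @pawelini1, who had a similar problem of his own: "Encoding issue with Apache POI".
The solution is simple: all I did was run my HTML string through URLEncoder / URLDecoder:
    String html = URLEncoder.encode(new String(outStream.toByteArray(), "UTF-8"), "UTF-8");
    String decoded = URLDecoder.decode(html, "UTF-8");
Now my web page displays correctly.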
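Note that encoding a string and immediately decoding it again should give back the same string, so the part of the snippet above that probably makes the difference is the explicit "UTF-8" passed to new String(...), in contrast to the platform default charset used in the question. A minimal sketch to illustrate this, with hypothetical class and method names and a made-up sample string:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetRoundTrip {

    // URL-encodes the text and decodes it again, as in the answer above.
    static String urlRoundTrip(String text) throws UnsupportedEncodingException {
        return URLDecoder.decode(URLEncoder.encode(text, "UTF-8"), "UTF-8");
    }

    // Turns bytes into a String using an explicitly named charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) throws Exception {
        String original = "\u2022 Bullet1"; // made-up sample text
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset the bytes were written in restores the text.
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(original));
        // Decoding with a different charset garbles the bullet character.
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1).equals(original));
        // The URL encode/decode round trip leaves the text as it was.
        System.out.println(urlRoundTrip(original).equals(original));
    }
}
```

If that holds for the real document too, decoding the bytes with the same charset the serializer was configured to write should be enough on its own.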