Encoding issue with the Apache POI converter

Asked: 2017-01-24 13:48:02

Tags: encoding apache-poi converter

I have an MS Word .doc file that I am converting to an HTML document using Apache POI.

This is the code I am running:

    InputStream input = new FileInputStream (path);
    HWPFDocument wordDocument = new HWPFDocument (input);            
    WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter (DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument() );

    List<Picture> pics = wordDocument.getPicturesTable().getAllPictures();
    if (pics != null) 
    {
        for (int i = 0; i <pics.size(); i++) 
        {
            Picture pic = (Picture) pics.get (i);
            try 
            {
                pic.writeImageContent (new FileOutputStream (path + pic.hashCode() + '.' + pic.suggestFileExtension()) );
            }
            catch (FileNotFoundException e) 
            {
                e.printStackTrace();
            }
        }
    }

    wordToHtmlConverter.setPicturesManager (new PicturesManager() 
    {               
        public String savePicture (byte[] content, PictureType pictureType, String suggestedName, float widthInches, float heightInches) 
        {
            for(Picture picName:pics)
            {
                return Integer.toString(picName.hashCode()) + '.' + picName.suggestFileExtension();
            }

            return null;
        }
    });

    wordToHtmlConverter.processDocument(wordDocument);                       
    Document htmlDocument = wordToHtmlConverter.getDocument();                        
    ByteArrayOutputStream outStream = new ByteArrayOutputStream();
    DOMSource domSource = new DOMSource(htmlDocument);
    StreamResult streamResult = new StreamResult (outStream);

    TransformerFactory tf = TransformerFactory.newInstance();
    Transformer serializer = tf.newTransformer();
    serializer.setOutputProperty (OutputKeys.ENCODING, "gbk");
    serializer.setOutputProperty (OutputKeys.INDENT, "yes");
    serializer.setOutputProperty (OutputKeys.METHOD, "html");
    serializer.transform (domSource, streamResult);
    outStream.close();

    String html = new String (outStream.toByteArray());

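The code works, and it preserves images and styles. However, some characters in the HTML come out wrong, apparently because of the encoding. For example, some of the bullet styles from the original .doc file are not rendered correctly. I have tried several charsets (ASCII, UTF-8, GBK, ...) and none of them produced the bullets correctly.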

I am 99% sure the bullets show up garbled because of the encoding. Has anyone run into this kind of problem with Apache POI?

2 answers:

Answer 0 (score: 1)

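This is not an encoding problem but a font problem. Word uses ANSI codes together with special fonts for its default bullet lists. For example, the first-level bullet is the bullet glyph from the font "Symbol", the second-level bullet is the circle (the letter "o") from the font "Courier New", and the third-level bullet is the square from the font "Wingdings".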

So the simplest option is to replace the ANSI codes in the bullet text with Unicode characters. That way we can use UTF-8 for the HTML.
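Before replacing anything, it can help to see which code points the bullet text actually contains. The following is a minimal sketch using only the JDK; the class name `BulletInspector` and the sample strings are illustrative assumptions, not part of Apache POI. It formats each code point as `U+XXXX` and marks those in the Unicode Private Use Area, which is where symbol-font codes such as U+F0B7 fall. Characters in that range have no standard glyph, so no choice of output charset can make them render as bullets.

```java
class BulletInspector {

    /** Formats each code point as U+XXXX and marks Private Use Area characters. */
    static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            // Private-use code points only have a meaning in a specific font.
            if (Character.getType(cp) == Character.PRIVATE_USE) {
                sb.append("(private-use)");
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample values, standing in for bulletText from the converter.
        System.out.println(describe("\uF0B7"));
        System.out.println(describe("o"));
    }
}
```

Calling `describe(bulletText)` from inside an overridden `processParagraph` would be one way to find out which codes a given document uses.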

Example:

Word document: WordBulletList.doc

Java:

import java.io.StringWriter;
import java.io.FileInputStream;
import java.io.File;
import java.io.PrintWriter;

import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import javax.xml.parsers.DocumentBuilderFactory;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.HWPFDocumentCore;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.apache.poi.hwpf.converter.FontReplacer;
import org.apache.poi.hwpf.converter.FontReplacer.Triplet;

import org.w3c.dom.Document;

import java.awt.Desktop;

public class TestWordToHtmlConverter {

 public static void main(String[] args) throws Exception {

  Document newDocument = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();

  WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(newDocument) {

   protected void processParagraph(HWPFDocumentCore hwpfDocument, 
                                   org.w3c.dom.Element parentElement, 
                                   int currentTableLevel, 
                                   Paragraph paragraph, 
                                   java.lang.String bulletText) {
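    // Compare by content; != on strings only compares references.
    if (bulletText != null && !bulletText.isEmpty()) {
     //System.out.println((int)bulletText.charAt(0));
     bulletText = bulletText.replace("\uF0B7", "\u2022"); // Symbol bullet -> U+2022 BULLET
     bulletText = bulletText.replace("\u006F", "\u00A0\u00A0\u26AA"); // Courier New 'o' -> indented U+26AA (replaces every 'o')
     bulletText = bulletText.replace("\uF0A7", "\u00A0\u00A0\u00A0\u00A0\u25AA"); // Wingdings square -> indented U+25AA
    }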

    super.processParagraph(hwpfDocument, parentElement, currentTableLevel, paragraph, bulletText);
   }

  };

  wordToHtmlConverter.processDocument(new HWPFDocument(new FileInputStream("WordBulletList.doc")));

  StringWriter stringWriter = new StringWriter();
  Transformer transformer = TransformerFactory.newInstance().newTransformer();
  transformer.setOutputProperty( OutputKeys.INDENT, "yes" );
  transformer.setOutputProperty( OutputKeys.ENCODING, "utf-8" );
  transformer.setOutputProperty( OutputKeys.METHOD, "html" );
  transformer.transform(new DOMSource(wordToHtmlConverter.getDocument()), new StreamResult(stringWriter));

  String html = stringWriter.toString();

  // Write the file as UTF-8 explicitly so it matches the encoding declared above.
  try(PrintWriter out = new PrintWriter("WordBulletList.html", "UTF-8")) {
    out.println(html);
  }

  File htmlFile = new File("WordBulletList.html");
  Desktop.getDesktop().browse(htmlFile.toURI());

 }
}

HTML:

...
<body class="b1 b2">
<p class="p1">
<span>Word bullet list:</span>
</p>
<p class="p2">
<span class="s1">&bull;​&nbsp;</span><span>Bullet1</span>
</p>
<p class="p2">
<span class="s1">&nbsp;&nbsp;⚪​&nbsp;</span><span>Bullet2</span>
</p>
<p class="p2">
<span class="s1">&nbsp;&nbsp;&nbsp;&nbsp;▪​&nbsp;</span><span>Bullet3</span>
</p>
<p class="p2">
<span class="s1">&nbsp;&nbsp;⚪​&nbsp;</span><span>Bullet2</span>
</p>
<p class="p2">
<span class="s1">&bull;​&nbsp;</span><span>Bullet1</span>
</p>
<p class="p1">
<span>End</span>
</p>
</body>
...
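The replacement step can also be isolated into a small helper that does not depend on POI, which makes it easier to test and extend. This is only a sketch under stated assumptions: the class name `BulletMapper` and its mapping table are hypothetical, and it covers just the three default bullets discussed in this answer. Unlike the listing above, it converts the letter "o" only when it is the entire label, so ordinary label text containing an "o" is left alone, and it does not add the non-breaking-space indentation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class BulletMapper {

    // Symbol-font codes (Private Use Area) mapped to standard Unicode characters.
    private static final Map<String, String> REPLACEMENTS = new LinkedHashMap<>();
    static {
        REPLACEMENTS.put("\uF0B7", "\u2022"); // "Symbol" bullet    -> BULLET
        REPLACEMENTS.put("\uF0A7", "\u25AA"); // "Wingdings" square -> BLACK SMALL SQUARE
    }

    /** Returns the bullet text with font-specific codes replaced by Unicode characters. */
    static String toUnicode(String bulletText) {
        if (bulletText == null || bulletText.isEmpty()) {
            return bulletText;
        }
        String result = bulletText;
        for (Map.Entry<String, String> e : REPLACEMENTS.entrySet()) {
            result = result.replace(e.getKey(), e.getValue());
        }
        // "Courier New" circle: convert only when the label is a lone 'o'
        // (ignoring surrounding whitespace), not when 'o' is part of a word.
        if (result.trim().equals("o")) {
            result = result.replace('o', '\u26AA'); // MEDIUM WHITE CIRCLE
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(toUnicode("\uF0B7\t"));
        System.out.println(toUnicode("o\t"));
    }
}
```

Inside an overridden `processParagraph`, one would then write `bulletText = BulletMapper.toUnicode(bulletText);` before delegating to `super.processParagraph(...)`. Documents that use other symbol-font bullets would need more entries in the table.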

Answer 1 (score: 0)

Problem solved

I finally found a solution to this particular problem. The answer was inspired by @pawelini1, who had a question of his own: "Encoding issue with Apache POI".

The solution is simple: all I did was run my HTML string through URLEncoder / URLDecoder.

String html = URLEncoder.encode(new String(outStream.toByteArray(), "UTF-8"), "UTF-8");
String decoded = URLDecoder.decode(html, "UTF-8");

Now my web page displays correctly.
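A note on why this may work: encoding a string with URLEncoder and decoding it again with the same charset returns the original text, so the part of the snippet that plausibly makes the difference is the explicit "UTF-8" passed to `new String(...)`, compared with the question's `new String(outStream.toByteArray())`, which uses the platform default charset. The sketch below illustrates this with plain JDK calls. The class name `CharsetRoundTrip` and the sample HTML are assumptions for illustration, and it presumes the Transformer's output encoding was also set to UTF-8.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

class CharsetRoundTrip {

    /** Decodes bytes with the same charset the serializer was told to write. */
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** URL-encodes and immediately decodes a string using UTF-8. */
    static String urlRoundTrip(String s) {
        try {
            return URLDecoder.decode(URLEncoder.encode(s, "UTF-8"), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample: a bullet followed by a no-break space.
        String html = "<span>\u2022\u00A0Bullet1</span>";
        byte[] utf8 = html.getBytes(StandardCharsets.UTF_8);

        // Decoding with the matching charset restores the text.
        System.out.println(decodeUtf8(utf8).equals(html));
        // The URL round trip leaves the text unchanged.
        System.out.println(urlRoundTrip(html).equals(html));
        // Decoding with a different charset does not restore the text.
        System.out.println(new String(utf8, StandardCharsets.ISO_8859_1).equals(html));
    }
}
```

If that reading is right, `new String(outStream.toByteArray(), "UTF-8")` on its own should give the same result as the two lines in the answer.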