Question

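I am having a problem with Apache POI and file MIME types. I use a file template (a Microsoft Word DOCX) and modify some values in it with Apache POI. The original file has the MIME type "application/vnd.openxmlformats-officedocument.wordprocessingml.document" (on Linux: file -i {filename}), but after I process the file with POI and save it, I get "application/octet-stream" instead. I would like to keep the original MIME type of the file.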

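I opened both files in a hex editor: the original and the modified file have the same magic number (50 4B 03 04), but the file sizes differ even though the text is the same. Is it possible to fix this? Has anyone had the same problem? I also checked with LibreOffice and found that it shows the same behavior as Apache POI.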
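For reference, the bytes 50 4B 03 04 are the ZIP local file header signature, so matching magic numbers only show that both files are ZIP archives; they say nothing about how the entries inside are arranged. A minimal sketch of that check, assuming Java 9 or later (the class name ZipSignatureCheck and the default file name are illustrative, not from the question):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

class ZipSignatureCheck {
    // ZIP local file header signature: the bytes "PK" followed by 0x03 0x04.
    private static final byte[] ZIP_SIGNATURE = {0x50, 0x4B, 0x03, 0x04};

    // Returns true if the given bytes start with the ZIP local file header signature.
    static boolean hasZipSignature(byte[] header) {
        if (header == null || header.length < ZIP_SIGNATURE.length) {
            return false;
        }
        for (int i = 0; i < ZIP_SIGNATURE.length; i++) {
            if (header[i] != ZIP_SIGNATURE[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default file name; pass a real path as the first argument.
        String fileName = args.length > 0 ? args[0] : "NewDocument.docx";
        byte[] header = new byte[4];
        try (InputStream in = Files.newInputStream(Paths.get(fileName))) {
            // readNBytes (Java 9+) reads up to 4 bytes; unread positions stay zero.
            in.readNBytes(header, 0, header.length);
        }
        System.out.println(fileName + " starts with ZIP signature: " + hasZipSignature(header));
    }
}
```

Both the original and the POI-written file would be expected to pass this check, which is consistent with what the hex editor showed.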

Any help or information would be appreciated.

Answer 1

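As you already mentioned in your comment, the way Apache POI rearranges the Office Open XML ZIP package causes some tools to misinterpret the content type. An Office Open XML file (*.docx, *.xlsx, *.pptx) is a ZIP archive, but the way Microsoft Office packages that archive must be special in some way. I have not found out exactly how.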
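One thing that might be worth comparing (an assumption on my part, since the exact difference has not been identified here) is the physical order of the entries inside the two archives. The following sketch lists entry names in the order their local headers appear in the file; the class name ZipEntryOrder and the default file names are illustrative only:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

class ZipEntryOrder {
    // Reads the archive sequentially, so names come back in physical (on-disk) order.
    static List<String> entryNames(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical defaults; pass the files to compare as arguments.
        String[] files = args.length > 0 ? args : new String[] {"Document.docx", "NewDocument.docx"};
        for (String file : files) {
            try (InputStream in = Files.newInputStream(Paths.get(file))) {
                System.out.println(file + ":");
                for (String name : entryNames(in)) {
                    System.out.println("  " + name);
                }
            }
        }
    }
}
```

ZipInputStream is used rather than ZipFile because it walks the local file headers from the start of the file, which is the part of the archive that a signature-based tool reading the first bytes would see.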

Example:

Start with a Document.docx that has some simple content and was saved by Microsoft Word.

For this file, file -i reports:

axel@arichter:~/Dokumente/JAVA/poi/poi-4.0.1$ file -i Document.docx
Document.docx: application/vnd.openxmlformats-officedocument.wordprocessingml.document; charset=binary

Now run this code:

import java.io.FileOutputStream;
import java.io.FileInputStream;

import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class WordReadAndReWrite {

    public static void main(String[] args) throws Exception {

        String inFilePath = "Document.docx";
        String outFilePath = "NewDocument.docx";

        XWPFDocument doc = new XWPFDocument(new FileInputStream(inFilePath));

        doc.createParagraph().createRun().setText("new text inserted");

        FileOutputStream out = new FileOutputStream(outFilePath);
        doc.write(out);
        out.close();
        doc.close();
    }
}

For the resulting NewDocument.docx, file -i reports:

axel@arichter:~/Dokumente/JAVA/poi/poi-4.0.1$ file -i NewDocument.docx
NewDocument.docx: application/octet-stream; charset=binary

But if we do the same thing without Apache POI's ZipPackage, and instead use a FileSystem to take the XML out of the Office Open XML ZIP package and put it back, with the following code:

import java.nio.file.Files;
import java.nio.file.FileSystems;
import java.nio.file.FileSystem;
import java.nio.file.Paths;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.DocumentBuilder;

import org.w3c.dom.Document;
import org.w3c.dom.Node;

import javax.xml.transform.TransformerFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.dom.DOMSource;

public class WordReadAndReWriteFileSystem {

    public static void main(String[] args) throws Exception {

        String inFilePath = "Document.docx";
        String outFilePath = "NewDocument.docx";

        // Open the DOCX as a ZIP file system and parse /word/document.xml.
        // The (ClassLoader) cast avoids an ambiguous overload on Java 13 and later.
        FileSystem fileSystem = FileSystems.newFileSystem(Paths.get(inFilePath), (ClassLoader) null);
        Path wordDocumentXml = fileSystem.getPath("/word/document.xml");
        DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document xmlDocument = documentBuilder.parse(Files.newInputStream(wordDocumentXml, StandardOpenOption.READ));

        // Build <w:p><w:r><w:t>new text inserted</w:t></w:r></w:p> and insert it before <w:sectPr>.
        Node p = xmlDocument.createElement("w:p");
        Node r = xmlDocument.createElement("w:r");
        p.appendChild(r);
        Node t = xmlDocument.createElement("w:t");
        r.appendChild(t);
        Node text = xmlDocument.createTextNode("new text inserted");
        t.appendChild(text);
        Node body = xmlDocument.getElementsByTagName("w:body").item(0);
        Node sectPr = xmlDocument.getElementsByTagName("w:sectPr").item(0);
        body.insertBefore(p, sectPr);

        // Serialize the modified XML to a temporary file.
        TransformerFactory transformerFactory = TransformerFactory.newInstance();
        Transformer transformer = transformerFactory.newTransformer();
        DOMSource domSource = new DOMSource(xmlDocument);
        Path tmpDoc = Files.createTempFile("wordDocument", "tmp");
        tmpDoc.toFile().deleteOnExit();
        StreamResult streamResult = new StreamResult(Files.newOutputStream(tmpDoc, StandardOpenOption.WRITE));
        transformer.transform(domSource, streamResult);

        fileSystem.close();

        // Copy the original package and replace only /word/document.xml inside the copy.
        Path tmpZip = Files.createTempFile("zipDocument", "tmp");
        tmpZip.toFile().deleteOnExit();
        Path path = Files.copy(Paths.get(inFilePath), tmpZip, StandardCopyOption.REPLACE_EXISTING);
        fileSystem = FileSystems.newFileSystem(path, (ClassLoader) null);
        wordDocumentXml = fileSystem.getPath("/word/document.xml");
        Files.copy(tmpDoc, wordDocumentXml, StandardCopyOption.REPLACE_EXISTING);
        fileSystem.close();

        Files.copy(tmpZip, Paths.get(outFilePath), StandardCopyOption.REPLACE_EXISTING);
    }
}

then for the resulting NewDocument.docx, file -i reports:

axel@arichter:~/Dokumente/JAVA/poi/poi-4.0.1$ file -i NewDocument.docx
NewDocument.docx: application/vnd.openxmlformats-officedocument.wordprocessingml.document; charset=binary

Answer 2

This code shows the correct MIME type for all the files I tested:

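import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;

public class Main {

    public static void main(String[] args) {
        String fileName = "model_libreoffice.docx";
//        String fileName = "model_poi.docx";
//        String fileName = "model_msoffice.docx";
//        String fileName = "model_repacked_bz2.docx";

        // Load the file from the classpath and let Tika detect the type
        // from the content, using the file name as an additional hint.
        try (InputStream is = Main.class.getResourceAsStream("/" + fileName)) {
            Tika t = new Tika();
            String mime = t.detect(is, fileName);
            System.out.println("----> " + mime);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}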

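After a lot of debugging and testing, I consider this a problem with third-party validation of the file. The simple code above shows me the correct MIME type for every file I tried, whether it was modified by Microsoft Office, LibreOffice, or Apache POI, or unzipped and zipped again (and renamed to DOCX).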
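If adding Tika is not an option, a rough check that does not depend on how the archive was packed could look inside the package instead of at its first bytes. This is only a sketch under assumptions: it treats a file as DOCX when its [Content_Types].xml part mentions the WordprocessingML main document content type, it needs Java 9 or later, and the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

class DocxContentTypeCheck {
    // Content type that [Content_Types].xml assigns to the main part of a regular DOCX.
    static final String WORD_MAIN_PART =
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml";

    // Looks the part up by name, so the position of the entry in the archive does not matter.
    static boolean looksLikeDocx(Path file) throws IOException {
        try (ZipFile zip = new ZipFile(file.toFile())) {
            ZipEntry contentTypes = zip.getEntry("[Content_Types].xml");
            if (contentTypes == null) {
                return false;
            }
            try (InputStream in = zip.getInputStream(contentTypes)) {
                String xml = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                return xml.contains(WORD_MAIN_PART);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default; pass a real path as the first argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "NewDocument.docx");
        System.out.println(file + " looks like DOCX: " + looksLikeDocx(file));
    }
}
```

A plain substring match is deliberately crude. Templates and macro-enabled documents use other content types, so this would need extending for anything beyond ordinary .docx files.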

So I think this question can be marked as solved.

Apache POI changes the file MIME type. Is it possible to fix this?
