Question

The title may be a bit confusing. The simplest approach is to decide by file extension, like this:

// "is" is the InputStream of the file being read
if (filePath.endsWith("doc")) {
    WordExtractor ex = new WordExtractor(is);
    text = ex.getText();
    ex.close();
} else if(filePath.endsWith("docx")) {
    XWPFDocument doc = new XWPFDocument(is);
    XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
    text = extractor.getText();
    extractor.close();
}

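This works in most cases. But I have found that some files with the doc extension are really docx files: if you open one with WinRAR, you find XML files inside. As is well known, a docx file is a ZIP archive containing XML files. I believe this problem is far from rare, yet I could not find any information about it. Clearly, choosing between the doc and docx readers based on the extension is not reliable.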
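To illustrate why the content matters more than the extension: the two container formats start with different byte signatures, so a stream can be classified by peeking at its first bytes and rewinding. The following is only a sketch using the plain JDK, not POI's implementation; the class name `FormatSniffer` and its methods are hypothetical. Note that a ZIP signature only says "this is a ZIP archive" (it could also be an xlsx or an ordinary zip), and an OLE2 signature could also belong to an old Excel or PowerPoint file.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper: classifies a stream by its leading bytes without consuming them. */
final class FormatSniffer {

    public enum Kind { OLE2, ZIP, UNKNOWN }

    // OLE2 compound file signature (binary .doc, .xls, .ppt).
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // ZIP local file header signature "PK\3\4" (.docx, .xlsx, and any other ZIP archive).
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Wraps the stream if it cannot rewind; keep using the returned stream afterwards. */
    public static InputStream markable(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    /** Peeks at up to 8 bytes and rewinds; the stream must support mark/reset. */
    public static Kind sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        byte[] head = new byte[8];
        in.mark(head.length);
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) break; // stream shorter than 8 bytes
            n += r;
        }
        in.reset(); // leave the stream positioned at its start again
        if (startsWith(head, n, OLE2_MAGIC)) return Kind.OLE2;
        if (startsWith(head, n, ZIP_MAGIC)) return Kind.ZIP;
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, int len, int[] magic) {
        if (len < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) return false;
        }
        return true;
    }
}
```

A caller would do `InputStream in = FormatSniffer.markable(raw);`, call `FormatSniffer.sniff(in)`, and then pass the same `in` to whichever extractor matches.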

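In my case, I have to read a lot of files. I even read doc and docx files inside compressed archives: zip, 7z, or even rar. So I have to read the content through an InputStream rather than a File or anything else. That is why "how to know whether a file is .docx or .doc format from Apache POI" does not fit my ZipInputStream case at all.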
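For the archive scenario described here, one possible approach (a sketch, not the asker's code) is to copy each entry into memory first, so that detection and extraction each get a fresh, rewindable stream and closing it does not touch the outer archive stream. The class name `ArchiveEntries` is hypothetical, only ZIP is covered because the JDK has no built-in 7z or rar support, and buffering whole entries assumes the documents fit in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper: copies each ZIP entry into memory so it can be re-read. */
final class ArchiveEntries {

    /** Returns entry name -> entry bytes in archive order; directories are skipped. */
    public static Map<String, byte[]> readAll(InputStream archive) throws IOException {
        Map<String, byte[]> entries = new LinkedHashMap<>();
        // Not closed here: the caller owns the archive stream.
        ZipInputStream zin = new ZipInputStream(archive);
        byte[] buf = new byte[8192];
        ZipEntry entry;
        while ((entry = zin.getNextEntry()) != null) {
            if (entry.isDirectory()) continue;
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            int n;
            while ((n = zin.read(buf)) != -1) { // -1 marks the end of the current entry
                out.write(buf, 0, n);
            }
            entries.put(entry.getName(), out.toByteArray());
        }
        return entries;
    }

    /** Each call returns an independent stream that supports mark/reset. */
    public static InputStream open(byte[] entryBytes) {
        return new ByteArrayInputStream(entryBytes);
    }
}
```

Each value in the returned map can then be opened as often as needed, once to detect the format and once more to extract the text, regardless of what the entry was named.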

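What is the best way to tell whether a file is doc or docx? I want a solution that reads the content of a file that may be either doc or docx, not one that merely tells me which of the two it is. Clearly, the ZipInputStream approach is not a good fit for my case, and I don't think it suits others either. Why should I have to rely on an exception to tell whether a file is doc or docx?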

Answer 1

With the current stable Apache POI version, 3.17, you can use FileMagic. Internally, of course, this also has to look into the file.

Example:

import java.io.InputStream;
import java.io.FileInputStream;
import java.io.BufferedInputStream;

import org.apache.poi.poifs.filesystem.FileMagic;

import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class ReadWord {

 static String read(InputStream is) throws Exception {

  // FileMagic.valueOf peeks at the header bytes and rewinds, so the stream
  // must support mark/reset (for example a BufferedInputStream).
  FileMagic magic = FileMagic.valueOf(is);
  System.out.println(magic);

  String text = "";

  if (magic == FileMagic.OLE2) { // binary Word format (*.doc)
   WordExtractor ex = new WordExtractor(is);
   text = ex.getText();
   ex.close();
  } else if (magic == FileMagic.OOXML) { // Office Open XML (*.docx)
   XWPFDocument doc = new XWPFDocument(is);
   XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
   text = extractor.getText();
   extractor.close();
  }

  return text;

 }

 public static void main(String[] args) throws Exception {

  InputStream is = new BufferedInputStream(new FileInputStream("ExampleOLE.doc")); //really a binary OLE2 Word file
  System.out.println(read(is));
  is.close();

  is = new BufferedInputStream(new FileInputStream("ExampleOOXML.doc")); //an OOXML Word file named *.doc
  System.out.println(read(is));
  is.close();

  is = new BufferedInputStream(new FileInputStream("ExampleOOXML.docx")); //really an OOXML Word file
  System.out.println(read(is));
  is.close();

 }
}

Answer 2

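import java.io.File;
import java.util.zip.ZipException;
import java.util.zip.ZipFile;

// A .docx is a ZIP archive, so opening it as a ZipFile succeeds; a binary .doc is not.
// Assumes the enclosing method declares "throws IOException" for other I/O failures.
try (ZipFile zip = new ZipFile(new File("/Users/giang/Documents/a.doc"))) {
    System.out.println("this file is .docx");
} catch (ZipException e) {
    System.out.println("this file is not .docx");
    e.printStackTrace();
}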
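
The ZipFile check above needs a File on disk and treats any ZIP archive as a docx. As a hedged variation for the stream-only case, the sketch below scans the ZIP entries of an InputStream for the main Word part. The class name `DocxCheck` is hypothetical, and it assumes the conventional part name `word/document.xml`; strictly speaking, the package relationships could point to a differently named main part, so treat this as a heuristic.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper: checks whether ZIP content looks like a Word OOXML package. */
final class DocxCheck {

    /** Consumes the stream; returns true if an entry named word/document.xml is found. */
    public static boolean looksLikeDocx(InputStream in) throws IOException {
        ZipInputStream zin = new ZipInputStream(in);
        ZipEntry entry;
        while ((entry = zin.getNextEntry()) != null) {
            if ("word/document.xml".equals(entry.getName())) {
                return true;
            }
        }
        // Either no ZIP entries could be read, or the main Word part is missing.
        return false;
    }
}
```

Because the check consumes the stream, it is meant to run on a copy of the data (for example a ByteArrayInputStream over buffered bytes), with a second stream opened for the actual extraction.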

How to tell whether a file is doc or docx in POI
