Question

Hello, I am trying to read text from .doc and .docx files. For .doc files, this is what I am doing:

package test;
import java.io.File;
import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

public class ReadFile {
public static void main(String[] args) {
        File file = null;
        WordExtractor extractor = null;
        try {

            file = new File("C:\\Users\\rijo\\Downloads\\r.doc");
            FileInputStream fis = new FileInputStream(file.getAbsolutePath());
            HWPFDocument document = new HWPFDocument(fis);
            extractor = new WordExtractor(document);
            String fileData = extractor.getText();
            System.out.println(fileData);
        } catch (Exception exep) {
            // Print the failure instead of silently swallowing it.
            exep.printStackTrace();
        }
    }
}

But this gives me an org/apache/poi/OldFileFormatException exception.

Any idea how to solve this?

I also need to read .docx and PDF files. Is there a good way to read all of these file types?

Answer 1

I used the following jars (in case the version numbers matter here):

dom4j-1.7-20060614
poi-3.9-20121203
poi-ooxml-3.9-20121203
poi-ooxml-schemas-3.9-20121203
poi-scratchpad-3.9-20121203
xmlbeans-2.4.0

I wrote this:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class SO {
    public static void main(String[] args) {
        // Alternate between the two to check what works.
        //String filePath = "D:\\Users\\username\\Desktop\\Doc1.docx";
        String filePath = "D:\\Users\\username\\Desktop\\Bob.doc";

        // try-with-resources closes the stream even if extraction fails.
        try (FileInputStream fis = new FileInputStream(new File(filePath))) {
            if (filePath.toLowerCase().endsWith(".docx")) { // is a docx
                XWPFDocument doc = new XWPFDocument(fis);
                XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
                System.out.println(extractor.getText());
            } else { // is not a docx
                HWPFDocument doc = new HWPFDocument(fis);
                WordExtractor extractor = new WordExtractor(doc);
                System.out.println(extractor.getText());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

This lets me read text from both .docx and .doc files. If it does not work on your PC, the problem is probably with how the external jars are set up.
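One way to check the jar setup is to ask the JVM whether the relevant classes can be loaded at all. The slash-separated name in the question (org/apache/poi/OldFileFormatException) is how a NoClassDefFoundError usually prints a class name, so it may be that the main poi jar is missing while poi-scratchpad is present; that is a guess about the asker's setup, not something the question confirms. The sketch below uses only the JDK, and the class name ClasspathCheck and the method isOnClasspath are made up for illustration.

```java
// Hypothetical helper (not part of POI): reports whether the classes the
// question relies on can be loaded from the current classpath.
final class ClasspathCheck {

    // Returns true if the named class can be located, without initializing it.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] required = {
            "org.apache.poi.OldFileFormatException",      // shipped in the main poi jar
            "org.apache.poi.hwpf.HWPFDocument",           // shipped in poi-scratchpad
            "org.apache.poi.xwpf.usermodel.XWPFDocument"  // shipped in poi-ooxml
        };
        for (String name : required) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

Run it with the same classpath as the failing program. Any line marked MISSING points at a jar that still needs to be added.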

Give it a try :) Good luck!

Answer 2

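If you look at the javadoc for OldFileFormatException, you can see the reason:

Base class of all the exceptions that POI throws in the event that it's given a file that's older than currently supported.

This means that the r.doc you are using is not supported by HWPFDocument. It probably only supports the more recent formats (doc has been around for a long time now). I am not sure whether Apache POI supports the docx format in HWPFDocument.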

Answer 3

I don't know why you use WordExtractor just to get the text from a .doc file. For me, a single method call is enough:

import org.apache.poi.hwpf.HWPFDocument;
...
File fin = new File(yourFilePath);
FileInputStream fis = new FileInputStream(fin);
HWPFDocument doc = new HWPFDocument(fis);
String text = doc.getDocumentText();
System.out.println(text);
...

For .pdf files, use another Apache library: PDFBox.
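Since the question asks about handling several file types, one option is to look at the first bytes of the file instead of trusting the extension, and then hand the file to the matching library. The following is a minimal sketch using only the JDK. The names FormatSniffer, Kind, classify and sniff are invented for this example. The result only narrows the choice: the OLE2 container is also used by .xls and .ppt, and any ZIP archive (including .xlsx) starts with the same bytes as a .docx.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: guesses the container format from the leading bytes.
final class FormatSniffer {

    enum Kind { OLE2, ZIP, PDF, UNKNOWN }

    // OLE2 compound document header (used by .doc, but also .xls and .ppt).
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // ZIP local file header "PK\3\4" (used by .docx, but also any other ZIP).
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};
    // "%PDF"
    private static final int[] PDF_MAGIC = {0x25, 0x50, 0x44, 0x46};

    // Classifies a buffer holding the first bytes of a file.
    static Kind classify(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) return Kind.OLE2; // try HWPFDocument
        if (startsWith(head, ZIP_MAGIC)) return Kind.ZIP;   // try XWPFDocument
        if (startsWith(head, PDF_MAGIC)) return Kind.PDF;   // try PDFBox
        return Kind.UNKNOWN;
    }

    // Reads up to 8 bytes from the stream and classifies them.
    static Kind sniff(InputStream in) throws IOException {
        byte[] head = new byte[8];
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) break; // end of stream before 8 bytes were read
            n += r;
        }
        byte[] actual = new byte[n];
        System.arraycopy(head, 0, actual, 0, n);
        return classify(actual);
    }

    private static boolean startsWith(byte[] data, int[] magic) {
        if (data.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // In-memory sample; with a real file, pass a FileInputStream to sniff().
        byte[] sample = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(new ByteArrayInputStream(sample)));
    }
}
```

Note that sniff() consumes the bytes it reads, so open a fresh stream on the file before passing it to HWPFDocument, XWPFDocument or PDFBox.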

Reading a .doc file in Java using POI
