Error when extracting text from a PDF file (Java + PDFBox)

Time: 2018-04-01 18:36:08

Tags: java pdfbox

I want to extract text from a PDF file, and I am using PDFBox to do it. After adding the PDFBox dependency, I wrote the following code:

import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class Main {

    public static void main(String[] args) {
        PDFTextStripper pdfStripper = null;
        PDDocument pdDoc = null;
        COSDocument cosDoc = null;
        File file = new File("C:/Users/Ann/Desktop/example.pdf");
        try {


            PDFParser parser = new PDFParser(new FileInputStream(file)); // this line produces the error
            parser.parse();
            cosDoc = parser.getDocument();
            pdfStripper = new PDFTextStripper();
            pdDoc = new PDDocument(cosDoc);
            pdfStripper.setStartPage(1);
            pdfStripper.setEndPage(5);
            String parsedText = pdfStripper.getText(pdDoc);
            System.out.println(parsedText);
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

    }

}

But I get this error: Error:(22,46) java: incompatible types: java.io.FileInputStream cannot be converted to org.apache.pdfbox.io.RandomAccessRead.

Please help me solve this problem.

2 Answers:

Answer 0: (score: 1)

Try the following code:

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.IOException;

public class Main {
    public static void main(String[] args) throws IOException {
        File file = new File("D:/example.pdf");
        // PDDocument.load parses the file directly, so no PDFParser or COSDocument is needed
        try (PDDocument document = PDDocument.load(file)) {
            PDFTextStripper pdfTextStripper = new PDFTextStripper();
            // Page numbers are 1-based and the end page is inclusive
            pdfTextStripper.setStartPage(1);
            pdfTextStripper.setEndPage(5);
            String text = pdfTextStripper.getText(document);
            System.out.println(text);
        } // try-with-resources closes the document even if extraction fails
    }
}
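
For context, the compile error in the question says that the PDFParser constructor expects an org.apache.pdfbox.io.RandomAccessRead rather than a java.io.FileInputStream. If you would rather keep the PDFParser approach, a minimal sketch, assuming PDFBox 2.0.x is on the classpath, could wrap the file in PDFBox's own RandomAccessBufferedFileInputStream. The class name PdfParserExample and the method extractText are made up for illustration and are not part of PDFBox.

```java
import org.apache.pdfbox.io.RandomAccessBufferedFileInputStream;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.IOException;

// Hypothetical helper class, not part of PDFBox; assumes PDFBox 2.0.x.
class PdfParserExample {

    // Extracts the text of pages startPage..endPage (1-based, end page inclusive).
    static String extractText(File file, int startPage, int endPage) throws IOException {
        // The parser takes a RandomAccessRead, so the file is wrapped in a
        // PDFBox random-access stream instead of a plain FileInputStream.
        PDFParser parser = new PDFParser(new RandomAccessBufferedFileInputStream(file));
        parser.parse();
        // getPDDocument() returns the parsed document; try-with-resources closes it when done.
        try (PDDocument document = parser.getPDDocument()) {
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartPage(startPage);
            stripper.setEndPage(endPage);
            return stripper.getText(document);
        }
    }

    public static void main(String[] args) throws IOException {
        // Path taken from the question; adjust it to your own file.
        File file = new File("C:/Users/Ann/Desktop/example.pdf");
        System.out.println(extractText(file, 1, 5));
    }
}
```

In most cases PDDocument.load(file) as shown above is simpler, since it takes care of the parsing step for you.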

Answer 1: (score: 0)

Try the following:
