Question

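I want to extract text from a PDF file, and I am using PDFBox to do it. I added the PDFBox dependency to my project and wrote the following code: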

import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class Main {

    public static void main(String[] args) {
        PDFTextStripper pdfStripper = null;
        PDDocument pdDoc = null;
        COSDocument cosDoc = null;
        File file = new File("C:/Users/Ann/Desktop/example.pdf");
        try {


            PDFParser parser = new PDFParser(new FileInputStream(file)); // this line produces the compile error
            parser.parse();
            cosDoc = parser.getDocument();
            pdfStripper = new PDFTextStripper();
            pdDoc = new PDDocument(cosDoc);
            pdfStripper.setStartPage(1);
            pdfStripper.setEndPage(5);
            String parsedText = pdfStripper.getText(pdDoc);
            System.out.println(parsedText);
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }

    }

}

But I get a compile error:

Error:(22, 46) java: incompatible types: java.io.FileInputStream cannot be converted to org.apache.pdfbox.io.RandomAccessRead

Please help me solve this problem.

Answer 1

Try the following code:

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.IOException;

public class Main {
    public static void main(String[] args) throws IOException {
        File file = new File("D:/example.pdf");
        // PDDocument.load(File) is the PDFBox 2.x entry point; it accepts a File directly,
        // so no FileInputStream or PDFParser is needed.
        // try-with-resources closes the document even if extraction throws.
        try (PDDocument document = PDDocument.load(file)) {
            PDFTextStripper pdfTextStripper = new PDFTextStripper();
            pdfTextStripper.setStartPage(1);
            pdfTextStripper.setEndPage(5);
            String text = pdfTextStripper.getText(document);
            System.out.println(text);
        }
    }
}
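
If you would rather keep the PDFParser approach from the question: the compiler message says the constructor expects an org.apache.pdfbox.io.RandomAccessRead, not a java.io.InputStream. The sketch below is one possible way to satisfy that, assuming PDFBox 2.0.x is on the classpath, where RandomAccessBufferedFileInputStream implements RandomAccessRead. The class name PdfTextExtractor and the helper extractText are made up for illustration, and these calls differ in PDFBox 3.x.

```java
import org.apache.pdfbox.io.RandomAccessBufferedFileInputStream;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.IOException;

// Hypothetical helper class for illustration; assumes PDFBox 2.0.x.
public class PdfTextExtractor {

    // Extracts the text of pages startPage..endPage (1-based, inclusive).
    public static String extractText(File file, int startPage, int endPage) throws IOException {
        // Wrap the file in a RandomAccessRead instead of a plain FileInputStream.
        PDFParser parser = new PDFParser(new RandomAccessBufferedFileInputStream(file));
        parser.parse();
        // Closing the PDDocument also releases the underlying file source.
        try (PDDocument document = parser.getPDDocument()) {
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartPage(startPage);
            stripper.setEndPage(endPage);
            return stripper.getText(document);
        }
    }

    public static void main(String[] args) throws IOException {
        // The default path is a placeholder; pass a real PDF path as the first argument.
        File file = new File(args.length > 0 ? args[0] : "example.pdf");
        System.out.println(extractText(file, 1, 5));
    }
}
```

PDDocument.load(file) from the answer above is the shorter route in 2.x, so this variant is probably only worth it if you need the parser object itself.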

Answer 2

Try the following:


Error extracting text from a PDF file (Java + PDFBox)
