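I want to extract text from a PDF file, and I am using PDFBox for this. I added the PDFBox dependency to my project; here is my code for extracting text from a PDF: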
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class Main {
    public static void main(String[] args) {
        PDFTextStripper pdfStripper = null;
        PDDocument pdDoc = null;
        COSDocument cosDoc = null;
        File file = new File("C:/Users/Ann/Desktop/example.pdf");
        try {
            PDFParser parser = new PDFParser(new FileInputStream(file)); // the error is reported on this line
            parser.parse();
            cosDoc = parser.getDocument();
            pdfStripper = new PDFTextStripper();
            pdDoc = new PDDocument(cosDoc);
            pdfStripper.setStartPage(1);
            pdfStripper.setEndPage(5);
            String parsedText = pdfStripper.getText(pdDoc);
            System.out.println(parsedText);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
However, I get this error: Error:(22, 46) java: incompatible types: java.io.FileInputStream cannot be converted to org.apache.pdfbox.io.RandomAccessRead
Please help me solve this problem.
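Answer 0 (score: 1)
Try the following code. In PDFBox 2.x the PDFParser constructor expects a RandomAccessRead rather than an InputStream, which is what the compiler is complaining about. PDDocument.load(File) does the parsing for you, so the parser is not needed at all. (In PDFBox 3.x, PDDocument.load was replaced by Loader.loadPDF.)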
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.IOException;

public class Main {
    public static void main(String[] args) throws IOException {
        File file = new File("D:/example.pdf");
        // PDDocument.load parses the file itself, so no PDFParser is needed.
        // try-with-resources closes the document even if extraction fails.
        try (PDDocument document = PDDocument.load(file)) {
            PDFTextStripper pdfTextStripper = new PDFTextStripper();
            // Page numbers are 1-based and the range is inclusive.
            pdfTextStripper.setStartPage(1);
            pdfTextStripper.setEndPage(5);
            String text = pdfTextStripper.getText(document);
            System.out.println(text);
        }
    }
}
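As background on why the parser wants a random-access source at all: a PDF reader typically starts at the end of the file, where a startxref keyword records the byte offset of the cross-reference table, and then jumps around from there. That is presumably the reason PDFParser asks for a RandomAccessRead. The sketch below illustrates this kind of access using only the JDK's java.io.RandomAccessFile; it does not use PDFBox, and the class and method names (StartXrefSketch, readTail, parseStartXref) are hypothetical, chosen for this illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox: shows the kind of seeking a PDF reader does.
final class StartXrefSketch {

    // Reads up to the last 1024 bytes of the file by seeking, without reading the rest.
    static String readTail(File file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            int tailSize = (int) Math.min(1024L, raf.length());
            byte[] tail = new byte[tailSize];
            raf.seek(raf.length() - tailSize);
            raf.readFully(tail);
            // ISO-8859-1 maps each byte to one char, so binary content cannot break decoding.
            return new String(tail, StandardCharsets.ISO_8859_1);
        }
    }

    // Returns the offset written after the last "startxref" keyword, or -1 if absent.
    static long parseStartXref(String tail) {
        String keyword = "startxref";
        int idx = tail.lastIndexOf(keyword);
        if (idx < 0) {
            return -1;
        }
        int pos = idx + keyword.length();
        // Skip the whitespace (usually a line break) between the keyword and the number.
        while (pos < tail.length() && Character.isWhitespace(tail.charAt(pos))) {
            pos++;
        }
        int start = pos;
        while (pos < tail.length() && tail.charAt(pos) >= '0' && tail.charAt(pos) <= '9') {
            pos++;
        }
        return pos == start ? -1 : Long.parseLong(tail.substring(start, pos));
    }
}
```

Calling StartXrefSketch.parseStartXref(StartXrefSketch.readTail(new File("D:/example.pdf"))) would return the offset recorded in that file, if one is present. A FileInputStream only reads forward, so it cannot serve this access pattern directly, whereas PDDocument.load(File) sets up a suitable source internally.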
Answer 1 (score: 0)
Try the following: