Question

I need to extract data from some PDF documents (using Java). I need to know what the easiest way to do it would be.

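I tried iText. It is fairly complicated for my needs. Besides, I guess it is not freely available for commercial projects, so it is not an option. I also tried PDFBox and ran into various NoClassDefFoundError errors.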

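I googled and came across several other options, such as PDF Clown and jPod, but I do not have time to experiment with all of these libraries. I am relying on the community's experience with reading PDFs in Java.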

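Note that I do not need to create or manipulate PDF documents. I just need to extract textual data from PDF documents with a moderate level of layout complexity.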

Please suggest the quickest and easiest way to extract text from PDF documents. Thanks.

Answer 1

I recommend trying Apache Tika. Apache Tika is basically a toolkit that extracts data from many types of documents, including PDFs.

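The benefits of Tika (besides being free) are that it used to be a subproject of Apache Lucene, which is a very robust open-source search engine. Tika includes a built-in PDF parser that uses a SAX content handler to pass PDF data to your application. It can also extract data from encrypted PDFs, and it lets you create a parser or subclass an existing one to customize the behavior.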

The code is simple. To extract data from a PDF, all you need to do is create a parser class that implements the Parser interface and define a parse() method:

public void parse(
   InputStream stream, ContentHandler handler,
   Metadata metadata, ParseContext context)
   throws IOException, SAXException, TikaException {

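   // HELLO_MIME_TYPE is a constant declared elsewhere in the parser class
   metadata.set(Metadata.CONTENT_TYPE, HELLO_MIME_TYPE);
   metadata.set("Hello", "World");

   // Emit an empty XHTML document to the caller's handler
   XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
   xhtml.startDocument();
   xhtml.endDocument();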
}

Then, to run the parser, you can do something like this:

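// Needs org.apache.tika.parser.pdf.PDFParser, org.apache.tika.parser.ParseContext,
// org.apache.tika.metadata.Metadata and org.apache.tika.sax.BodyContentHandler.
try (InputStream input = new FileInputStream(new File(resourceLocation))) {
    ContentHandler textHandler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    PDFParser parser = new PDFParser();
    // The four-argument parse() is the current form; the three-argument overload was deprecated.
    parser.parse(input, textHandler, metadata, new ParseContext());
    // Metadata key names may differ between Tika versions (for example "dc:title").
    System.out.println("Title: " + metadata.get("title"));
    System.out.println("Author: " + metadata.get("Author"));
    System.out.println("content: " + textHandler.toString());
}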
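
To make the SAX part less abstract: Tika's parsers report the document as XHTML SAX events, and a content handler such as the textHandler above collects the text from them. The sketch below shows that mechanism using only the JDK's SAX classes. TextCollector is a hypothetical class written for illustration, and the small XHTML string stands in for the events a PDF parser would emit, so treat it as an illustration of the idea rather than Tika's actual implementation.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: gathers character data and ends each paragraph with a newline.
public class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several chunks; append each one as it comes.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // The default JDK parser is not namespace-aware, so the tag name is in qName.
        if ("p".equals(qName)) {
            text.append('\n');
        }
    }

    public String getText() {
        return text.toString();
    }

    // Feeds a stand-in XHTML string through the JDK SAX parser (assumption: sample input only).
    public static String extract(String xhtml) {
        try {
            TextCollector handler = new TextCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.getText();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse sample XHTML", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(extract("<html><body><p>Hello</p><p>World</p></body></html>"));
    }
}
```

With Tika on the classpath, a handler like this could be passed to parser.parse(...) in place of BodyContentHandler when you want control over how the text is assembled.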

Answer 2

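I am using JPedal and I am really happy with the results. It is not free, but it is high quality, and the output for image generation from PDFs and for text extraction is really good.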

And since it is a paid library, support is always there when you need it.

Answer 3

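I have used PDFBox to extract text for Lucene indexing without too many issues. Its error/warning logging is quite verbose, if I remember right. What is the cause of the errors you are getting?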
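
A NoClassDefFoundError usually means that a JAR the library depends on is missing from the runtime classpath. A small check like the one below can narrow that down. It is only a sketch: it assumes PDFBox 2.x, whose core JAR needs fontbox and commons-logging at runtime, and ClasspathCheck is a hypothetical helper, not part of PDFBox.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical diagnostic: reports which of the expected classes can be loaded.
public class ClasspathCheck {

    /** Returns true if the named class can be located by this class's class loader. */
    public static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // One representative class per JAR (assumption: PDFBox 2.x dependency set)
        Map<String, String> required = new LinkedHashMap<>();
        required.put("org.apache.pdfbox.pdmodel.PDDocument", "pdfbox");
        required.put("org.apache.fontbox.ttf.TrueTypeFont", "fontbox");
        required.put("org.apache.commons.logging.LogFactory", "commons-logging");

        for (Map.Entry<String, String> entry : required.entrySet()) {
            String status = isOnClasspath(entry.getKey()) ? "found" : "MISSING";
            System.out.println(entry.getValue() + ": " + status);
        }
    }
}
```

Run it with the same classpath as the failing program; any JAR reported as MISSING is a likely cause of the error.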

Answer 4

I understand this post is pretty old, but I would recommend using itext from here: http://sourceforge.net/projects/itext/ If you are using Maven, you can pull the jars in from Maven Central: http://mvnrepository.com/artifact/com.itextpdf/itextpdf

I can't see how using it could be difficult:

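    // iText 5: com.itextpdf.text.pdf.PdfReader, com.itextpdf.text.pdf.parser.PdfTextExtractor
    PdfReader pdf = new PdfReader("path to your pdf file");
    // getTextFromPage is a static method in iText 5; page numbers start at 1
    String output = PdfTextExtractor.getTextFromPage(pdf, pageNumber);
    pdf.close();
    assert output.contains("whatever you want to validate on that page");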

Answer 5

Import these classes and add the JAR file pdfbox-app-2.0.

   import org.openqa.selenium.WebDriver;
   import org.openqa.selenium.WebElement;
   import org.openqa.selenium.support.FindBy;
   import org.testng.Assert;
   import org.testng.annotations.Test;

   import java.io.File;
   import java.io.IOException;
   import java.text.ParseException;
   import java.util.List;

   import org.apache.log4j.Logger;
   import org.apache.log4j.PropertyConfigurator;
   import org.apache.pdfbox.pdmodel.PDDocument;
   import org.apache.pdfbox.text.PDFTextStripper;
   import org.openqa.selenium.By;
   import org.openqa.selenium.chrome.ChromeDriver;


   import com.coencorp.selenium.framework.BasePage;
   import com.coencorp.selenium.framework.ExcelReadWrite;
   import com.relevantcodes.extentreports.LogStatus;

Add this code to the class.

   public void showList() throws InterruptedException, IOException {

   showInspectionsLink.click();
   waitForElement(hideInspectionsLink);
   printButton.click();
   Thread.sleep(10000);
   String downloadPath = "C:\\Users\\Updoer\\Downloads";
   File getLatestFile = getLatestFilefromDir(downloadPath);
   String fileName = getLatestFile.getName();
   Assert.assertTrue(fileName.equals("Inspections.pdf"), "Downloaded file name is not matching with expected file name");
   Thread.sleep(10000);
   //testVerifyPDFInURL();
   // PDDocument.load(File) is the PDFBox 2.x API
   PDDocument pd = PDDocument.load(new File("C:\\Users\\Updoer\\Downloads\\Inspections.pdf"));
   System.out.println("Total Pages:" + pd.getNumberOfPages());
   PDFTextStripper pdf = new PDFTextStripper();
   System.out.println(pdf.getText(pd));
   pd.close();
   }

Add these methods to the same class.

   public void testVerifyPDFInURL() {
   WebDriver driver = new ChromeDriver();
   driver.get("C:\\Users\\Updoer\\Downloads\\Inspections.pdf");
   driver.findElement(By.linkText("Adeeb Khan")).click();
   String getURL = driver.getCurrentUrl();
   Assert.assertTrue(getURL.contains(".pdf"));
   }

   private File getLatestFilefromDir(String dirPath){
   File dir = new File(dirPath);
   File[] files = dir.listFiles();
   if (files == null || files.length == 0) {
        return null;
   }

   File lastModifiedFile = files[0];
   for (int i = 1; i < files.length; i++) {
   if (lastModifiedFile.lastModified() < files[i].lastModified()) {
   lastModifiedFile = files[i];
   }
   }
   return lastModifiedFile;
   }
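
The fixed Thread.sleep(10000) calls in showList() can make the test slow when the download is quick and flaky when it is not. One possible alternative, using only the JDK, is to poll for the file with a timeout. This is a sketch: DownloadWait and waitUntil are hypothetical names, and the timeout and polling interval are illustrative values, not something from the code above.

```java
import java.io.File;
import java.util.function.BooleanSupplier;

// Hypothetical helper: polls a condition until it holds or a timeout expires.
public class DownloadWait {

    /** Returns true as soon as the condition holds, false on timeout or interruption. */
    public static boolean waitUntil(BooleanSupplier condition, long timeoutMillis, long pollMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("Usage: java DownloadWait <path-to-downloaded-file>");
            return;
        }
        File pdf = new File(args[0]);
        // Wait up to 30 seconds for a non-empty file, checking twice per second
        boolean ready = waitUntil(() -> pdf.isFile() && pdf.length() > 0, 30_000, 500);
        System.out.println(ready ? "Download finished" : "Timed out waiting for " + pdf);
    }
}
```

In showList(), a call such as waitUntil(() -> new File(downloadPath, "Inspections.pdf").length() > 0, 30_000, 500) could replace each sleep before the file is read.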

What is the easiest way to extract data from a PDF?
