Question

I am using the iText PDF library to convert a PDF to text.

Below is my Java code for converting a PDF to a text file.

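import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

public class PdfConverter {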

/** The original PDF that will be parsed. */
public static final String pdfFileName = "jdbc_tutorial.pdf";
/** The resulting text file. */
public static final String RESULT = "preface.txt";

/**
 * Parses a PDF to a plain text file.
 * @param pdf the original PDF
 * @param txt the resulting text
 * @throws IOException
 */
public void parsePdf(String pdf, String txt) throws IOException {
    PdfReader reader = new PdfReader(pdf);
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    PrintWriter out = new PrintWriter(new FileOutputStream(txt));

    TextExtractionStrategy strategy;
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
        out.println(strategy.getResultantText());
        System.out.println(strategy.getResultantText());
    }
    out.flush();
    out.close();
    reader.close();
}

/**
 * Main method.
 * @param    args    no arguments needed
 * @throws IOException
 */
public static void main(String[] args) throws IOException {
    new PdfConverter().parsePdf(pdfFileName, RESULT);
}
}

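The code above works for extracting the PDF to text. However, my requirement is to ignore the headers and footers and extract only the content from the PDF file.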

Answer 1

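Since your PDF has headers and footers, they may be marked as artifacts (if they are not, they are just text or content placed at the position of a header or footer). If they are marked as artifacts, you can extract them using ParseTaggedPdf. If ParseTaggedPdf does not work, you can also use ExtractPageContentArea. Have a look at some of the examples related to these.

The solution above is generic and depends on the file. If you really need an alternative, you can use Apache APIs such as PdfBox and Tika, or others such as PDFTextStream. Note that the alternative I give below will not help if you have to stick with iText and cannot move to another library. In PdfBox, you can use PDFTextStripperByArea or PDFTextStripper. Check the JavaDoc or some examples if you need to know how to use them.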

Answer 2

Using iText, I found an example on this site: http://what-when-how.com/itext-5/parsing-pdfs-part-2-itext-5/

It creates a rectangle that defines the bounds of the text to extract.

// iText 5 classes: Rectangle is in com.itextpdf.text, PdfReader in com.itextpdf.text.pdf,
// and the extraction classes are in com.itextpdf.text.pdf.parser.
PdfReader reader = new PdfReader(pdf);
PrintWriter out = new PrintWriter(new FileOutputStream(txt));
// Create the rectangle (llx, lly, urx, ury) that bounds the text to keep
Rectangle rect = new Rectangle(70, 80, 420, 500);
// Create a filter based on the rectangle
RenderFilter filter = new RegionTextRenderFilter(rect);
TextExtractionStrategy strategy;
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    // Apply the filter to the text extraction strategy
    strategy = new FilteredTextRenderListener(
        new LocationTextExtractionStrategy(), filter);
    out.println(PdfTextExtractor.getTextFromPage(reader, i, strategy));
}
out.flush();
out.close();
reader.close();

According to that page, this should work even if the PDF is not tagged.
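The rectangle in the snippet above is hard-coded. As a rough sketch, its bounds could instead be derived from the page size and the heights reserved for the header and footer. The class name `ContentArea`, the method `bodyBounds`, and the band heights are hypothetical, and the sketch assumes the page's lower-left corner is at (0, 0) with y increasing upward, which is the usual PDF default but is not guaranteed for every file:

```java
import java.util.Arrays;

/** Hypothetical helper: computes the body region of a page, excluding header and footer bands. */
final class ContentArea {

    private ContentArea() { }

    /**
     * Returns {llx, lly, urx, ury} for the area between the footer and the header.
     * Assumes the page origin is the lower-left corner and y grows upward.
     */
    static float[] bodyBounds(float pageWidth, float pageHeight,
                              float headerHeight, float footerHeight) {
        if (headerHeight < 0 || footerHeight < 0
                || headerHeight + footerHeight >= pageHeight) {
            throw new IllegalArgumentException("header and footer must fit inside the page");
        }
        float llx = 0f;                         // left edge of the page
        float lly = footerHeight;               // body starts above the footer band
        float urx = pageWidth;                  // right edge of the page
        float ury = pageHeight - headerHeight;  // body ends below the header band
        return new float[] {llx, lly, urx, ury};
    }

    public static void main(String[] args) {
        // An A4 page is 595 x 842 points; the 60 pt and 50 pt bands are illustrative guesses.
        float[] bounds = bodyBounds(595f, 842f, 60f, 50f);
        System.out.println(Arrays.toString(bounds));
    }
}
```

The four values map onto the `Rectangle(llx, lly, urx, ury)` constructor used above. With iText 5, the page dimensions could be read per page from `reader.getPageSize(i)` rather than assumed; the header and footer heights still have to be measured for the specific document.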

Answer 3

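You can read specific locations in a PDF file. Just mark the areas you need text from and leave out the areas where the header and footer appear. I have done this, and the complete code is in this question: itext reading specific location from pdf file runs in intellij and gives desired output but executable jar throws error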
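To illustrate the idea of keeping only text that falls inside a marked area, here is a simplified, library-free model. The `Chunk` and `RegionFilter` classes and the sample coordinates are hypothetical; this is not how iText implements its filters, which work on the library's own render information rather than on a class like this:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a piece of positioned text on a page. */
final class Chunk {
    final String text;
    final float x;
    final float y;

    Chunk(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

/** Keeps only chunks whose position lies inside a rectangular region. */
final class RegionFilter {
    private final float llx;
    private final float lly;
    private final float urx;
    private final float ury;

    RegionFilter(float llx, float lly, float urx, float ury) {
        this.llx = llx;
        this.lly = lly;
        this.urx = urx;
        this.ury = ury;
    }

    /** True when the chunk's position is inside the region (edges included). */
    boolean accepts(Chunk c) {
        return c.x >= llx && c.x <= urx && c.y >= lly && c.y <= ury;
    }

    /** Returns the text of every accepted chunk, in the original order. */
    List<String> filter(List<Chunk> chunks) {
        List<String> kept = new ArrayList<>();
        for (Chunk c : chunks) {
            if (accepts(c)) {
                kept.add(c.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Chunk> page = new ArrayList<>();
        page.add(new Chunk("Chapter 1 header", 72f, 800f)); // inside the header band
        page.add(new Chunk("Body text", 72f, 400f));        // inside the body
        page.add(new Chunk("Page 3", 280f, 30f));           // inside the footer band
        RegionFilter body = new RegionFilter(0f, 50f, 595f, 782f);
        System.out.println(body.filter(page));
    }
}
```

Only the chunk positioned between the footer and header bands survives the filter. A real extraction would also need to handle text that straddles the region boundary, which this model ignores.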

How to remove headers and footers from a PDF file using iText in Java
