Identifying article regions in a PDF

Time: 2018-04-19 20:29:50

Tags: pdf pdfbox

Which PDFBox APIs can I use to identify the rectangular region that contains an article, so that I can extract the article's text?

I am thinking of parsing the PDF content and treating each block of text that is enclosed by large areas of whitespace as a region.
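To make the whitespace idea concrete, here is a minimal sketch of one way it could work once the bounding boxes of the text on a page are known, however they were obtained. It projects the boxes onto the x-axis and starts a new column wherever the horizontal gap is at least `minGap` points wide. The class `ColumnFinder`, the method `findColumns`, and the `minGap` threshold are hypothetical names for illustration and are not part of PDFBox.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ColumnFinder {

    /**
     * Returns the x-ranges {start, end} of text columns. Boxes whose
     * horizontal distance is smaller than minGap end up in the same column.
     */
    static List<double[]> findColumns(List<Rectangle2D> textBoxes, double minGap) {
        // Sort a copy by left edge so the boxes can be swept left to right.
        List<Rectangle2D> sorted = new ArrayList<>(textBoxes);
        sorted.sort(Comparator.comparingDouble(Rectangle2D::getMinX));

        List<double[]> columns = new ArrayList<>();
        for (Rectangle2D box : sorted) {
            if (!columns.isEmpty()) {
                double[] last = columns.get(columns.size() - 1);
                if (box.getMinX() - last[1] < minGap) {
                    // No gutter wide enough: extend the current column.
                    last[1] = Math.max(last[1], box.getMaxX());
                    continue;
                }
            }
            // A gutter of at least minGap precedes this box: open a new column.
            columns.add(new double[] {box.getMinX(), box.getMaxX()});
        }
        return columns;
    }

    public static void main(String[] args) {
        List<Rectangle2D> boxes = new ArrayList<>();
        boxes.add(new Rectangle2D.Double(50, 100, 50, 10));
        boxes.add(new Rectangle2D.Double(90, 120, 160, 10));
        boxes.add(new Rectangle2D.Double(300, 100, 200, 10));
        for (double[] column : findColumns(boxes, 20)) {
            System.out.println(column[0] + " .. " + column[1]);
        }
    }
}
```

Running the same sweep on the y-axis inside each column would split the column into vertically separated blocks, and each resulting rectangle could then be passed to `PDFTextStripperByArea.addRegion`. Real layouts with headlines spanning several columns would need more than this simple projection.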

Here is code that extracts one region whose size and position are hard-coded:

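import java.awt.geom.Rectangle2D;
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripperByArea;

// Switch Java 2D to the older KCMS color management module (a PDFBox 2.x performance tip for Java 8).
System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");
try (PDDocument pdf = PDDocument.load(new File("SevenPropertiesofHighlySecureDevices.pdf"))) {

    PDFTextStripperByArea stripper = new PDFTextStripperByArea();
    stripper.setSortByPosition(true);

    // Coordinates are in points: 1 pt is equal to 1/72 inch
    Rectangle2D rectangle = new Rectangle2D.Double(0, 0, 200, 200);
    String regionName = "First Article Region";
    stripper.addRegion(regionName, rectangle);
    stripper.extractRegions(pdf.getPage(0)); // page indices are zero-based

    LOGGER.info("getTextForRegion: \n{}", stripper.getTextForRegion(regionName));
}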
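Since the region is specified in points, a small conversion helper can make hard-coded rectangles easier to read. The following is only a sketch; `PdfUnits` and its methods are hypothetical names, not PDFBox API.

```java
import java.awt.geom.Rectangle2D;

class PdfUnits {
    static final double POINTS_PER_INCH = 72.0;
    static final double MM_PER_INCH = 25.4;

    /** Converts inches to PDF points (1 pt = 1/72 inch). */
    static double inchesToPoints(double inches) {
        return inches * POINTS_PER_INCH;
    }

    /** Converts millimetres to PDF points. */
    static double mmToPoints(double mm) {
        return mm / MM_PER_INCH * POINTS_PER_INCH;
    }

    /** Builds a region rectangle from a position and size given in inches. */
    static Rectangle2D regionFromInches(double x, double y, double width, double height) {
        return new Rectangle2D.Double(
                inchesToPoints(x), inchesToPoints(y),
                inchesToPoints(width), inchesToPoints(height));
    }

    public static void main(String[] args) {
        Rectangle2D region = regionFromInches(0, 0, 2.5, 2.5);
        // prints "180.0 x 180.0 pt"
        System.out.println(region.getWidth() + " x " + region.getHeight() + " pt");
    }
}
```

By the same arithmetic, the 200 x 200 pt rectangle used above covers roughly 2.78 x 2.78 inches of the page.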

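If anyone is thinking of "scanning" the PDF to find where the whitespace areas are, so that this information can be used to determine the rectangles from which to extract articles, I would like to tell you that it is very slow and unlikely to be practical: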

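import java.awt.geom.Rectangle2D;
import java.io.File;
import org.apache.commons.lang3.StringUtils; // assuming Apache Commons Lang 3
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
import org.apache.pdfbox.text.PDFTextStripperByArea;

System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");
try (PDDocument pdf = PDDocument.load(new File("4.pdf"))) {
    PDPage page = pdf.getPage(1); // page indices are zero-based, so this is the second page

    PDRectangle cropBox = page.getCropBox();
    float docWidth = cropBox.getUpperRightX();
    float docHeight = cropBox.getUpperRightY();
    float rectWidth = 10;
    float rectHeight = 10;
    // Step by half a cell so that neighbouring probe rectangles overlap.
    float xStep = rectWidth / 2;
    float yStep = rectHeight / 2;

    String regionName = "docScannerRegion";
    String docLeftMarginIndicator = "|";
    String docRightMarginIndicator = "|";
    String nonTextArea = " ";
    String textArea = "-";

    PDFTextStripperByArea stripper = new PDFTextStripperByArea();

    // Probe the page cell by cell and print one character per cell.
    for (float y = 0; y < docHeight; y += yStep) {
        System.out.print(docLeftMarginIndicator);
        for (float x = 0; x < docWidth; x += xStep) {
            stripper.addRegion(regionName, new Rectangle2D.Double(x, y, rectWidth, rectHeight));
            // Every call processes the whole page again, which is why the scan is so slow.
            stripper.extractRegions(page);
            String txt = stripper.getTextForRegion(regionName);
            System.out.print(StringUtils.isBlank(txt) ? nonTextArea : textArea);
            // Unregister the probe so the stripper's region list does not grow on every iteration.
            stripper.removeRegion(regionName);
        }
        System.out.println(docRightMarginIndicator);
    }
}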
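The cost of the scan above comes from calling `extractRegions` once per probe rectangle: on a US Letter page (612 x 792 pt) with a 5 pt step that is 123 x 159 = 19,557 passes over the same page. A possible alternative, sketched below under assumptions, is to collect the text bounding boxes in a single pass and build the same character map from them in memory. How the boxes are collected is left open here; one option might be a `PDFTextStripper` subclass that records each `TextPosition` in `writeString`, but that part is not shown and is an assumption. `OccupancyGrid`, `rasterize`, and `render` are hypothetical names.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

class OccupancyGrid {

    /** Marks every grid cell that is overlapped by at least one text box. */
    static boolean[][] rasterize(List<Rectangle2D> textBoxes,
                                 double pageWidth, double pageHeight, double cell) {
        int cols = (int) Math.ceil(pageWidth / cell);
        int rows = (int) Math.ceil(pageHeight / cell);
        boolean[][] grid = new boolean[rows][cols];
        for (Rectangle2D box : textBoxes) {
            // Cell index range covered by the box, clamped to the page.
            int c0 = Math.max(0, (int) Math.floor(box.getMinX() / cell));
            int c1 = Math.min(cols - 1, (int) Math.ceil(box.getMaxX() / cell) - 1);
            int r0 = Math.max(0, (int) Math.floor(box.getMinY() / cell));
            int r1 = Math.min(rows - 1, (int) Math.ceil(box.getMaxY() / cell) - 1);
            for (int r = r0; r <= r1; r++) {
                for (int c = c0; c <= c1; c++) {
                    grid[r][c] = true;
                }
            }
        }
        return grid;
    }

    /** Renders the grid with the same characters as the scan: '-' for text, ' ' for whitespace. */
    static String render(boolean[][] grid) {
        StringBuilder sb = new StringBuilder();
        for (boolean[] row : grid) {
            sb.append('|');
            for (boolean filled : row) {
                sb.append(filled ? '-' : ' ');
            }
            sb.append("|\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Rectangle2D> boxes = new ArrayList<>();
        boxes.add(new Rectangle2D.Double(10, 0, 10, 10));
        System.out.print(render(rasterize(boxes, 30, 20, 10)));
    }
}
```

With this approach the page is parsed once, and the per-cell work is plain array writes. Unlike the scan, the cells here do not overlap, so the map is not character-for-character identical to the one printed by the loop above.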

0 answers:

No answers