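Which PDFBox APIs can I use to identify the rectangular regions of a page that each contain one article, so that I can extract each article's text?
I am thinking of parsing the PDF content and treating the large areas of white space that enclose text as the boundaries of a region.
Here is code that extracts a single region whose size and position are hard-coded: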
System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");
PDDocument pdf = PDDocument.load(new File("SevenPropertiesofHighlySecureDevices.pdf"));
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.setSortByPosition(true);
// 1 pt is equal to 1/72 inch
Rectangle2D rectangle = new Rectangle2D.Double(0,0,200,200);
String regionName = "First Article Region";
stripper.addRegion(regionName, rectangle);
stripper.extractRegions(pdf.getPage(0));
LOGGER.info("getTextForRegion: \n{}", stripper.getTextForRegion(regionName));
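Since the rectangle above is given in points (1/72 inch), it can be easier to reason about a region in inches and convert. The following is only a small sketch of that arithmetic: the class name RegionUnits and the method regionFromInches are made up for illustration and are not part of PDFBox, and it assumes the page uses the default unit of 1 pt = 1/72 inch.

```java
import java.awt.geom.Rectangle2D;

public class RegionUnits {

    // PDF default user space: 72 points per inch.
    public static final double POINTS_PER_INCH = 72.0;

    /**
     * Hypothetical helper (not a PDFBox API): builds a rectangle in points
     * from a position and size given in inches.
     */
    public static Rectangle2D regionFromInches(double xIn, double yIn,
                                               double widthIn, double heightIn) {
        return new Rectangle2D.Double(
                xIn * POINTS_PER_INCH,
                yIn * POINTS_PER_INCH,
                widthIn * POINTS_PER_INCH,
                heightIn * POINTS_PER_INCH);
    }

    public static void main(String[] args) {
        // A 3 x 4 inch region placed 0.5 inch from the left and 1 inch from the top.
        Rectangle2D region = regionFromInches(0.5, 1.0, 3.0, 4.0);
        System.out.println(region.getX() + ", " + region.getY() + ", "
                + region.getWidth() + ", " + region.getHeight());
    }
}
```

The resulting Rectangle2D could then be passed to stripper.addRegion(regionName, region) in place of the hard-coded one.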
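If anyone wants to "scan" the PDF to find where the white space is, so that this information can be used to work out the rectangular regions for extracting articles, be warned that it is very slow and unlikely to be usable in practice, because every small probe rectangle triggers a separate extractRegions call over the whole page: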
System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider");
PDDocument pdf = PDDocument.load(new File("4.pdf"));
PDPage page = pdf.getPage(1);
PDRectangle cropBox = page.getCropBox();
float docWidth = cropBox.getUpperRightX();
float docHeight = cropBox.getUpperRightY();
float recWidth = 10;
float rectHeight = 10;
float xStep = recWidth / 2;
float yStep = rectHeight / 2;
String regionName = "docScannerRegion";
String docLeftMarginIndicator = "|";
String docRightMarginIndicator = "|";
String nonTextArea = " ";
String textArea = "-";
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
for (float y = 0; y < docHeight; y = y + yStep) {
    System.out.print(docLeftMarginIndicator);
    for (float x = 0; x < docWidth; x = x + xStep) {
        stripper.addRegion(regionName, new Rectangle2D.Double(x, y, recWidth, rectHeight));
        stripper.extractRegions(page);
        String txt = stripper.getTextForRegion(regionName).trim();
        if (StringUtils.isBlank(txt)) {
            System.out.print(nonTextArea);
        } else {
            System.out.print(textArea);
        }
    }
    System.out.println(docRightMarginIndicator);
}
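Once a text/no-text grid like the one printed above is available, turning it into candidate article rectangles could look roughly like the sketch below. It is not a PDFBox feature and not a tested solution for real layouts: the class GridRegionFinder and its method findRegions are hypothetical names, and it assumes the scan results were stored in a rectangular boolean[][] (true where text was found) instead of being printed, with each cell as wide and as tall as the scan step. It groups touching text cells with a flood fill and returns the bounding box of each group.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class GridRegionFinder {

    /**
     * Groups occupied cells into 8-connected components and returns the
     * bounding box of each component, scaled by the cell size.
     * Assumes every row of the grid has the same length.
     */
    public static List<Rectangle2D> findRegions(boolean[][] occupied,
                                                double cellWidth, double cellHeight) {
        List<Rectangle2D> regions = new ArrayList<>();
        int rows = occupied.length;
        if (rows == 0) {
            return regions;
        }
        int cols = occupied[0].length;
        boolean[][] visited = new boolean[rows][cols];

        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                if (!occupied[r][c] || visited[r][c]) {
                    continue;
                }
                // Flood fill from this cell, tracking the extent of the component.
                int minR = r, maxR = r, minC = c, maxC = c;
                Deque<int[]> stack = new ArrayDeque<>();
                stack.push(new int[] {r, c});
                visited[r][c] = true;
                while (!stack.isEmpty()) {
                    int[] cell = stack.pop();
                    minR = Math.min(minR, cell[0]);
                    maxR = Math.max(maxR, cell[0]);
                    minC = Math.min(minC, cell[1]);
                    maxC = Math.max(maxC, cell[1]);
                    for (int dr = -1; dr <= 1; dr++) {
                        for (int dc = -1; dc <= 1; dc++) {
                            int nr = cell[0] + dr;
                            int nc = cell[1] + dc;
                            if (nr < 0 || nr >= rows || nc < 0 || nc >= cols) {
                                continue;
                            }
                            if (occupied[nr][nc] && !visited[nr][nc]) {
                                visited[nr][nc] = true;
                                stack.push(new int[] {nr, nc});
                            }
                        }
                    }
                }
                regions.add(new Rectangle2D.Double(
                        minC * cellWidth,
                        minR * cellHeight,
                        (maxC - minC + 1) * cellWidth,
                        (maxR - minR + 1) * cellHeight));
            }
        }
        return regions;
    }

    public static void main(String[] args) {
        // Made-up 3 x 5 grid: one block of text top-left, one column of text on the right.
        boolean[][] grid = {
            {true,  true,  false, false, false},
            {true,  true,  false, false, true},
            {false, false, false, false, true},
        };
        for (Rectangle2D region : findRegions(grid, 5, 5)) {
            System.out.println(region);
        }
    }
}
```

Each returned rectangle could then be registered with addRegion under its own name and extracted in a single extractRegions call. On a real page, the gaps between lines and words inside one article may split it into several components, so the cell size would probably need tuning, or neighbouring rectangles would need to be merged.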