Unable to extract the value at specific coordinates from a PDF using Java Apache PDFBox

Time: 2019-03-14 04:24:21

Tags: java pdfbox

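My task is to extract the text at specific coordinates from a PDF.

I am using Apache PDFBox for the data extraction.

To get the x, y, height and width coordinates from the PDF, I use a PDF tool that reports them in millimeters. When I pass these values to the rectangle, no text is extracted; an empty value is returned.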

import java.awt.geom.Rectangle2D;
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripperByArea;

public String getTextUsingPositionsUsingPdf(String pdfLocation, int pageNumber, double x, double y, double width,
        double height) throws IOException {
    String extractedText = "";
    // PDDocument.load opens an existing PDF document from a file.
    PDDocument document = null;
    PDPage page = null;
    try {
        if (pdfLocation.endsWith(".pdf")) {
            document = PDDocument.load(new File(pdfLocation));
            int getDocumentPageCount = document.getNumberOfPages();
            System.out.println(getDocumentPageCount);

            // getPage takes a zero-based page index, so the first page has index 0.
            // Note: with pageNumber = 0 this call selects the page at index 1.
            if (getDocumentPageCount > 0) {
                page = document.getPage(pageNumber + 1);
            } else if (getDocumentPageCount == 0) {
                page = document.getPage(0);
            }
            // Create a rectangle from the x coordinate, y coordinate, width and height.
            Rectangle2D rect = new Rectangle2D.Double(x, y, width, height);
            String regionName = "region1";

            // Extract the text inside the rectangle with PDFTextStripperByArea;
            // each region has to be registered under a name.
            PDFTextStripperByArea stripper = new PDFTextStripperByArea();
            stripper.setSortByPosition(true);
            stripper.addRegion(regionName, rect);
            stripper.extractRegions(page);
            System.out.println("Region is " + stripper.getTextForRegion(regionName));
            extractedText = stripper.getTextForRegion(regionName);
        } else {
            System.out.println("No data return");
        }
    } catch (IOException e) {
        System.out.println("The file was not found");
    } finally {
        // document stays null if the path check or the load failed.
        if (document != null) {
            document.close();
        }
    }
    // Return the extracted text so it can be used in assertions.
    return extractedText;
}

Please advise whether my approach is correct.

1 answer:

Answer 0: (score: 1)

  

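I have used this PDF: tutorialspoint.com/uipath/uipath_tutorial.pdf. I am trying to find the text "a part of contests", which is at x = 55.6 mm, y = 168.8, width = 210.0 mm, height = 297.0. But I am getting an empty value.

I tested your method with these inputs: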

System.out.println("Extracting like Venkatachalam Neelakantan from uipath_tutorial.pdf\n");
float MM_TO_UNITS = 1/(10*2.54f)*72;
String text = getTextUsingPositionsUsingPdf("src/test/resources/mkl/testarea/pdfbox2/extract/uipath_tutorial.pdf", 0,
        55.6 * MM_TO_UNITS, 168.8 * MM_TO_UNITS, 210.0 * MM_TO_UNITS, 297.0 * MM_TO_UNITS);
System.out.printf("\n---\nResult:\n%s\n", text);

(ExtractText test)

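and got this result (test method testUiPathTutorial):

part of contents of this e-book in any manner without written consent
te the contents of our website and tutorials as timely and as precisely as
, the contents may contain inaccuracies or errors. Tutorials Point (I) Pvt.
guarantee regarding the accuracy, timeliness or completeness of our
tents including this tutorial. If you discover any errors on our website or
ease notify us at contact@tutorialspoint.com
i

Assuming you were actually looking for "a part of contents" rather than "a part of contests", only the 'a' is missing. When measuring, you probably looked for the start of the visible letter shape, but the actual glyph origin lies slightly before that. If you choose a slightly smaller x, e.g. 54.6 mm, you also get the 'a'.

Given the width and height of your rectangle, it is not surprising that you get more than just "a part of contents".

If you are wondering about the MM_TO_UNITS constant, take a look at this answer.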
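The unit conversion and the "slightly smaller x" suggestion can be sketched together as follows. This is only an illustration: `RegionUnits`, `mmToUnits` and `regionFromMm` are hypothetical names, not part of PDFBox, and the sketch assumes the PDF uses the default user space unit of 1/72 inch. The 1 mm left margin simply mirrors the change from 55.6 mm to 54.6 mm discussed above.

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical helper (not PDFBox API) for turning millimeter measurements into a region. */
final class RegionUnits {
    // Assumption: default PDF user space, where 1 unit = 1/72 inch and 1 inch = 25.4 mm.
    static final double MM_TO_UNITS = 72.0 / 25.4;

    /** Converts a length in millimeters to user space units. */
    static double mmToUnits(double mm) {
        return mm * MM_TO_UNITS;
    }

    /**
     * Builds a rectangle from millimeter measurements. The left edge is moved
     * outward by leftMarginMm so that a glyph whose origin lies slightly left of
     * its visible shape still falls inside; the right edge stays where it was.
     */
    static Rectangle2D regionFromMm(double xMm, double yMm, double widthMm,
                                    double heightMm, double leftMarginMm) {
        return new Rectangle2D.Double(
                mmToUnits(xMm - leftMarginMm),
                mmToUnits(yMm),
                mmToUnits(widthMm + leftMarginMm),
                mmToUnits(heightMm));
    }

    public static void main(String[] args) {
        // Values taken from the question, with a 1 mm margin on the left.
        Rectangle2D region = regionFromMm(55.6, 168.8, 210.0, 297.0, 1.0);
        System.out.println(region);
    }
}
```

The resulting rectangle could then be passed to `stripper.addRegion(regionName, rect)` in place of the one built from raw millimeter values. A smaller width and height would be needed to restrict the result to the single phrase.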