Question

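I want to use PDFBox to get the coordinates of every line on a PDF page. I can get character-level information, but I cannot get the line coordinates.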

Here is my code:

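import java.io.File;
import java.io.IOException;
import java.util.List;

// NullWriter is assumed to be the Apache Commons IO class
import org.apache.commons.io.output.NullWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

public class PDFFontExtractor extends PDFTextStripper {

    public PDFFontExtractor() throws IOException {
        super();
    }

    // Called by PDFTextStripper for each piece of text it emits
    @Override
    protected void writeString(String str, List<TextPosition> textPositions) throws IOException {
        System.out.println(str);
        for (TextPosition text : textPositions) {
            // Character-level font information
            System.out.println(text.getFont().getName());
            System.out.println(text.getFontSizeInPt());
        }
    }

    public static void main(String[] args) {
        File file = new File("/home/neha/Downloads/docs/General.pdf");

        // try-with-resources closes the document when done
        try (PDDocument document = PDDocument.load(file)) {
            PDFFontExtractor textStripper = new PDFFontExtractor();
            textStripper.setSortByPosition(true);
            // The extracted text is discarded; writeString does the printing
            textStripper.writeText(document, NullWriter.NULL_WRITER);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}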

Answer 1

If you are only looking for the text together with its page and line position in the PDF, you can do it like this:

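import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PDFFontExtractor extends PDFTextStripper {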

    public PDFFontExtractor() throws IOException {
        super();
    }

    public static void main(String[] args) {

        try (PDDocument document = PDDocument.load(new File("/home/neha/Downloads/docs/General.pdf"))) {
            PDFFontExtractor textStripper = new PDFFontExtractor();
            textStripper.setSortByPosition(true);
            for (int page = 1; page <= document.getNumberOfPages(); page++) {
                textStripper.setStartPage(page);
                textStripper.setEndPage(page);
                String pdfFileText = textStripper.getText(document);
                // Split into lines; "\\r?\\n" also handles Windows line endings
                String[] lines = pdfFileText.split("\\r?\\n");
                for (int line = 0; line < lines.length; line++) {
                    System.out.println(String.format("Page: %s, Line: %s, Text: %s", page, line, lines[line]));
                }
            }

        } catch (Exception e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}
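
Note that the loop above reports a line index, not a position on the page. If actual coordinates are needed, one possible direction is to group character positions by their baseline. The following is only a minimal sketch in plain Java: the Glyph class is a hypothetical stand-in for the x and y values a PDFBox TextPosition exposes (for example getXDirAdj() and getYDirAdj()), and the tolerance value is an assumption that would need tuning for a real document.

```java
import java.util.ArrayList;
import java.util.List;

class BaselineGroupingSketch {

    // Hypothetical stand-in for the values read from a PDFBox TextPosition
    static final class Glyph {
        final String text;
        final float x;
        final float baselineY;

        Glyph(String text, float x, float baselineY) {
            this.text = text;
            this.x = x;
            this.baselineY = baselineY;
        }
    }

    // Splits glyphs (assumed to be in reading order already) into lines.
    // A new line starts when the baseline differs from the baseline of the
    // first glyph of the current line by more than the tolerance.
    static List<List<Glyph>> groupByBaseline(List<Glyph> glyphs, float tolerance) {
        List<List<Glyph>> lines = new ArrayList<>();
        List<Glyph> current = new ArrayList<>();
        for (Glyph g : glyphs) {
            if (!current.isEmpty()
                    && Math.abs(g.baselineY - current.get(0).baselineY) > tolerance) {
                lines.add(current);
                current = new ArrayList<>();
            }
            current.add(g);
        }
        if (!current.isEmpty()) {
            lines.add(current);
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("H", 72f, 100f));
        glyphs.add(new Glyph("i", 78f, 100f));
        glyphs.add(new Glyph("!", 81f, 100.5f));
        glyphs.add(new Glyph("O", 72f, 114f));
        glyphs.add(new Glyph("k", 79f, 114f));

        List<List<Glyph>> lines = groupByBaseline(glyphs, 2f);
        for (int i = 0; i < lines.size(); i++) {
            StringBuilder sb = new StringBuilder();
            for (Glyph g : lines.get(i)) {
                sb.append(g.text);
            }
            Glyph first = lines.get(i).get(0);
            System.out.println("Line " + i + " starts at x=" + first.x
                    + ", y=" + first.baselineY + ": " + sb);
        }
    }
}
```

With setSortByPosition(true) the characters should already arrive in reading order, which is what this grouping relies on; rotated or multi-column text would likely need more care.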

Answer 2

I am not sure this is possible. I looked at the implementation of org.apache.pdfbox.text.PDFTextStripper and found that org.apache.pdfbox.text.PDFTextStripper#writeLine is private:

 /**
 * Write a list of string containing a whole line of a document.
 * 
 * @param line a list with the words of the given line
 * @throws IOException if something went wrong
 */
private void writeLine(List<WordWithTextPositions> line)
        throws IOException
{
    int numberOfStrings = line.size();
    for (int i = 0; i < numberOfStrings; i++)
    {
        WordWithTextPositions word = line.get(i);
        writeString(word.getText(), word.getTextPositions());
        if (i < numberOfStrings - 1)
        {
            writeWordSeparator();
        }
    }
}

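The example at https://svn.apache.org/viewvc/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/util/DrawPrintTextLocations.java?view=markup&sortby=date shows how to get the coordinates of words. If you run the code, you will see that the implementation draws a rectangle around each character. Perhaps if someone filed a ticket with Apache asking to make that particular method overridable, it would be a nice addition.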
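
Since per-character rectangles are available, a line rectangle could be approximated as the union of the character rectangles on that line. The sketch below shows only that arithmetic in plain Java. The Glyph and Box classes are hypothetical; the four Glyph fields are meant to mirror TextPosition's getXDirAdj(), getYDirAdj(), getWidthDirAdj() and getHeightDir(), with y growing downward and measured at the baseline. How the characters of one line get collected is left open here: one untested idea is to accumulate them in writeString and flush when the protected writeLineSeparator hook fires, but that is an assumption about PDFBox 2.x behavior, not something confirmed above.

```java
import java.util.ArrayList;
import java.util.List;

class LineBoundsSketch {

    // Hypothetical stand-in for one character's position values
    static final class Glyph {
        final float x;          // left edge
        final float baselineY;  // baseline, y grows downward
        final float width;
        final float height;

        Glyph(float x, float baselineY, float width, float height) {
            this.x = x;
            this.baselineY = baselineY;
            this.width = width;
            this.height = height;
        }
    }

    // Hypothetical result type: a rectangle given by its four edges
    static final class Box {
        final float left, top, right, bottom;

        Box(float left, float top, float right, float bottom) {
            this.left = left;
            this.top = top;
            this.right = right;
            this.bottom = bottom;
        }

        @Override
        public String toString() {
            return "Box[left=" + left + ", top=" + top
                    + ", right=" + right + ", bottom=" + bottom + "]";
        }
    }

    // Returns the smallest box that contains every glyph box of the line,
    // or null when the line has no glyphs.
    static Box boundsOf(List<Glyph> line) {
        if (line.isEmpty()) {
            return null;
        }
        float left = Float.MAX_VALUE;
        float top = Float.MAX_VALUE;
        float right = -Float.MAX_VALUE;
        float bottom = -Float.MAX_VALUE;
        for (Glyph g : line) {
            left = Math.min(left, g.x);
            right = Math.max(right, g.x + g.width);
            // The glyph box extends upward from the baseline by its height
            top = Math.min(top, g.baselineY - g.height);
            bottom = Math.max(bottom, g.baselineY);
        }
        return new Box(left, top, right, bottom);
    }

    public static void main(String[] args) {
        List<Glyph> line = new ArrayList<>();
        line.add(new Glyph(72f, 100f, 6f, 9f));
        line.add(new Glyph(78f, 100f, 5f, 7f));
        line.add(new Glyph(83f, 100f, 7f, 10f));
        System.out.println(boundsOf(line));
    }
}
```

Descenders below the baseline are ignored in this sketch, so the bottom edge is only an approximation.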

Get the line coordinates of a PDF using PDFBox (Java)
