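I want to get the coordinates of each line on a PDF page using PDFBox. I am able to get character-level information, but I cannot get line coordinates.
Here is my code: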
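    import java.io.File;
    import java.io.IOException;
    import java.util.List;

    import org.apache.commons.io.output.NullWriter; // assumed: NullWriter comes from Commons IO
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;
    import org.apache.pdfbox.text.TextPosition;

    public class PDFFontExtractor extends PDFTextStripper {

        public PDFFontExtractor() throws IOException {
            super();
        }

        // Called by PDFTextStripper with a chunk of text and its character positions.
        @Override
        protected void writeString(String str, List<TextPosition> textPositions) throws IOException {
            System.out.println(str);
            for (TextPosition text : textPositions) {
                System.out.println(text.getFont().getName());
                System.out.println(text.getFontSizeInPt());
            }
        }

        public static void main(String[] args) {
            File file = new File("/home/neha/Downloads/docs/General.pdf");
            // try-with-resources closes the document when done
            try (PDDocument document = PDDocument.load(file)) {
                PDFFontExtractor textStripper = new PDFFontExtractor();
                textStripper.setSortByPosition(true);
                textStripper.writeText(document, NullWriter.NULL_WRITER);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }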
Answer 0 (score: 0)
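If you are only looking for the text together with its page and line number in the PDF (rather than geometric coordinates), you can do it like this: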
    import java.io.File;
    import java.io.IOException;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public class PDFFontExtractor extends PDFTextStripper {

        public PDFFontExtractor() throws IOException {
            super();
        }

        public static void main(String[] args) {
            try (PDDocument document = PDDocument.load(new File("/home/neha/Downloads/docs/General.pdf"))) {
                PDFFontExtractor textStripper = new PDFFontExtractor();
                textStripper.setSortByPosition(true);
                // Extract one page at a time so each line can be tagged with its page number.
                for (int page = 1; page <= document.getNumberOfPages(); page++) {
                    textStripper.setStartPage(page);
                    textStripper.setEndPage(page);
                    String pdfFileText = textStripper.getText(document);
                    // Split by line; \r?\n also handles Windows line separators.
                    String[] lines = pdfFileText.split("\\r?\\n");
                    for (int line = 0; line < lines.length; line++) {
                        System.out.println(String.format("Page: %s, Line: %s, Text: %s", page, line, lines[line]));
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
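Splitting the extracted text this way yields line numbers but no geometry. If actual line rectangles are needed, one possible approach is to group the character boxes by baseline and take the union of each group. The sketch below is illustrative only and uses just the JDK (Java 16+ for records): `Glyph`, `Line`, `LineBoxes` and the `tolerance` parameter are hypothetical names, not PDFBox API. In PDFBox the glyph values would presumably come from `TextPosition.getXDirAdj()`, `getYDirAdj()`, `getWidthDirAdj()` and `getHeightDir()`, with y measured from the top of the page; that mapping is an assumption to verify against your PDFBox version.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class LineBoxes {

    /** Hypothetical stand-in for a TextPosition: x, baseline y (from page top), width, height. */
    record Glyph(String text, double x, double y, double width, double height) {
        Rectangle2D box() {
            // y is the baseline, so the glyph occupies the band [y - height, y].
            return new Rectangle2D.Double(x, y - height, width, height);
        }
    }

    /** One reconstructed line: its text and the union of its glyph boxes. */
    record Line(String text, Rectangle2D bounds) {}

    /** Groups glyphs whose baselines lie within `tolerance` of the first glyph of a line. */
    static List<Line> group(List<Glyph> glyphs, double tolerance) {
        List<Glyph> byBaseline = new ArrayList<>(glyphs);
        byBaseline.sort(Comparator.comparingDouble(Glyph::y));

        List<Line> lines = new ArrayList<>();
        List<Glyph> current = new ArrayList<>();
        for (Glyph g : byBaseline) {
            if (!current.isEmpty() && Math.abs(g.y() - current.get(0).y()) > tolerance) {
                lines.add(toLine(current));
                current = new ArrayList<>();
            }
            current.add(g);
        }
        if (!current.isEmpty()) {
            lines.add(toLine(current));
        }
        return lines;
    }

    private static Line toLine(List<Glyph> glyphs) {
        // Restore reading order within the line before building its text.
        glyphs.sort(Comparator.comparingDouble(Glyph::x));
        StringBuilder text = new StringBuilder();
        Rectangle2D bounds = null;
        for (Glyph g : glyphs) {
            text.append(g.text());
            bounds = (bounds == null) ? g.box() : bounds.createUnion(g.box());
        }
        return new Line(text.toString(), bounds);
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = List.of(
                new Glyph("k", 15, 120, 5, 8),
                new Glyph("H", 10, 100, 5, 8),
                new Glyph("o", 10, 120, 5, 8),
                new Glyph("i", 15, 100.2, 2, 8));
        for (Line line : group(glyphs, 1.0)) {
            System.out.println(line.text() + " -> " + line.bounds());
        }
    }
}
```

The tolerance is a judgment call: too small and superscripts or slightly misaligned glyphs split a line, too large and tightly spaced lines merge. Multi-column layouts and rotated text would need extra handling that this sketch does not attempt.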
Answer 1 (score: 0)
I am not sure whether this is possible. I looked at the implementation of org.apache.pdfbox.text.PDFTextStripper and found that org.apache.pdfbox.text.PDFTextStripper#writeLine is private:
    /**
     * Write a list of string containing a whole line of a document.
     *
     * @param line a list with the words of the given line
     * @throws IOException if something went wrong
     */
    private void writeLine(List<WordWithTextPositions> line)
            throws IOException
    {
        int numberOfStrings = line.size();
        for (int i = 0; i < numberOfStrings; i++)
        {
            WordWithTextPositions word = line.get(i);
            writeString(word.getText(), word.getTextPositions());
            if (i < numberOfStrings - 1)
            {
                writeWordSeparator();
            }
        }
    }
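The example at https://svn.apache.org/viewvc/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/util/DrawPrintTextLocations.java?view=markup&sortby=date shows how to get the coordinates of words. If you run the code, you will see that the implementation draws a rectangle around each character. Perhaps someone could file a ticket with Apache asking for that particular method to be made overridable; it would be a nice addition.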
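Because writeLine hands each word to writeString, a possible workaround that avoids overriding writeLine is to collect word boxes as they arrive and close the line when a line-end event fires. The sketch below shows only that accumulation step in plain JDK code. `LineAccumulator`, `addWord` and `endLine` are hypothetical names. It assumes you would call `addWord` from an overridden `writeString` (with a box computed from the `TextPosition` list) and `endLine` from an overridden `writeLineSeparator()` and once more at the end of each page. Whether `writeLineSeparator()` is invoked after every line in your PDFBox version is an assumption to check.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

final class LineAccumulator {

    private final List<Rectangle2D> finished = new ArrayList<>();
    private Rectangle2D current; // bounding box of the line being built, or null

    /** Call once per word, e.g. from an overridden writeString(...). */
    void addWord(Rectangle2D wordBox) {
        // getBounds2D() returns a copy, so the caller's rectangle is never modified.
        current = (current == null) ? wordBox.getBounds2D() : current.createUnion(wordBox);
    }

    /** Call when a line ends; does nothing if no word was added since the last call. */
    void endLine() {
        if (current != null) {
            finished.add(current);
            current = null;
        }
    }

    /** Bounding boxes of all completed lines, in the order they were closed. */
    List<Rectangle2D> lines() {
        return List.copyOf(finished);
    }

    public static void main(String[] args) {
        LineAccumulator acc = new LineAccumulator();
        acc.addWord(new Rectangle2D.Double(10, 92, 20, 8));
        acc.addWord(new Rectangle2D.Double(35, 92, 15, 8));
        acc.endLine();
        acc.addWord(new Rectangle2D.Double(10, 112, 30, 8));
        acc.endLine();
        acc.lines().forEach(System.out::println);
    }
}
```

Making `endLine` a no-op on an empty line means it is safe to call it defensively, for example at the end of a page where the last line may not be followed by a separator.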