Getting the font of each line using PDFBox

Asked: 2014-02-11 15:25:38

Tags: pdf fonts pdfbox

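Is there a way to get the font of each line of a PDF file using PDFBox? I tried the following, but it only lists all the fonts used on each page. It does not show which lines or which text are rendered in each font.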

List<PDPage> pages = doc.getDocumentCatalog().getAllPages();
for (PDPage page : pages)
{
    Map<String, PDFont> pageFonts = page.getResources().getFonts();
    for (String key : pageFonts.keySet())
    {
        System.out.println(key + " - " + pageFonts.get(key));
        System.out.println(pageFonts.get(key).getBaseFont());
    }
}

Any input is appreciated. Thanks!

2 Answers:

Answer 0: (score: 13)

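Whenever you want to extract text (plain or with styling information) from a PDF using PDFBox, you should generally start with the PDFTextStripper class or one of its relatives. This class already does all the heavy lifting involved in parsing PDF content.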

You use the plain PDFTextStripper class like this:

PDDocument document = ...;
PDFTextStripper stripper = new PDFTextStripper();
// set stripper start and end page or bookmark attributes unless you want all the text
String text = stripper.getText(document);

This returns only the plain text, e.g. for some R40 form:

Claim for repayment of tax deducted 
from savings and investments
How to fill in this form
Please fill in this form with details of your income for the
above tax year. The enclosed Notes will help you (but there is
not a note for every box on the form). If you need more help
with anything on this form, please phone us on the number
shown above.
If you are not a UK resident, do not use this form – please 
contact us.
Please do not send us any personal records, or tax
certificates or vouchers with your form. We will contact 
you if we need these.
Please allow four weeks before contacting us about your
repayment. We will pay you as quickly as possible.
Use black ink and capital letters
Cross out any mistakes and write the
correct information below
...

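On the other hand, you can override its method writeString(String, List<TextPosition>) and process more information than just the plain text. To add the font name wherever the font changes, you can use: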

PDFTextStripper stripper = new PDFTextStripper() {
    // Name of the font seen most recently, kept across calls
    String prevBaseFont = "";

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        StringBuilder builder = new StringBuilder();

        for (TextPosition position : textPositions)
        {
            String baseFont = position.getFont().getBaseFont();
            // Emit a [fontname] marker whenever the font changes
            if (baseFont != null && !baseFont.equals(prevBaseFont))
            {
                builder.append('[').append(baseFont).append(']');
                prevBaseFont = baseFont;
            }
            builder.append(position.getCharacter());
        }

        // Hand the annotated text to the stripper's normal output
        writeString(builder.toString());
    }
};

For the same form, this produces:

[DHSLTQ+IRModena-Bold]Claim for repayment of tax deducted 
from savings and investments
How to fill in this form
[OIALXD+IRModena-Regular]Please fill in this form with details of your income for the
above tax year. The enclosed Notes will help you (but there is
not a note for every box on the form). If you need more help
with anything on this form, please phone us on the number
shown above.
If you are not a UK resident, do not use this form – please 
contact us.
[DHSLTQ+IRModena-Bold]Please do not send us any personal records, or tax
certificates or vouchers with your form. We will contact 
you if we need these.
[OIALXD+IRModena-Regular]Please allow four weeks before contacting us about your
repayment. We will pay you as quickly as possible.
Use black ink and capital letters
Cross out any mistakes and write the
correct information below
...
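The six uppercase letters and the `+` at the start of names such as DHSLTQ+IRModena-Bold are a subset tag, which PDF producers prepend when only part of a font is embedded. If you only care about the underlying font name, a minimal sketch for removing it could look like this. The class and method names are made up for illustration and are not part of PDFBox.

```java
import java.util.regex.Pattern;

public class SubsetTag {
    // A subset tag is six uppercase letters followed by '+'.
    private static final Pattern SUBSET_TAG = Pattern.compile("^[A-Z]{6}\\+");

    /** Returns the font name without a leading subset tag, or the input unchanged. */
    static String stripSubsetTag(String baseFont) {
        if (baseFont == null) {
            return null; // some fonts may not report a name at all
        }
        return SUBSET_TAG.matcher(baseFont).replaceFirst("");
    }

    public static void main(String[] args) {
        System.out.println(stripSubsetTag("DHSLTQ+IRModena-Bold")); // prints IRModena-Bold
        System.out.println(stripSubsetTag("Helvetica"));            // prints Helvetica
    }
}
```

You could apply such a helper to the font name before comparing it with the previous one, so that two differently tagged subsets of the same font are treated as equal.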

If you do not want the font information merged into the text, simply build a separate structure in your method override.
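As a rough sketch of what such a separate structure might look like, the following groups consecutive characters into runs that share one font. To keep it runnable without PDFBox, it uses a hypothetical Glyph class as a stand-in for the font name and character you would read from each TextPosition; Glyph, FontRun and toRuns are illustrative names only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

public class FontRuns {
    /** Hypothetical stand-in for the two values read from a TextPosition. */
    static final class Glyph {
        final String fontName;
        final String text;
        Glyph(String fontName, String text) {
            this.fontName = fontName;
            this.text = text;
        }
    }

    /** A stretch of consecutive text drawn in one font. */
    static final class FontRun {
        final String fontName;
        final StringBuilder text = new StringBuilder();
        FontRun(String fontName) {
            this.fontName = fontName;
        }
        @Override
        public String toString() {
            return fontName + ": " + text;
        }
    }

    /** Groups glyphs into runs, starting a new run whenever the font name changes. */
    static List<FontRun> toRuns(List<Glyph> glyphs) {
        List<FontRun> runs = new ArrayList<>();
        FontRun current = null;
        for (Glyph g : glyphs) {
            // Objects.equals tolerates glyphs without a font name
            if (current == null || !Objects.equals(current.fontName, g.fontName)) {
                current = new FontRun(g.fontName);
                runs.add(current);
            }
            current.text.append(g.text);
        }
        return runs;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("IRModena-Bold", "H"));
        glyphs.add(new Glyph("IRModena-Bold", "i"));
        glyphs.add(new Glyph("IRModena-Regular", "!"));
        for (FontRun run : toRuns(glyphs)) {
            System.out.println(run);
        }
    }
}
```

Inside a writeString override you would append to such a list instead of building an annotated string, and read the list after getText returns.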

TextPosition provides much more information about the text it represents. Have a look at it!
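For example, TextPosition also exposes a vertical coordinate (getYDirAdj()), which is one possible way to get back to the original question of fonts per line. The sketch below assumes that characters whose y value rounds to the same integer belong to the same line, which is a simplification that may not hold for every document. It again uses a hypothetical Glyph class in place of TextPosition so that it runs without PDFBox.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class FontsPerLine {
    /** Hypothetical stand-in for the y coordinate and font name of a TextPosition. */
    static final class Glyph {
        final float y;
        final String fontName;
        Glyph(float y, String fontName) {
            this.y = y;
            this.fontName = fontName;
        }
    }

    /** Maps each rounded y coordinate to the distinct fonts seen at that height, in encounter order. */
    static Map<Integer, Set<String>> fontsByLine(List<Glyph> glyphs) {
        Map<Integer, Set<String>> result = new LinkedHashMap<>();
        for (Glyph g : glyphs) {
            // Assumption: glyphs on one line share the same rounded y value
            int lineKey = Math.round(g.y);
            result.computeIfAbsent(lineKey, k -> new LinkedHashSet<>()).add(g.fontName);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph(700.0f, "IRModena-Bold"));
        glyphs.add(new Glyph(700.2f, "IRModena-Regular"));
        glyphs.add(new Glyph(686.0f, "IRModena-Regular"));
        System.out.println(fontsByLine(glyphs));
    }
}
```

A line that mixes fonts simply ends up with more than one entry in its set.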

Answer 1: (score: 1)

To add to mkl's answer, if you are using PDFBox 2.0.8:

  • Use position.getFont().getName() instead of position.getFont().getBaseFont()
  • Use position.getUnicode() instead of position.getCharacter()

For more information on PDFont and TextPosition, see their Javadocs online.