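Is there a way to get the font of each line of a PDF file using PDFBox? I tried the following, but it only lists all the fonts used on the page. It does not show which lines or which text are rendered in each font.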
List<PDPage> pages = doc.getDocumentCatalog().getAllPages();
for (PDPage page : pages)
{
    Map<String, PDFont> pageFonts = page.getResources().getFonts();
    for (String key : pageFonts.keySet())
    {
        System.out.println(key + " - " + pageFonts.get(key));
        System.out.println(pageFonts.get(key).getBaseFont());
    }
}
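Any input is appreciated. Thanks!
Answer 0 (score: 13)
Whenever you want to extract text (plain or with styling information) from a PDF using PDFBox, you should generally start with the PDFTextStripper class or one of its relatives. This class already does all the heavy lifting involved in parsing PDF content.
You use the plain PDFTextStripper class like this: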
PDDocument document = ...;
PDFTextStripper stripper = new PDFTextStripper();
// set stripper start and end page or bookmark attributes unless you want all the text
String text = stripper.getText(document);
This returns only the plain text, e.g. from some R40 form:
Claim for repayment of tax deducted from savings and investments How to fill in this form Please fill in this form with details of your income for the above tax year. The enclosed Notes will help you (but there is not a note for every box on the form). If you need more help with anything on this form, please phone us on the number shown above. If you are not a UK resident, do not use this form – please contact us. Please do not send us any personal records, or tax certificates or vouchers with your form. We will contact you if we need these. Please allow four weeks before contacting us about your repayment. We will pay you as quickly as possible. Use black ink and capital letters Cross out any mistakes and write the correct information below ...
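On the other hand, you can override its method writeString(String, List<TextPosition>) and process more information than the plain text. To insert the font name wherever the font changes, you can use: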
PDFTextStripper stripper = new PDFTextStripper() {
    String prevBaseFont = "";

    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        StringBuilder builder = new StringBuilder();

        for (TextPosition position : textPositions)
        {
            String baseFont = position.getFont().getBaseFont();
            if (baseFont != null && !baseFont.equals(prevBaseFont))
            {
                builder.append('[').append(baseFont).append(']');
                prevBaseFont = baseFont;
            }
            builder.append(position.getCharacter());
        }

        writeString(builder.toString());
    }
};
For the same form, this yields:
[DHSLTQ+IRModena-Bold]Claim for repayment of tax deducted from savings and investments How to fill in this form [OIALXD+IRModena-Regular]Please fill in this form with details of your income for the above tax year. The enclosed Notes will help you (but there is not a note for every box on the form). If you need more help with anything on this form, please phone us on the number shown above. If you are not a UK resident, do not use this form – please contact us. [DHSLTQ+IRModena-Bold]Please do not send us any personal records, or tax certificates or vouchers with your form. We will contact you if we need these. [OIALXD+IRModena-Regular]Please allow four weeks before contacting us about your repayment. We will pay you as quickly as possible. Use black ink and capital letters Cross out any mistakes and write the correct information below ...
If you do not want the font information merged into the text, simply build separate structures in your method override.
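As a rough illustration of such a separate structure, the sketch below groups consecutive characters into runs that share a font name. It deliberately does not depend on PDFBox: the Glyph class is a hypothetical stand-in for the two values read from each TextPosition in the override (the character and the base font name), and FontRuns, Run and group are names invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Groups consecutive characters that share a font into runs. */
final class FontRuns {

    /** Hypothetical stand-in for the character and font name of one TextPosition. */
    static final class Glyph {
        final String text;
        final String fontName; // may be null, just like a missing base font

        Glyph(String text, String fontName) {
            this.text = text;
            this.fontName = fontName;
        }
    }

    /** One stretch of text rendered in a single font. */
    static final class Run {
        final String fontName;
        final StringBuilder text = new StringBuilder();

        Run(String fontName) {
            this.fontName = fontName;
        }

        @Override
        public String toString() {
            return fontName + ": " + text;
        }
    }

    static List<Run> group(List<Glyph> glyphs) {
        List<Run> runs = new ArrayList<>();
        Run current = null;
        for (Glyph glyph : glyphs) {
            // Same rule as the override: a null font name never starts a new run.
            boolean fontChanged = glyph.fontName != null
                    && (current == null || !glyph.fontName.equals(current.fontName));
            if (current == null || fontChanged) {
                current = new Run(glyph.fontName);
                runs.add(current);
            }
            current.text.append(glyph.text);
        }
        return runs;
    }

    public static void main(String[] args) {
        // Made-up sample data; real input would come from the stripper.
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("C", "IRModena-Bold"),
                new Glyph("l", "IRModena-Bold"),
                new Glyph("P", "IRModena-Regular"));
        for (Run run : group(glyphs)) {
            System.out.println(run);
        }
    }
}
```

Inside the writeString override you would create one Glyph per TextPosition instead of appending to the StringBuilder, and keep the resulting list in a field of the stripper so that it is still available after getText returns.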
TextPosition
provides much more information about the text it represents. Inspect it!
Answer 1 (score: 1)
To add to mkl's answer, if you are using PDFBox 2.0.8, use
position.getFont().getName()
instead of position.getFont().getBaseFont(), and
position.getUnicode()
instead of position.getCharacter().
For more information on PDFont and TextPosition, see their Javadocs online.