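Extracting font color and font type from a PDF using PDFBox

Time: 2016-10-27 10:36:19

Tags: java pdf fonts pdfbox

I need to extract the font color along with the font type [for example: black, Tahoma, bold] from a PDF in Java (using PDFBox). Below is the code I wrote to extract the font type and embed it in the extracted text.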

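import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.List;

// These imports assume PDFBox 1.8.x, which matches the
// getCharacter() and getBaseFont() calls used below.
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

public class PDFParse {

    public static void main(String[] args) {
        PDDocument pdDoc = null;
        File file = new File("Sample Bill.pdf");
        try {
            PDFParser parser = new PDFParser(new FileInputStream(file));
            parser.parse();
            COSDocument cosDoc = parser.getDocument();

            PDFTextStripper pdfStripper = new PDFTextStripper() {
                // Base font of the previous character, so a tag is written only when the font changes.
                String prevBaseFont = "";

                @Override
                protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
                    StringBuilder builder = new StringBuilder();

                    for (TextPosition position : textPositions) {
                        String baseFont = position.getFont().getBaseFont();
                        if (baseFont != null && !baseFont.equals(prevBaseFont)) {
                            // Embed the font name in brackets before the first character that uses it.
                            builder.append('[').append(baseFont).append(']');
                            prevBaseFont = baseFont;
                        }
                        builder.append(position.getCharacter());
                    }

                    writeString(builder.toString());
                }
            };

            pdDoc = new PDDocument(cosDoc);
            pdfStripper.setStartPage(1);
            pdfStripper.setEndPage(5);
            pdfStripper.setSortByPosition(true);
            String parsedText = pdfStripper.getText(pdDoc);

            // Write the tagged text to a file and echo it to the console.
            PrintWriter out = new PrintWriter("sample.txt");
            out.println(parsedText);
            out.close();
            System.out.println(parsedText);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Release the document even if parsing or text extraction failed.
            if (pdDoc != null) {
                try {
                    pdDoc.close();
                } catch (IOException ignored) {
                    // Nothing useful can be done if closing fails.
                }
            }
        }
    }
}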

How can I extract the font color of each word and embed it in the same extracted file? Thanks :))
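As a rough illustration of the target format [black, Tahoma, bold], here is a minimal sketch of how such a tag could be assembled once a base font name and the RGB components of the text color are known. It uses only the Java standard library and does not show how to read the color from PDFBox, which depends on the PDFBox version. The class FontTag and all of its methods are hypothetical names introduced for this example, the color is written as a hex string rather than a color name, and the style detection is a heuristic on the font name rather than anything PDF guarantees.

```java
import java.util.Locale;

/** Hypothetical helper (not part of PDFBox): builds a "[color, family, style]" tag. */
final class FontTag {

    private FontTag() { }

    /** Removes a subset prefix such as "ABCDEF+" (six uppercase letters and a plus sign). */
    static String stripSubsetPrefix(String baseFont) {
        return baseFont.matches("[A-Z]{6}\\+.+") ? baseFont.substring(7) : baseFont;
    }

    /** Heuristic: the family is the name up to the first '-' or ','. */
    static String guessFamily(String baseFont) {
        String name = stripSubsetPrefix(baseFont);
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (c == '-' || c == ',') {
                return name.substring(0, i);
            }
        }
        return name;
    }

    /** Heuristic: looks for "bold", "italic" or "oblique" in the name (prefix removed first). */
    static String guessStyle(String baseFont) {
        String lower = stripSubsetPrefix(baseFont).toLowerCase(Locale.ROOT);
        boolean bold = lower.contains("bold");
        boolean italic = lower.contains("italic") || lower.contains("oblique");
        if (bold && italic) {
            return "bold italic";
        }
        if (bold) {
            return "bold";
        }
        return italic ? "italic" : "regular";
    }

    /** Converts RGB components in the range 0..1 to a "#rrggbb" string. */
    static String toHex(float r, float g, float b) {
        return String.format(Locale.ROOT, "#%02x%02x%02x", scale(r), scale(g), scale(b));
    }

    /** Clamps a component to 0..1 and scales it to 0..255. */
    private static int scale(float component) {
        float clamped = Math.max(0f, Math.min(1f, component));
        return Math.round(clamped * 255f);
    }

    /** Assembles the tag, for example "[#000000, Tahoma, bold]". */
    static String describe(String baseFont, float r, float g, float b) {
        return "[" + toHex(r, g, b) + ", " + guessFamily(baseFont) + ", " + guessStyle(baseFont) + "]";
    }
}
```

With this sketch, FontTag.describe("ABCDEF+Tahoma-Bold", 0f, 0f, 0f) yields [#000000, Tahoma, bold]. A tag like this could replace the plain base font name appended in writeString above, assuming the color components for each TextPosition have been captured separately.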

0 answers:

No answers