如何使用PDFBox确定实际PDF内容的位置?

时间:2018-10-15 16:56:56

标签: java pdfbox

我们正在使用PDFBox从Java桌面应用程序中打印一些PDF,并且这些PDF包含太多空格(不幸的是,不能选择修复PDF生成器)。

我遇到的问题是确定页面上实际内容的位置,因为裁剪/媒体/修剪/艺术/出血框没有用。是否有一些简单有效的方法,比将页面呈现为图像并检查哪些像素保持白色更好/更快?

enter image description here

1 个答案:

答案 0 :(得分:2)

正如您在评论中提到的那样

  

可以假定没有背景或其他需要特殊处理的元素,

我将展示基本解决方案,而无需进行任何特殊处理。

基本的边界框查找器

要找到边界框而不实际渲染到位图并检查位图像素,必须扫描页面内容流的所有指令以及从那里引用的所有XObject。可以确定每条指令绘制的内容的边界框,并最终将它们组合为一个框。

这里展示的简单盒子查找器通过简单地返回其并集的边界框来将它们组合在一起。

为了扫描内容流的指令,PDFBox提供了基于PDFStreamEngine的许多类。简单的盒子查找器是从PDFGraphicsStreamEngine派生而来的,它通过与矢量图形有关的某种方法扩展了PDFStreamEngine

public class BoundingBoxFinder extends PDFGraphicsStreamEngine {
    public BoundingBoxFinder(PDPage page) {
        super(page);
    }

    public Rectangle2D getBoundingBox() {
        return rectangle;
    }

    //
    // Text
    //
    @Override
    protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode, Vector displacement)
            throws IOException {
        super.showGlyph(textRenderingMatrix, font, code, unicode, displacement);
        Shape shape = calculateGlyphBounds(textRenderingMatrix, font, code);
        if (shape != null) {
            Rectangle2D rect = shape.getBounds2D();
            add(rect);
        }
    }

    /**
     * Copy of <code>org.apache.pdfbox.examples.util.DrawPrintTextLocations.calculateGlyphBounds(Matrix, PDFont, int)</code>.
     */
    private Shape calculateGlyphBounds(Matrix textRenderingMatrix, PDFont font, int code) throws IOException
    {
        GeneralPath path = null;
        AffineTransform at = textRenderingMatrix.createAffineTransform();
        at.concatenate(font.getFontMatrix().createAffineTransform());
        if (font instanceof PDType3Font)
        {
            // It is difficult to calculate the real individual glyph bounds for type 3 fonts
            // because these are not vector fonts, the content stream could contain almost anything
            // that is found in page content streams.
            PDType3Font t3Font = (PDType3Font) font;
            PDType3CharProc charProc = t3Font.getCharProc(code);
            if (charProc != null)
            {
                BoundingBox fontBBox = t3Font.getBoundingBox();
                PDRectangle glyphBBox = charProc.getGlyphBBox();
                if (glyphBBox != null)
                {
                    // PDFBOX-3850: glyph bbox could be larger than the font bbox
                    glyphBBox.setLowerLeftX(Math.max(fontBBox.getLowerLeftX(), glyphBBox.getLowerLeftX()));
                    glyphBBox.setLowerLeftY(Math.max(fontBBox.getLowerLeftY(), glyphBBox.getLowerLeftY()));
                    glyphBBox.setUpperRightX(Math.min(fontBBox.getUpperRightX(), glyphBBox.getUpperRightX()));
                    glyphBBox.setUpperRightY(Math.min(fontBBox.getUpperRightY(), glyphBBox.getUpperRightY()));
                    path = glyphBBox.toGeneralPath();
                }
            }
        }
        else if (font instanceof PDVectorFont)
        {
            PDVectorFont vectorFont = (PDVectorFont) font;
            path = vectorFont.getPath(code);

            if (font instanceof PDTrueTypeFont)
            {
                PDTrueTypeFont ttFont = (PDTrueTypeFont) font;
                int unitsPerEm = ttFont.getTrueTypeFont().getHeader().getUnitsPerEm();
                at.scale(1000d / unitsPerEm, 1000d / unitsPerEm);
            }
            if (font instanceof PDType0Font)
            {
                PDType0Font t0font = (PDType0Font) font;
                if (t0font.getDescendantFont() instanceof PDCIDFontType2)
                {
                    int unitsPerEm = ((PDCIDFontType2) t0font.getDescendantFont()).getTrueTypeFont().getHeader().getUnitsPerEm();
                    at.scale(1000d / unitsPerEm, 1000d / unitsPerEm);
                }
            }
        }
        else if (font instanceof PDSimpleFont)
        {
            PDSimpleFont simpleFont = (PDSimpleFont) font;

            // these two lines do not always work, e.g. for the TT fonts in file 032431.pdf
            // which is why PDVectorFont is tried first.
            String name = simpleFont.getEncoding().getName(code);
            path = simpleFont.getPath(name);
        }
        else
        {
            // shouldn't happen, please open issue in JIRA
            System.out.println("Unknown font class: " + font.getClass());
        }
        if (path == null)
        {
            return null;
        }
        return at.createTransformedShape(path.getBounds2D());
    }

    //
    // Bitmaps
    //
    @Override
    public void drawImage(PDImage pdImage) throws IOException {
        Matrix ctm = getGraphicsState().getCurrentTransformationMatrix();
        for (int x = 0; x < 2; x++) {
            for (int y = 0; y < 2; y++) {
                add(ctm.transformPoint(x, y));
            }
        }
    }

    //
    // Paths
    //
    @Override
    public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException {
        addToPath(p0, p1, p2, p3);
    }

    @Override
    public void clip(int windingRule) throws IOException {
    }

    @Override
    public void moveTo(float x, float y) throws IOException {
        addToPath(x, y);
    }

    @Override
    public void lineTo(float x, float y) throws IOException {
        addToPath(x, y);
    }

    @Override
    public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException {
        addToPath(x1, y1);
        addToPath(x2, y2);
        addToPath(x3, y3);
    }

    @Override
    public Point2D getCurrentPoint() throws IOException {
        return null;
    }

    @Override
    public void closePath() throws IOException {
    }

    @Override
    public void endPath() throws IOException {
        rectanglePath = null;
    }

    @Override
    public void strokePath() throws IOException {
        addPath();
    }

    @Override
    public void fillPath(int windingRule) throws IOException {
        addPath();
    }

    @Override
    public void fillAndStrokePath(int windingRule) throws IOException {
        addPath();
    }

    @Override
    public void shadingFill(COSName shadingName) throws IOException {
    }

    void addToPath(Point2D... points) {
        Arrays.asList(points).forEach(p -> addToPath(p.getX(), p.getY()));
    }

    void addToPath(double newx, double newy) {
        if (rectanglePath == null) {
            rectanglePath = new Rectangle2D.Double(newx, newy, 0, 0);
        } else {
            rectanglePath.add(newx, newy);
        }
    }

    void addPath() {
        if (rectanglePath != null) {
            add(rectanglePath);
            rectanglePath = null;
        }
    }

    void add(Rectangle2D rect) {
        if (rectangle == null) {
            rectangle = new Rectangle2D.Double();
            rectangle.setRect(rect);
        } else {
            rectangle.add(rect);
        }
    }

    void add(Point2D... points) {
        for (Point2D point : points) {
            add(point.getX(), point.getY());
        }
    }

    void add(double newx, double newy) {
        if (rectangle == null) {
            rectangle = new Rectangle2D.Double(newx, newy, 0, 0);
        } else {
            rectangle.add(newx, newy);
        }
    }

    Rectangle2D rectanglePath = null;
    Rectangle2D rectangle = null;
}

(github上的BoundingBoxFinder

如您所见,我从PDFBox示例类中借用了calculateGlyphBounds帮助方法。

用法示例

对于BoundingBoxFinder的给定PDPage pdPage,您可以像这样使用PDDocument pdDocument沿着边界框边缘画一条边界线:

void drawBoundingBox(PDDocument pdDocument, PDPage pdPage) throws IOException {
    BoundingBoxFinder boxFinder = new BoundingBoxFinder(pdPage);
    boxFinder.processPage(pdPage);
    Rectangle2D box = boxFinder.getBoundingBox();
    if (box != null) {
        try (   PDPageContentStream canvas = new PDPageContentStream(pdDocument, pdPage, AppendMode.APPEND, true, true)) {
            canvas.setStrokingColor(Color.magenta);
            canvas.addRect((float)box.getMinX(), (float)box.getMinY(), (float)box.getWidth(), (float)box.getHeight());
            canvas.stroke();
        }
    }
}

DetermineBoundingBox辅助方法)

结果如下:

Screenshot

仅概念证明

请注意,BoundingBoxFinder确实不是很复杂。特别是它不会忽略不可见的内容,例如白色背景矩形,在渲染模式下绘制的文本为“不可见”,由白色填充路径覆盖的任意内容,位图图像的白色部分,等等。此外,它确实忽略了剪切路径,很奇怪混合模式,注释,...

扩展类以正确处理这些情况很简单,但是要添加的代码总和将超出堆栈溢出答案的范围。


对于此答案中的代码,我使用了当前的PDFBox 3.0.0-SNAPSHOT开发分支,但对于当前的2.x版本,它也应该开箱即用。