Question

Is there a way to preserve text formatting when extracting text from a PDF with PDFBox?

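I have a program that parses PDF documents for information. When a new version of a PDF is released, the authors use bold or italic text to mark new information and strike-through or underlined text to mark omitted text. The basic text stripper class in PDFBox returns all of the text, but the formatting is removed, so I cannot tell whether a piece of text is new or omitted. I am currently using the project's example code below: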

    Dim doc As PDDocument = Nothing

    Try
        doc = PDDocument.load(RFPFilePath)
        Dim stripper As New PDFTextStripper()

        stripper.setAddMoreFormatting(True)
        stripper.setSortByPosition(True)
        rtxt_DocumentViewer.Text = stripper.getText(doc)

    Finally
        If doc IsNot Nothing Then
            doc.close()
        End If
    End Try

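If I simply copy and paste the PDF text into a RichTextBox, which preserves the formatting, my parsing code works fine. I was planning to do this programmatically by opening the PDF, selecting all, copying, closing the document and then pasting into my RichTextBox, but that seems clunky.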

Answer 1

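As the OP mentioned in a comment that a Java example would do, and as I have so far used PDFBox only with Java, this answer features a Java example. Furthermore, this example has been developed and tested with PDFBox version 1.8.11 only.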

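A customized text stripper

As already mentioned in a comment,

The bold and italic effects in the OP's sample document are generated by using a different font (containing bold or italic versions of the letters) to draw the text. The underline and strike-through effects in the sample document are generated by drawing a rectangle under or through the text line; the rectangle has the width of the text line and a very small height. To extract this information, one therefore has to extend the PDFTextStripper to somehow react to font changes and to rectangles near text.

This is an example class extending the PDFTextStripper in just that way: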

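// Imports for the PDFBox 1.8.x package layout (the version this example was tested with)
import java.awt.geom.Point2D;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSNumber;
import org.apache.pdfbox.util.Matrix;
import org.apache.pdfbox.util.PDFOperator;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;
import org.apache.pdfbox.util.operator.OperatorProcessor;

public class PDFStyledTextStripper extends PDFTextStripper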
{
    public PDFStyledTextStripper() throws IOException
    {
        super();
        registerOperatorProcessor("re", new AppendRectangleToPath());
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        for (TextPosition textPosition : textPositions)
        {
            Set<String> style = determineStyle(textPosition);
            if (!style.equals(currentStyle))
            {
                output.write(style.toString());
                currentStyle = style;
            }
            output.write(textPosition.getCharacter());
        }
    }

    Set<String> determineStyle(TextPosition textPosition)
    {
        Set<String> result = new HashSet<>();

        if (textPosition.getFont().getBaseFont().toLowerCase().contains("bold"))
            result.add("Bold");

        if (textPosition.getFont().getBaseFont().toLowerCase().contains("italic"))
            result.add("Italic");

        if (rectangles.stream().anyMatch(r -> r.underlines(textPosition)))
            result.add("Underline");

        if (rectangles.stream().anyMatch(r -> r.strikesThrough(textPosition)))
            result.add("StrikeThrough");

        return result;
    }

    class AppendRectangleToPath extends OperatorProcessor
    {
        public void process(PDFOperator operator, List<COSBase> arguments)
        {
            COSNumber x = (COSNumber) arguments.get(0);
            COSNumber y = (COSNumber) arguments.get(1);
            COSNumber w = (COSNumber) arguments.get(2);
            COSNumber h = (COSNumber) arguments.get(3);

            double x1 = x.doubleValue();
            double y1 = y.doubleValue();

            // create a pair of coordinates for the transformation
            double x2 = w.doubleValue() + x1;
            double y2 = h.doubleValue() + y1;

            Point2D p0 = transformedPoint(x1, y1);
            Point2D p1 = transformedPoint(x2, y1);
            Point2D p2 = transformedPoint(x2, y2);
            Point2D p3 = transformedPoint(x1, y2);

            rectangles.add(new TransformedRectangle(p0, p1, p2, p3));
        }

        Point2D.Double transformedPoint(double x, double y)
        {
            double[] position = {x,y}; 
            getGraphicsState().getCurrentTransformationMatrix().createAffineTransform().transform(
                    position, 0, position, 0, 1);
            return new Point2D.Double(position[0],position[1]);
        }
    }

    static class TransformedRectangle
    {
        public TransformedRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3)
        {
            this.p0 = p0;
            this.p1 = p1;
            this.p2 = p2;
            this.p3 = p3;
        }

        boolean strikesThrough(TextPosition textPosition)
        {
            Matrix matrix = textPosition.getTextPos();
            // TODO: This is a very simplistic implementation only working for horizontal text without page rotation
            // and horizontal rectangular strikeThroughs with p0 at the left bottom and p2 at the right top

            // Check if rectangle horizontally matches (at least) the text
            if (p0.getX() > matrix.getXPosition() || p2.getX() < matrix.getXPosition() + textPosition.getWidth() - textPosition.getFontSizeInPt() / 10.0)
                return false;
            // Check whether rectangle vertically is at the right height to strike through
            double vertDiff = p0.getY() - matrix.getYPosition();
            if (vertDiff < 0 || vertDiff > textPosition.getFont().getFontDescriptor().getAscent() * textPosition.getFontSizeInPt() / 1000.0)
                return false;
            // Check whether rectangle is small enough to be a line
            return Math.abs(p2.getY() - p0.getY()) < 2;
        }

        boolean underlines(TextPosition textPosition)
        {
            Matrix matrix = textPosition.getTextPos();
            // TODO: This is a very simplistic implementation only working for horizontal text without page rotation
            // and horizontal rectangular underlines with p0 at the left bottom and p2 at the right top

            // Check if rectangle horizontally matches (at least) the text
            if (p0.getX() > matrix.getXPosition() || p2.getX() < matrix.getXPosition() + textPosition.getWidth() - textPosition.getFontSizeInPt() / 10.0)
                return false;
            // Check whether rectangle vertically is at the right height to underline
            double vertDiff = p0.getY() - matrix.getYPosition();
            if (vertDiff > 0 || vertDiff < textPosition.getFont().getFontDescriptor().getDescent() * textPosition.getFontSizeInPt() / 500.0)
                return false;
            // Check whether rectangle is small enough to be a line
            return Math.abs(p2.getY() - p0.getY()) < 2;
        }

        final Point2D p0, p1, p2, p3;
    }

    final List<TransformedRectangle> rectangles = new ArrayList<>();
    Set<String> currentStyle = Collections.singleton("Undefined");
}

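(PDFStyledTextStripper.java)

In addition to what the PDFTextStripper does, this class also

collects rectangles from the content (defined using the re instruction) using an instance of the AppendRectangleToPath operator processor inner class,
determines the style of each text position in determineStyle, and
whenever the style changes, writes the new style to the output in writeString.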
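The rectangle tests in the class above assume that p0 is the bottom-left and p2 the top-right corner. After the re operands have been transformed by the current transformation matrix, the corners may arrive in a different order, for instance when the width or height operand is negative. The following is a minimal sketch, independent of PDFBox, of how the four transformed corners could be normalized into an axis-aligned box before testing. The class name RectangleNormalizer and its methods are hypothetical, and the thickness threshold of 2 units simply mirrors the value used in the class above.

```java
import java.awt.geom.Point2D;
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox or of the answer's code.
class RectangleNormalizer {

    /** Returns the axis-aligned bounding box of the corners, whatever their order. */
    static Rectangle2D boundsOf(Point2D... corners) {
        double minX = Double.POSITIVE_INFINITY, minY = Double.POSITIVE_INFINITY;
        double maxX = Double.NEGATIVE_INFINITY, maxY = Double.NEGATIVE_INFINITY;
        for (Point2D p : corners) {
            minX = Math.min(minX, p.getX());
            minY = Math.min(minY, p.getY());
            maxX = Math.max(maxX, p.getX());
            maxY = Math.max(maxY, p.getY());
        }
        return new Rectangle2D.Double(minX, minY, maxX - minX, maxY - minY);
    }

    /** Assumption: a box thinner than 2 units and wider than it is tall is treated as a line. */
    static boolean isThinHorizontalLine(Rectangle2D box) {
        return box.getHeight() < 2 && box.getWidth() > box.getHeight();
    }

    public static void main(String[] args) {
        // Corners deliberately listed starting at the top right
        Rectangle2D box = boundsOf(
                new Point2D.Double(200, 101), new Point2D.Double(100, 101),
                new Point2D.Double(100, 100), new Point2D.Double(200, 100));
        System.out.println(isThinHorizontalLine(box));
    }
}
```

With such a box, the horizontal and vertical checks could use getMinX(), getMaxX() and getMinY() instead of relying on p0 and p2. This still only covers unrotated, horizontal lines.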

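Beware: This is a mere proof of concept! In particular

the tests implemented in TransformedRectangle.underlines(TextPosition) and TransformedRectangle.strikesThrough(TextPosition) are very simplistic and only work for horizontal text without page rotation and for horizontal rectangular strike-throughs and underlines with p0 at the bottom left and p2 at the top right;
all rectangles are collected, without checking whether they are actually filled with a visible color;
the tests for "bold" and "italic" merely inspect the name of the font used, which is generally not sufficient.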
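As an illustration of the last point, the font name can be combined with entries of the PDF font descriptor (which the class above already accesses via getFontDescriptor()). The sketch below is independent of PDFBox and takes the raw values as parameters: the Flags entry, where the PDF specification defines bit 7 as Italic and bit 19 as ForceBold, the FontWeight entry, and the ItalicAngle entry. The class FontStyleHeuristics, its method and the threshold of 700 for "bold" are assumptions made for this example, and many PDFs leave these entries unset or set them incorrectly, so this is still a heuristic.

```java
import java.util.EnumSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of PDFBox or of the answer's code.
class FontStyleHeuristics {

    enum Style { BOLD, ITALIC }

    static final int FLAG_ITALIC = 1 << 6;      // font descriptor Flags, bit 7
    static final int FLAG_FORCE_BOLD = 1 << 18; // font descriptor Flags, bit 19

    /**
     * Guesses the styles of a font from its name and font descriptor values.
     * Assumption: fontWeight and italicAngle are 0 when the entries are absent.
     */
    static Set<Style> guessStyles(String fontName, int flags, float fontWeight, float italicAngle) {
        String name = fontName == null ? "" : fontName.toLowerCase(Locale.ROOT);
        Set<Style> styles = EnumSet.noneOf(Style.class);

        if (name.contains("bold") || (flags & FLAG_FORCE_BOLD) != 0 || fontWeight >= 700) {
            styles.add(Style.BOLD);
        }
        if (name.contains("italic") || name.contains("oblique")
                || (flags & FLAG_ITALIC) != 0 || italicAngle != 0) {
            styles.add(Style.ITALIC);
        }
        return styles;
    }

    public static void main(String[] args) {
        System.out.println(guessStyles("ABCDEF+Arial-BoldMT", 0, 0f, 0f));
        System.out.println(guessStyles("F1", FLAG_ITALIC | FLAG_FORCE_BOLD, 0f, 0f));
    }
}
```

Fonts that are made to look bold by other means, for example by drawing the text with an additional outline stroke, would still be missed by such a check.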

Test output

Using the PDFStyledTextStripper like this

String extractStyled(PDDocument document) throws IOException
{
    PDFTextStripper stripper = new PDFStyledTextStripper();
    stripper.setSortByPosition(true);
    return stripper.getText(document);
}

(From ExtractText.java, called from the test method testExtractStyledFromExampleDocument)

one gets the result

[]This is an example of plain text 

[Bold]This is an example of bold text 
[] 
[Underline]This is an example of underlined text[] 

[Italic]This is an example of italic text  
[] 
[StrikeThrough]This is an example of strike through text[]  

[Italic, Bold]This is an example of bold, italic text

for the OP's sample document.
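To decide which text is new and which is omitted, the marked-up output still has to be split into styled runs. The following is a minimal sketch of such a parser, assuming the marker format shown in the result above (the toString() form of the style set, such as [Bold] or [Italic, Bold], with [] for plain text). The class StyledRunParser and its Run type are hypothetical. Text that itself contains something like "[Bold]" would be misread, so a real implementation might prefer markers that cannot occur in the document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox or of the answer's code.
class StyledRunParser {

    /** A stretch of text together with the style marker that preceded it. */
    static final class Run {
        final String styles; // "" for plain text, otherwise e.g. "Italic, Bold"
        final String text;

        Run(String styles, String text) {
            this.styles = styles;
            this.text = text;
        }
    }

    private static final String NAME = "(?:Bold|Italic|Underline|StrikeThrough)";
    // Matches "[]" as well as "[Name]" and "[Name, Name, ...]" in any order
    private static final Pattern MARKER =
            Pattern.compile("\\[(" + NAME + "(?:, " + NAME + ")*)?\\]");

    static List<Run> parse(String extracted) {
        List<Run> runs = new ArrayList<>();
        Matcher m = MARKER.matcher(extracted);
        String styles = ""; // text before the first marker is treated as plain
        int start = 0;
        while (m.find()) {
            addRun(runs, styles, extracted.substring(start, m.start()));
            styles = m.group(1) == null ? "" : m.group(1);
            start = m.end();
        }
        addRun(runs, styles, extracted.substring(start));
        return runs;
    }

    // Whitespace-only runs (such as the line breaks between paragraphs) are dropped
    private static void addRun(List<Run> runs, String styles, String text) {
        String trimmed = text.trim();
        if (!trimmed.isEmpty()) {
            runs.add(new Run(styles, trimmed));
        }
    }

    public static void main(String[] args) {
        String sample = "[]Plain text \n[Bold]New text \n[] \n[StrikeThrough]Omitted text[]";
        for (Run run : parse(sample)) {
            System.out.println("{" + run.styles + "} " + run.text);
        }
    }
}
```

A caller could then treat runs whose styles contain Bold or Italic as new text and runs containing StrikeThrough or Underline as omitted text, following the convention described in the question.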

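PS: The code of PDFStyledTextStripper has meanwhile been changed slightly so that it also works for a sample document shared in a GitHub issue, in particular the code of its inner class {{1}}; cf. here.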
