Question

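I am using PDFBox to validate a PDF document. I need to check for the following kinds of text in the PDF:

Artificial bold style text
Artificial italic style text
Artificial outline style text

I searched the PDFBox API list but could not find an API for this.

Can anyone help me and explain how to determine the different kinds of artificial font/text styles present in a PDF using PDFBox?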

Answer 1

General procedure and a PDFBox issue

In theory, one should start by deriving a class from PDFTextStripper and overriding its method:

/**
 * Write a Java string to the output stream. The default implementation will ignore the <code>textPositions</code>
 * and just calls {@link #writeString(String)}.
 *
 * @param text The text to write to the stream.
 * @param textPositions The TextPositions belonging to the text.
 * @throws IOException If there is an error when writing the text.
 */
protected void writeString(String text, List<TextPosition> textPositions) throws IOException
{
    writeString(text);
}

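Your override should then use List<TextPosition> textPositions instead of String text; each TextPosition essentially represents a single letter, together with the graphics state information that was active when that letter was drawn.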

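Unfortunately, the textPositions list does not contain the correct contents in the current version 1.8.3. For example, for the line "This is normal text." from your PDF, the method writeString is called four times, once for each of the strings "This", "is", "normal", and "text." Each time, however, the textPositions list contains the TextPosition instances of the letters of the last string, "text."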

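As it turns out, this has already been recognized as PDFBox issue PDFBOX-1804, which has meanwhile been resolved as fixed for versions 1.8.4 and 2.0.0.

That being said, as soon as you have a fixed PDFBox version, you can check for the artificial styles as follows:

Artificial italic text

This text style is created in the page content like this: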

BT
/F0 1 Tf
24 0 5.10137 24 66 695.5877 Tm
0 Tr
[<03>]TJ
...

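The relevant part is the setting of the text matrix with Tm. The value 5.10137 is the factor by which the text is sheared.

When you inspect a TextPosition textPosition as indicated above, you can query this value using

textPosition.getTextPos().getValue(1, 0)

If this value is relevantly greater than 0.0, you have artificial italics. If it is relevantly less than 0.0, you have artificial backward italics.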
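The check described above can be sketched in plain Java, independent of PDFBox. The class name ShearCheck, its method names, and the tolerance of 0.05 are assumptions made for illustration. The sketch also assumes that the vertical scale of the same matrix is available (for the matrix above, c = 5.10137 and d = 24; in PDFBox, d would presumably come from getValue(1, 1)). Dividing by d makes the threshold independent of the font size.

```java
// Sketch: classify the shear component of a text matrix [a b c d e f].
// Class name, method names, and the tolerance are illustrative assumptions.
final class ShearCheck {
    enum Slant { UPRIGHT, ITALIC, BACKWARD_ITALIC }

    /** Shear relative to the vertical scale, e.g. 5.10137 / 24. */
    static double relativeShear(double c, double d) {
        return d == 0.0 ? 0.0 : c / d;
    }

    /** Values within +/- tolerance are treated as upright text. */
    static Slant classify(double c, double d, double tolerance) {
        double shear = relativeShear(c, d);
        if (shear > tolerance) {
            return Slant.ITALIC;
        }
        if (shear < -tolerance) {
            return Slant.BACKWARD_ITALIC;
        }
        return Slant.UPRIGHT;
    }

    /** Slant angle in degrees, measured from the vertical. */
    static double slantAngleDegrees(double c, double d) {
        return Math.toDegrees(Math.atan(relativeShear(c, d)));
    }

    public static void main(String[] args) {
        System.out.println(classify(5.10137, 24, 0.05));
        System.out.println(slantAngleDegrees(5.10137, 24));
        System.out.println(classify(0.0, 24, 0.05));
    }
}
```

For the sample matrix, the relative shear is about 0.2126, which corresponds to a slant of roughly 12 degrees. The tolerance is what "relevantly" greater or less than 0.0 means in practice and may need tuning for real documents.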

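Artificial bold or outline text

These artificial styles print each letter twice, using different rendering modes; e.g. the capital 'T', in the case of bold: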

0 0 0 1 k
...
BT
/F0 1 Tf 
24 0 0 24 66.36 729.86 Tm 
<03>Tj 
4 M 0.72 w 
0 0 Td 
1 Tr 
0 0 0 1 K
<03>Tj
ET

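(i.e. the letter is first drawn in regular mode, filling the letter area, and then drawn in outline mode, drawing a line along the letter border, both in black, CMYK 0, 0, 0, 1; this leaves the impression of a thicker letter.)

And in the case of outline: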

BT
/F0 1 Tf
24 0 0 24 66 661.75 Tm
0 0 0 0 k
<03>Tj
/GS1 gs
4 M 0.288 w 
0 0 Td
1 Tr
0 0 0 1 K
<03>Tj
ET

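(i.e. the letter is first drawn in regular mode in white, CMYK 0, 0, 0, 0, filling the letter area, and then drawn in outline mode, drawing a line along the letter border in black, CMYK 0, 0, 0, 1; this leaves the impression of a white letter with a black outline.)

Unfortunately, the PDFBox PDFTextStripper does not keep track of the text rendering mode. Furthermore, it explicitly drops duplicate character occurrences in approximately the same position. Thus, it is not up to the task of recognizing these artificial styles.

If you really need to do this, you would have to change TextPosition to also contain the rendering mode, PDFStreamEngine to add it to the generated TextPosition instances, and PDFTextStripper to not drop duplicate glyphs in processTextPosition.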

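Correction

I wrote

"Unfortunately, the PDFBox PDFTextStripper does not keep track of the text rendering mode."

This is not entirely true: you can find the current rendering mode using getGraphicsState().getTextState().getRenderingMode(). This means that during processTextPosition you do have the rendering mode available, and you can try to store the rendering mode (and color!) information for the given TextPosition somewhere, e.g. in some Map<TextPosition, ...>, for later use.

"Furthermore, it explicitly drops duplicate character occurrences in approximately the same position."

You can disable this feature by calling setSuppressDuplicateOverlappingText(false).

With these two changes you should be able to make the tests required to check for artificial bold and outline.

The latter change might not even be necessary if you store and check the styles early in processTextPosition.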
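The rendering mode is an integer, namely the operand of the Tr operator seen in the content streams above (0 for the filled pass, 1 for the stroked pass). As a minimal sketch, the following helper interprets that integer according to the text rendering modes defined in the PDF specification. The class RenderingModes and its methods are hypothetical names and not part of PDFBox.

```java
// Sketch: interpret the integer text rendering mode (operand of the Tr operator).
// RenderingModes and its methods are illustrative names, not PDFBox API.
final class RenderingModes {
    /** Modes 0, 2, 4 and 6 fill the glyph area. */
    static boolean fills(int mode) {
        return mode == 0 || mode == 2 || mode == 4 || mode == 6;
    }

    /** Modes 1, 2, 5 and 6 stroke the glyph border. */
    static boolean strokes(int mode) {
        return mode == 1 || mode == 2 || mode == 5 || mode == 6;
    }

    /** Human-readable description of a rendering mode. */
    static String describe(int mode) {
        switch (mode) {
            case 0: return "fill";
            case 1: return "stroke";
            case 2: return "fill, then stroke";
            case 3: return "invisible";
            case 4: return "fill and add to clipping path";
            case 5: return "stroke and add to clipping path";
            case 6: return "fill, then stroke, and add to clipping path";
            case 7: return "add to clipping path";
            default: return "unknown";
        }
    }

    public static void main(String[] args) {
        for (int mode = 0; mode <= 7; mode++) {
            System.out.println(mode + ": " + describe(mode));
        }
    }
}
```

A filled glyph followed by a stroked glyph at the same position is the pattern to look for when testing for artificial bold or outline text.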

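How to retrieve rendering mode and color

As mentioned in the correction, it is indeed possible to retrieve rendering mode and color information by collecting that information in a processTextPosition override.

To this the OP commented that

"always the stroking and non-stroking color is coming as Black"

This was a bit surprising at first, but after looking at PDFTextStripper.properties (from which the operators supported during text extraction are initialized), the reason became clear: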

# The following operators are not relevant to text extraction,
# so we can silently ignore them.
...
K
k

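Thus, color setting operators (especially the CMYK color operators used in the present document) are ignored in this context! Fortunately, the implementations of these operators for the PageDrawer can be used in this context, too.

Thus, the following proof of concept shows how all the required information can be retrieved.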

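// Imports use the PDFBox 1.8.x package names.
import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.pdfbox.pdmodel.graphics.color.PDColorState;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

public class TextWithStateStripperSimple extends PDFTextStripper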
{
    public TextWithStateStripperSimple() throws IOException {
        super();
        setSuppressDuplicateOverlappingText(false);
        registerOperatorProcessor("K", new org.apache.pdfbox.util.operator.SetStrokingCMYKColor());
        registerOperatorProcessor("k", new org.apache.pdfbox.util.operator.SetNonStrokingCMYKColor());
    }

    @Override
    protected void processTextPosition(TextPosition text)
    {
        renderingMode.put(text, getGraphicsState().getTextState().getRenderingMode());
        strokingColor.put(text, getGraphicsState().getStrokingColor());
        nonStrokingColor.put(text, getGraphicsState().getNonStrokingColor());

        super.processTextPosition(text);
    }

    Map<TextPosition, Integer> renderingMode = new HashMap<TextPosition, Integer>();
    Map<TextPosition, PDColorState> strokingColor = new HashMap<TextPosition, PDColorState>();
    Map<TextPosition, PDColorState> nonStrokingColor = new HashMap<TextPosition, PDColorState>();

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        writeString(text + '\n');

        for (TextPosition textPosition: textPositions)
        {
            StringBuilder textBuilder = new StringBuilder();
            textBuilder.append(textPosition.getCharacter())
                       .append(" - shear by ")
                       .append(textPosition.getTextPos().getValue(1, 0))
                       .append(" - ")
                       .append(textPosition.getX())
                       .append(" ")
                       .append(textPosition.getY())
                       .append(" - ")
                       .append(renderingMode.get(textPosition))
                       .append(" - ")
                       .append(toString(strokingColor.get(textPosition)))
                       .append(" - ")
                       .append(toString(nonStrokingColor.get(textPosition)))
                       .append('\n');
            writeString(textBuilder.toString());
        }
    }

    String toString(PDColorState colorState)
    {
        if (colorState == null)
            return "null";
        StringBuilder builder = new StringBuilder();
        for (float f: colorState.getColorSpaceValue())
        {
            builder.append(' ')
                   .append(f);
        }

        return builder.toString();
    }
}

Using this, you get for the period '.' in normal text:

. - shear by 0.0 - 256.5701 88.6875 - 0 -  0.0 0.0 0.0 1.0 -  0.0 0.0 0.0 1.0

In artificial bold text you get:

. - shear by 0.0 - 378.86 122.140015 - 0 -  0.0 0.0 0.0 1.0 -  0.0 0.0 0.0 1.0
. - shear by 0.0 - 378.86002 122.140015 - 1 -  0.0 0.0 0.0 1.0 -  0.0 0.0 0.0 1.0

In artificial italics:

. - shear by 5.10137 - 327.121 156.4123 - 0 -  0.0 0.0 0.0 1.0 -  0.0 0.0 0.0 1.0

And in artificial outline:

. - shear by 0.0 - 357.25 190.25 - 0 -  0.0 0.0 0.0 1.0 -  0.0 0.0 0.0 0.0
. - shear by 0.0 - 357.25 190.25 - 1 -  0.0 0.0 0.0 1.0 -  0.0 0.0 0.0 0.0

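So there you have all the information required to recognize these artificial styles. Now you merely have to analyze the data.

BTW, have a look at the artificial bold case: the coordinates may not always be identical but merely very similar. Thus, some leniency is required when testing whether two text position objects describe the same position.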
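One possible way to analyze the dumped data is sketched below in plain Java. It assumes that the two passes of a double-printed letter arrive as consecutive records, and it uses a simple heuristic: if the fill color of the first pass equals the stroke color of the second pass, the letter is treated as bold, otherwise as outline. The Glyph holder, the class name, and the position tolerance of 0.01 are hypothetical and only mirror the columns of the dump above.

```java
import java.util.Arrays;

// Sketch: decide whether two consecutive dump records form an artificial
// bold or outline glyph. All names are hypothetical; Glyph is not a PDFBox class.
final class ArtificialStyleAnalyzer {
    enum Style { NORMAL, BOLD, OUTLINE }

    /** One record of the dump: character, position, rendering mode, colors. */
    static final class Glyph {
        final String character;
        final float x;
        final float y;
        final int renderingMode;
        final float[] strokingColor;
        final float[] nonStrokingColor;

        Glyph(String character, float x, float y, int renderingMode,
              float[] strokingColor, float[] nonStrokingColor) {
            this.character = character;
            this.x = x;
            this.y = y;
            this.renderingMode = renderingMode;
            this.strokingColor = strokingColor;
            this.nonStrokingColor = nonStrokingColor;
        }
    }

    /** Lenient position comparison, since coordinates may differ slightly. */
    static boolean samePosition(Glyph a, Glyph b, float tolerance) {
        return Math.abs(a.x - b.x) <= tolerance && Math.abs(a.y - b.y) <= tolerance;
    }

    /** Classifies a glyph given the record that follows it (may be null). */
    static Style classify(Glyph first, Glyph second, float tolerance) {
        boolean doublePrinted = second != null
                && first.character.equals(second.character)
                && first.renderingMode == 0   // filled pass
                && second.renderingMode == 1  // stroked pass
                && samePosition(first, second, tolerance);
        if (!doublePrinted) {
            return Style.NORMAL;
        }
        // Heuristic: same fill and stroke color thickens the letter (bold);
        // a differing fill color leaves a visible outline.
        return Arrays.equals(first.nonStrokingColor, second.strokingColor)
                ? Style.BOLD : Style.OUTLINE;
    }

    public static void main(String[] args) {
        float[] black = {0f, 0f, 0f, 1f};
        float[] white = {0f, 0f, 0f, 0f};
        Glyph boldFill = new Glyph(".", 378.86f, 122.140015f, 0, black, black);
        Glyph boldStroke = new Glyph(".", 378.86002f, 122.140015f, 1, black, black);
        Glyph outlineFill = new Glyph(".", 357.25f, 190.25f, 0, black, white);
        Glyph outlineStroke = new Glyph(".", 357.25f, 190.25f, 1, black, white);
        System.out.println(classify(boldFill, boldStroke, 0.01f));
        System.out.println(classify(outlineFill, outlineStroke, 0.01f));
        System.out.println(classify(boldFill, null, 0.01f));
    }
}
```

The sample values in main are taken from the dump above. The color comparison is only a heuristic for this document's CMYK colors; other color spaces or fill colors may call for a different rule.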

Answer 2

My solution to this problem was to create a new class that extends the PDFTextStripper class and overrides the function:

getCharactersByArticle()

Note: PDFBox version 1.8.5

CustomPDFTextStripper class

public class CustomPDFTextStripper extends PDFTextStripper
{
    public CustomPDFTextStripper() throws IOException {
        super();
    }

    public Vector<List<TextPosition>> getCharactersByArticle() {
        return charactersByArticle;
    }
}

This way I can parse the PDF document and then get the TextPosition objects from a custom extraction function:

 private void extractTextPosition() throws FileNotFoundException, IOException {

    PDFParser parser = new PDFParser(new FileInputStream(pdf));
    parser.parse();
    StringWriter outString = new StringWriter();
    CustomPDFTextStripper stripper = new CustomPDFTextStripper();
    stripper.writeText(parser.getPDDocument(), outString);
    Vector<List<TextPosition>> vectorlistoftps = stripper.getCharactersByArticle();
    for (int i = 0; i < vectorlistoftps.size(); i++) {
        List<TextPosition> tplist = vectorlistoftps.get(i);
        for (int j = 0; j < tplist.size(); j++) {
            TextPosition text = tplist.get(j);
            System.out.println(" String "
          + "[x: " + text.getXDirAdj() + ", y: "
          + text.getY() + ", height:" + text.getHeightDir()
          + ", space: " + text.getWidthOfSpace() + ", width: "
          + text.getWidthDirAdj() + ", yScale: " + text.getYScale() + "]"
          + text.getCharacter());
        }       
    }
}

The TextPosition objects contain a lot of information about the characters of the PDF document.

OUTPUT:

 String [x: 168.24, y: 64.15997, height:6.061287, space: 8.9664, width: 3.4879303, yScale: 8.9664]J

 String [x: 171.69745, y: 64.15997, height:6.061287, space: 8.9664, width: 2.2416077, yScale: 8.9664]N

 String [x: 176.25777, y: 64.15997, height:6.0343876, space: 8.9664, width: 6.4737396, yScale: 8.9664]N

 String [x: 182.73778, y: 64.15997, height:4.214208, space: 8.9664, width: 3.981079, yScale: 8.9664]e .....

How to determine artificial bold style, artificial italic style and artificial outline style of text using PDFBox
