Question

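I'm trying to read a batch of PDFs and just extract the text. For about half of the sample streams that use FlateDecode, I simply call gzuncompress and get something I can parse to pull out the text: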

Tw [(remains an unresolved theoretical and pragmatic conundr)]

But other times, after gzuncompress(), I get something like this:

TD [(\002\016\032)-233.5 (\017\004\t/+\013\r\016\013\004\024\f)-233.5 
    (\b\002\017\004\032)-233.5 (\004;\024\t\002\016\002\f\n\r\016\017)-233.4
    (\r/)-233.5 (\013\022\002\023\n\017 \002\f\n\013)-233.4
    (\t\004\002\032\004\023\017\022\n\024)-233.5 (1\004\020\003\020\033)-233.5
    (\001\022\002 \n\023)]TJ

I'm fairly sure this is text, because I can't get any other text out of the PDF, and it sits inside a BT ... ET block.

What is this second format, and how do I convert it into something readable?

Answer 1

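For each piece of text in the file, you need to find the font it is drawn with and the CMap attached to that font (the stream referenced by the font's /ToUnicode entry). The CMap looks like this: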

16 0 obj 
    << /Length 433 >>
    stream 
    /CIDInit /ProcSet findresource begin 
    12 dict begin 
    begincmap 
    /CIDSystemInfo 
    << /Registry (Adobe) 
    /Ordering (UCS) 
    /Supplement 0 
    >> def 
    /CMapName /Adobe-Identity-UCS def 
    /CMapType 2 def 
    1 begincodespacerange 
    <0000> <FFFF> 
    endcodespacerange 
    2 beginbfrange 
    <0000> <005E> <0020> 
    <005F> <0061> [<00660066> <00660069> <00660066006C>] 
    endbfrange 
    1 beginbfchar 
    <3A51> <D840DC3E> 
    endbfchar 
    endcmap 
    CMapName currentdict /CMap defineresource pop 
    end 
    end 
    endstream 
endobj
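As a rough sketch of how such a CMap could be read programmatically (Python here; `zlib.decompress` is the counterpart of gzuncompress if the CMap stream is itself Flate-compressed), the function below collects the bfchar and bfrange entries into a dictionary. The name `parse_cmap`, the `SAMPLE_CMAP` constant and the regex-based approach are my own assumptions for illustration. This is not a full CMap parser: it ignores the codespace ranges and anything other than bfchar/bfrange.

```python
import re

# The bfrange/bfchar part of the sample CMap above (already decompressed).
SAMPLE_CMAP = """
2 beginbfrange
<0000> <005E> <0020>
<005F> <0061> [<00660066> <00660069> <00660066006C>]
endbfrange
1 beginbfchar
<3A51> <D840DC3E>
endbfchar
"""

HEX = r"<([0-9A-Fa-f]+)>"


def _utf16(hex_str):
    # Destination values in such a CMap are UTF-16BE, written as hex digits.
    return bytes.fromhex(hex_str).decode("utf-16-be")


def parse_cmap(cmap_text):
    """Return {character code: Unicode string} (hypothetical helper)."""
    mapping = {}

    # bfchar: one <source code> <destination> pair per entry.
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, block):
            mapping[int(src, 16)] = _utf16(dst)

    # bfrange: <lo> <hi> followed by either a start value or an array.
    entry = re.compile(
        HEX + r"\s*" + HEX + r"\s*(?:" + HEX + r"|\[(.*?)\])", re.S
    )
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, start, array in entry.findall(block):
            codes = range(int(lo, 16), int(hi, 16) + 1)
            if start:
                # Consecutive codes map to consecutive destination values.
                base = int(start, 16)
                for i, code in enumerate(codes):
                    mapping[code] = _utf16(f"{base + i:0{len(start)}X}")
            else:
                # One explicit destination per code in the range.
                for code, dst in zip(codes, re.findall(HEX, array)):
                    mapping[code] = _utf16(dst)
    return mapping


if __name__ == "__main__":
    table = parse_cmap(SAMPLE_CMAP)
    print(table[0x5F], table[0x60], table[0x61])
```

A dictionary keyed by integer code keeps the later lookup trivial, at the cost of expanding every range up front, which is fine for CMaps of this size.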

Let's lay this example out as a table:

+-----------+----------+----------+----------------------+--------------+
| code, hex | as ASCII | as octal | maps to (bfrange)    | rendered as  |
+-----------+----------+----------+----------------------+--------------+
| <5f>      | _        | \137     | U+0066 U+0066        | ff           |
| <60>      | `        | \140     | U+0066 U+0069        | fi           |
| <61>      | a        | \141     | U+0066 U+0066 U+006c | ffl          |
+-----------+----------+----------+----------------------+--------------+

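So if a string is shown in a font that uses this CMap, it decodes as follows (strictly, the <0000> <FFFF> codespace range makes each code two bytes wide, so a real content stream would carry a \000 byte before each code shown here):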

TD[(\137\140\141)]TJ === fffiffl
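The lookup itself can be sketched in a few lines of Python. Everything named here (`tj_strings`, `unescape_pdf_string`, `decode_text`, the `code_width` parameter and the hard-coded `LIGATURES` table) is a hypothetical helper for illustration, and the string handling is simplified: it does not cope with nested unescaped parentheses or backslash line continuations.

```python
import re

# Only the three array entries from the sample CMap, keyed by character code.
LIGATURES = {0x5F: "ff", 0x60: "fi", 0x61: "ffl"}

_ESCAPES = {"n": "\n", "r": "\r", "t": "\t", "b": "\b", "f": "\f",
            "(": "(", ")": ")", "\\": "\\"}


def tj_strings(operand):
    # Pull the (...) literal strings out of a Tj/TJ operand, skipping the
    # kerning numbers that sit between them in a TJ array.
    return re.findall(r"\(((?:\\.|[^\\()])*)\)", operand, re.S)


def unescape_pdf_string(body):
    # Resolve backslash escapes inside a literal string and return raw bytes.
    # An octal escape has one to three digits and always yields one byte.
    def repl(match):
        token = match.group(1)
        if token[0] in "01234567":
            return chr(int(token, 8) & 0xFF)
        return _ESCAPES[token]

    text = re.sub(r"\\([0-7]{1,3}|[nrtbf()\\])", repl, body)
    return text.encode("latin-1")


def decode_text(raw, mapping, code_width=1):
    # Split the bytes into fixed-width codes and look each one up;
    # unknown codes become U+FFFD so gaps stay visible.
    out = []
    for i in range(0, len(raw), code_width):
        code = int.from_bytes(raw[i:i + code_width], "big")
        out.append(mapping.get(code, "\ufffd"))
    return "".join(out)


if __name__ == "__main__":
    for body in tj_strings(r"TD[(\137\140\141)]TJ"):
        print(decode_text(unescape_pdf_string(body), LIGATURES))  # fffiffl
```

Passing `code_width=2` handles the strict two-byte form of the same codes, for example `(\000\137\000\140\000\141)`.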

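This sample CMap also contains one substitution for a single character:

+-----------+----------+--------------------+-------------+
| code, hex | as octal | maps to UTF-16BE   | code point  |
+-----------+----------+--------------------+-------------+
| <3a51>    | \072\121 | <D840DC3E>         | U+2003E     |
+-----------+----------+--------------------+-------------+

A PDF octal escape holds at most three digits, i.e. one byte, so the two-byte code is written as two escapes. With this substitution, TD[(\072\121)]TJ === the single character U+2003E.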
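The step from the UTF-16BE value to the code point is plain surrogate-pair arithmetic. A small Python check (the function name `surrogate_pair_to_code_point` is just an illustrative choice):

```python
def surrogate_pair_to_code_point(high, low):
    # Standard UTF-16 rule: each surrogate carries 10 bits of the
    # code point's offset above U+10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    cp = surrogate_pair_to_code_point(0xD840, 0xDC3E)
    print(f"U+{cp:X}")  # U+2003E
    # The built-in codec gives the same single character.
    print(bytes.fromhex("D840DC3E").decode("utf-16-be") == chr(cp))  # True
```

In practice the codec call alone is enough; the manual formula is only there to show where U+2003E comes from.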

Reference:

PDF Reference, sixth edition: Adobe Portable Document Format version 1.7, November 2006
