Question

When I try to extract text from a PDF with pdfminer, I get a ValueError; the full message is quoted below.

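There seems to be one unrecognized character, and it raises an error before the rest of the text can be extracted. Its integer value is outside the range that unichr() accepts (0x110000 and above). Most errors of this kind are caused by narrow Python builds, but that is not the case here.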

The error seems to occur in pdfminer's name2unicode function:

ValueError: unichr() arg not in range(0x110000) (wide Python build)

I have found the offending character. Its Unicode integer is far above the valid range, and I have not found a corresponding symbol.
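To illustrate the failure mode, here is a minimal sketch of what presumably happens: the digits in a glyph name are parsed as an integer and passed to unichr() (chr() on Python 3), which rejects any value of 0x110000 or more. The helper naive_name2unicode and the glyph name 'g1234567' are made up for illustration; they are not taken from pdfminer or from the PDF in question.

```python
import re

STRIP_NAME = re.compile(r'[0-9]+')

try:
    unichr  # Python 2
except NameError:
    unichr = chr  # Python 3


def naive_name2unicode(name):
    """Hypothetical stand-in: turn the digits in a glyph name into a character."""
    m = STRIP_NAME.search(name)
    if not m:
        raise KeyError(name)
    # No range check here, so a large number raises ValueError.
    return unichr(int(m.group(0)))


def describe(name):
    """Report whether a glyph name converts cleanly or is out of range."""
    try:
        return 'ok: %r' % naive_name2unicode(name)
    except ValueError as exc:
        return 'out of range: %s' % exc


print(describe('g65'))
print(describe('g1234567'))
```

The second call falls into the ValueError branch because 1234567 is larger than 0x110000 (1114112).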

Answer 1

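pdfminer is set up to skip key errors (the calling function wraps the call in try/except and passes on KeyError), but it does not catch the error raised when the value is out of range. You can work around this by replacing the original function, as follows: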

import re
import pdfminer.encodingdb
from pdfminer.glyphlist import glyphname2unicode

try:
    unichr  # Python 2
except NameError:
    unichr = chr  # Python 3

STRIP_NAME = re.compile(r'[0-9]+')
MAX_CODEPOINT = 0x110000  # first value that unichr()/chr() rejects

def edit_name2unicode(name):
    """Converts Adobe glyph names to Unicode characters."""
    if name in glyphname2unicode:
        return glyphname2unicode[name]
    m = STRIP_NAME.search(name)
    # Raise KeyError, which the caller already skips, for names without
    # digits and for numbers outside the valid Unicode range.
    if not m or int(m.group(0)) >= MAX_CODEPOINT:
        raise KeyError(name)
    return unichr(int(m.group(0)))

# Swap in the patched function once pdfminer has been imported.
pdfminer.encodingdb.name2unicode = edit_name2unicode
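Why raising KeyError helps can be sketched without pdfminer at all. The loop in build_mapping below is a simplified, hypothetical stand-in for the calling code described above (try, except KeyError, pass); it is not pdfminer's actual implementation, and the glyph names are invented.

```python
import re

STRIP_NAME = re.compile(r'[0-9]+')
MAX_CODEPOINT = 0x110000  # first value that unichr()/chr() rejects

try:
    unichr  # Python 2
except NameError:
    unichr = chr  # Python 3


def safe_name2unicode(name):
    """Range-checked conversion that signals every failure as KeyError."""
    m = STRIP_NAME.search(name)
    if not m or int(m.group(0)) >= MAX_CODEPOINT:
        raise KeyError(name)
    return unichr(int(m.group(0)))


def build_mapping(names):
    """Hypothetical caller: map character ids to text, skipping bad names."""
    mapping = {}
    for cid, name in enumerate(names):
        try:
            mapping[cid] = safe_name2unicode(name)
        except KeyError:
            pass  # unknown or out-of-range glyph: skip it and carry on
    return mapping


print(build_mapping(['g65', 'g1234567', 'g66']))
```

Because the out-of-range name now raises KeyError instead of ValueError, the loop drops that one glyph and still maps the others.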

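Finally, note that the old function has to be replaced with the new one after pdfminer has been imported, and that the replacement then applies to the whole script. This is a runtime workaround. For a process that has to run many times, I changed the pdfminer source instead, especially since pdfminer does not have a clean class structure that you can easily inherit from and override.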

However, if the key errors occur for characters you want to keep, you can add them to the pypdf glyphlist or add another character-set encoding here.
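As a rough sketch of adding a glyph, assuming the glyph list meant here is the glyphname2unicode dict imported in the code above: the glyph name 'mybullet' and the helper register_glyph are hypothetical, and the fallback dict exists only so the sketch runs when pdfminer is not installed.

```python
try:
    from pdfminer.glyphlist import glyphname2unicode
except ImportError:
    glyphname2unicode = {}  # stand-in so the sketch runs without pdfminer


def register_glyph(name, char, table=glyphname2unicode):
    """Add a glyph name to the lookup table unless it is already defined."""
    table.setdefault(name, char)
    return table[name]


# 'mybullet' is a made-up glyph name mapped to U+2022 (bullet).
register_glyph('mybullet', u'\u2022')
```

Names registered this way are found by the first lookup in the patched function, so they never reach the numeric fallback.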

PDFminer: out-of-range error on an unrecognized character
