Question

I am using pdfminer with Python 3, and strange letters appear in the text recovered from the PDF.

For example, I get signiﬁcant instead of significant (note that the letters f and i are merged into one).

I don't know why this happens. This is the code I'm using.

My only guess so far is that it has something to do with encoding, but it seems there is no way to retrieve the encoding of a pdf.

Answer 1

PDFminer is working correctly. The character in question is the Unicode character U+FB01, the fi ligature (LATIN SMALL LIGATURE FI).
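To check which unexpected characters are actually present in the extracted text, something like the following sketch may help. It uses only the standard `unicodedata` module; the helper name `describe_non_ascii` is made up for this illustration and is not part of pdfminer.

```python
import unicodedata

def describe_non_ascii(text):
    """Return (char, code point, Unicode name) for each distinct non-ASCII character."""
    seen = []
    for ch in text:
        # Keep the first occurrence of every character outside the ASCII range
        if ord(ch) > 127 and ch not in seen:
            seen.append(ch)
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in seen]

for row in describe_non_ascii("signi\ufb01cant"):
    print(row)  # ('ﬁ', 'U+FB01', 'LATIN SMALL LIGATURE FI')
```

Running this over a page of extracted text shows whether the oddities are ligatures or something else entirely.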

Add a line to your code that replaces the ﬁ ligature with the two letters fi:

for s in sentences:
    s = s.replace('ﬁ', 'fi')
    print(s)

Unicode defines another very common, purely typographic (*) ligature: U+FB02, the fl ligature. Treat it the same way:

    s = s.replace('ﬂ', 'fl')

There are a few more in the Alphabetic Presentation Forms block, which you may want to include as well.

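(*) Do not make the mistake of also changing æ to ae and œ to oe. Those are not "purely typographic ligatures" but valid characters in their own right.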
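If you would rather handle all the Latin ligatures of that block at once instead of chaining `replace` calls, one possible sketch is below. It assumes the ligatures of interest are U+FB00 through U+FB06 and builds the replacements from their Unicode compatibility decompositions; `LIGATURES` and `expand_ligatures` are names invented for this example. Characters such as æ and œ have no such decomposition and fall outside the range, so they are left alone.

```python
import unicodedata

# Latin ligatures U+FB00..U+FB06 (ff, fi, fl, ffi, ffl, long s-t, st),
# each mapped to its compatibility decomposition, e.g. U+FB01 -> "fi".
LIGATURES = {
    cp: unicodedata.normalize("NFKD", chr(cp))
    for cp in range(0xFB00, 0xFB07)
}

def expand_ligatures(text):
    """Replace typographic Latin ligatures with their plain-letter equivalents."""
    return text.translate(LIGATURES)

print(expand_ligatures("signi\ufb01cant work\ufb02ow, but \u00e6 stays"))
# significant workflow, but æ stays
```

Limiting the table to this range, rather than normalizing the whole string with NFKC, avoids touching unrelated characters that also have compatibility mappings.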