Why does this code fail to extract the unicode text in a PDF correctly?

Asked: 2018-06-05 02:26:53

Tags: python pdf unicode encoding pdftotext

I want to extract the text contained in a PDF. This is my code:

import textract

doc = textract.process(r"C:\path\to\the\downloaded.pdf", encoding='raw_unicode_escape')
f = open('pdf_to_text.txt', 'wb')
f.write(doc)
f.close()

This is the output:

\u53cb\u90a6\u4fdd\u96aa\u63a7\u80a1\u6709\u9650\u516c\u53f8

REAL LIFE
REAL IMPACT
A NNUA L REP ORT 2015

STOCK CODE : 1299

VISION & PURPOSE
Our Vision is to be the pre-eminent life
insurance provider in the Asia-Pacific region.
That is our service to our customers and
our shareholders.
Our Purpose is to play a leadership role in
driving economic and social development
across the region. That is our service to
societies and their people.

ABOUT AIA
AIA Group Limited and its subsidiaries (collectively \u201cAIA\u201d
or the \u201cGroup\u201d) comprise the largest independent publicly
listed pan-Asian life insurance group. It has a presence in
18 markets in Asia-Paci\ufb01c \u2013 wholly-owned branches and
subsidiaries in Hong Kong, Thailand, Singapore, Malaysia,
China, Korea, the Philippines, Australia, Indonesia, Taiwan,
... ...
... ...
... ...

As you can see, it gets some of the "fancy" text (unicode? ascii?) right, but not all of it. How do I fix this?

I have tried five encoding schemes: utf-8 produces bad results; utf-16 produces the worst, turning everything into illegible text; ascii does passably but leaves some characters behind; unicode_escape gives middling results, leaving quite a few characters; and raw_unicode_escape also does well but, like ascii, still leaves a few characters behind.
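As an aside, the literal `\uXXXX` sequences in the output above are not mojibake but escape text, which Python's `unicode_escape` codec can fold back into real characters. A minimal sketch, using the bytes from the first output line above (note the caveat in the comment: this codec assumes the input is otherwise Latin-1/ASCII, so it does not work on genuine UTF-8 bytes):

```python
# First line of the output, as raw bytes containing literal \uXXXX escapes
raw = b'\\u53cb\\u90a6\\u4fdd\\u96aa\\u63a7\\u80a1\\u6709\\u9650\\u516c\\u53f8'

# unicode_escape interprets the backslash sequences as code points.
# Caveat: it decodes the surrounding bytes as Latin-1, so it is only
# safe on ASCII text that carries \uXXXX escapes, not on real UTF-8.
text = raw.decode('unicode_escape')
print(text)  # 友邦保險控股有限公司
```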

Here is the link to the PDF I downloaded to my local drive for analysis:

https://www.aia.com/content/dam/group/en/docs/annual-report/aia-annual-report-2015-eng.pdf

P.S. Another small, unrelated problem: it sometimes leaves gaps between the letters of a word, e.g. A NNUA L REP ORT in the text snippet above. How can I fix that?

Edit: I found the possible encoding options on pages 10 and 11 of textract's documentation. But there are nearly a hundred of them:

Possible choices: aliases, ascii, base64_codec, big5, big5hkscs,
bz2_codec, charmap, cp037, cp1006, cp1026, cp1140, cp1250, cp1251,
cp1252, cp1253, cp1254, cp1255, cp1256, cp1257, cp1258, cp424,
cp437, cp500, cp720, cp737, cp775, cp850, cp852, cp855, cp856,
cp857, cp858, cp860, cp861, cp862, cp863, cp864, cp865, cp866,
cp869, cp874, cp875, cp932, cp949, cp950, euc_jis_2004, euc_jisx0213,
euc_jp, euc_kr, gb18030, gb2312, gbk, hex_codec, hp_roman8, hz,
idna, iso2022_jp, iso2022_jp_1, iso2022_jp_2, iso2022_jp_2004,
iso2022_jp_3, iso2022_jp_ext, iso2022_kr, iso8859_1, iso8859_10,
iso8859_11, iso8859_13, iso8859_14, iso8859_15, iso8859_16,
iso8859_2, iso8859_3, iso8859_4, iso8859_5, iso8859_6, iso8859_7,
iso8859_8, iso8859_9, johab, koi8_r, koi8_u, latin_1, mac_arabic,
mac_centeuro, mac_croatian, mac_cyrillic, mac_farsi, mac_greek,
mac_iceland, mac_latin2, mac_roman, mac_romanian, mac_turkish,
mbcs, palmos, ptcp154, punycode, quopri_codec, raw_unicode_escape,
rot_13, shift_jis, shift_jis_2004, shift_jisx0213, string_escape, tactis,
tis_620, undefined, unicode_escape, unicode_internal, utf_16, utf_16_be,
utf_16_le, utf_32, utf_32_be, utf_32_le, utf_7, utf_8, utf_8_sig, uu_codec,
zlib_codec

How do I determine which of these this particular PDF uses? And what if even the best one still leaves a few characters behind? Or is there one of these encoding schemes that is guaranteed not to leave a single illegible character?
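There is no codec stored in a PDF for the extracted text as a whole (text encoding is per-font inside the PDF), but one crude way to narrow the list down is to score candidate codecs by how many undecodable spots each leaves. This is a hedged, illustrative sketch on made-up sample bytes, not a guaranteed detector:

```python
# errors='replace' substitutes U+FFFD for every byte (or byte pair) a codec
# cannot decode, so counting that character gives a rough "badness" score.
# Caveat: codecs such as latin_1 never fail, so a zero count is necessary
# but not sufficient evidence of the right encoding.
raw = b'REAL LIFE \xe5\x8f\x8b\xe9\x82\xa6'  # hypothetical sample (UTF-8 bytes)

candidates = ['utf_8', 'cp1252', 'ascii', 'utf_16']
scores = {c: raw.decode(c, errors='replace').count('\ufffd') for c in candidates}
print(sorted(scores.items(), key=lambda kv: kv[1]))
```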

1 answer:

Answer 0 (score: 0)

This is how I solved it. I used the removegarbage function I found here to replace all non-alphanumeric characters.

import re

def removegarbage(text):
    # Replace each run of non-word (non-alphanumeric) chars with a space
    text = re.sub(r'\W+', ' ', text)
    return text.lower()

doc = removegarbage(doc.decode('raw_unicode_escape'))
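A quick check of what the substitution does, on a short made-up sample string. Note that `\W` is Unicode-aware on Python 3 `str` input, so CJK characters count as word characters and survive the cleanup, which is why 友邦保險控股有限公司 appears intact in the output below:

```python
import re

def removegarbage(text):
    # Collapse each run of non-word characters into a single space
    return re.sub(r'\W+', ' ', text).lower()

sample = 'AIA Group Limited (\u201cAIA\u201d) \u2013 友邦保險!'
print(removegarbage(sample))  # quotes, dashes and '!' become spaces
```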

If you open the txt file in a basic text editor like Notepad, you will still see those illegible characters. But if you print it in the console (or perhaps open it in a more advanced text editor?), you will see that those characters are gone:

>>>print(doc)
'aia group limited 友邦保險控股有限公司 real life real impact a nnua l rep ort 
2015 stock code 1299 vision purpose our vision is to be the pre eminent life 
insurance provider in the asia pacific region that is our service to our 
customers and our shareholders our purpose is to play a leadership role in 
driving economic and social development across the region that is our 
service to societies and their people about aia aia group limited and its 
subsidiaries collectively aia or the group comprise the largest independent 
publicly listed pan asian life insurance group it has a presence in 18 
markets in asia pacific wholly owned branches and subsidiaries in hong kong 
thailand singapore malaysia china korea the philippines australia indonesia 
taiwan ... ... ...

Yes, the punctuation and uppercase letters are gone too, but that's fine, because punctuation and capitalization don't matter for what I intend to do with the extracted text.