I want to extract the text contained in a PDF. Here is my code:
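import textract

# process() returns bytes encoded with the requested codec
doc = textract.process(r"C:\path\to\the\downloaded.pdf", encoding='raw_unicode_escape')
# Write the bytes unchanged; the with-block closes the file afterwards
with open('pdf_to_text.txt', 'wb') as f:
    f.write(doc)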
Here is the output:
\u53cb\u90a6\u4fdd\u96aa\u63a7\u80a1\u6709\u9650\u516c\u53f8
REAL LIFE
REAL IMPACT
A NNUA L REP ORT 2015
STOCK CODE : 1299
VISION & PURPOSE
Our Vision is to be the pre-eminent life
insurance provider in the Asia-Pacific region.
That is our service to our customers and
our shareholders.
Our Purpose is to play a leadership role in
driving economic and social development
across the region. That is our service to
societies and their people.
ABOUT AIA
AIA Group Limited and its subsidiaries (collectively \u201cAIA\u201d
or the \u201cGroup\u201d) comprise the largest independent publicly
listed pan-Asian life insurance group. It has a presence in
18 markets in Asia-Paci\ufb01c \u2013 wholly-owned branches and
subsidiaries in Hong Kong, Thailand, Singapore, Malaysia,
China, Korea, the Philippines, Australia, Indonesia, Taiwan,
... ...
... ...
... ...
As you can see, it reads some of the "fancy" text (Unicode? ASCII?) correctly, but not all of it. How do I fix this?
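I tried 5 encoding schemes: utf-8 gives bad results; utf-16 gives the worst results, turning everything into gibberish; ascii gives not-so-bad results but leaves some characters unconverted; unicode_escape gives average results, leaving quite a few characters; and raw_unicode_escape gives good results but, like ascii, still leaves some characters.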
Here is the link to the PDF, which I downloaded to my local drive for analysis:
https://www.aia.com/content/dam/group/en/docs/annual-report/aia-annual-report-2015-eng.pdf
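P.S. Another small, unrelated problem is that it sometimes leaves gaps between the letters of a word, e.g. A NNUA L REP ORT in the text snippet above. How can this be fixed?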
Edit: I found the possible encoding options on pages 10 and 11 of textract's documentation. But there are nearly a hundred of them:
Possible choices: aliases, ascii, base64_codec, big5, big5hkscs,
bz2_codec, charmap, cp037, cp1006, cp1026, cp1140, cp1250, cp1251,
cp1252, cp1253, cp1254, cp1255, cp1256, cp1257, cp1258, cp424,
cp437, cp500, cp720, cp737, cp775, cp850, cp852, cp855, cp856,
cp857, cp858, cp860, cp861, cp862, cp863, cp864, cp865, cp866,
cp869, cp874, cp875, cp932, cp949, cp950, euc_jis_2004, euc_jisx0213,
euc_jp, euc_kr, gb18030, gb2312, gbk, hex_codec, hp_roman8, hz,
idna, iso2022_jp, iso2022_jp_1, iso2022_jp_2, iso2022_jp_2004,
iso2022_jp_3, iso2022_jp_ext, iso2022_kr, iso8859_1, iso8859_10,
iso8859_11, iso8859_13, iso8859_14, iso8859_15, iso8859_16,
iso8859_2, iso8859_3, iso8859_4, iso8859_5, iso8859_6, iso8859_7,
iso8859_8, iso8859_9, johab, koi8_r, koi8_u, latin_1, mac_arabic,
mac_centeuro, mac_croatian, mac_cyrillic, mac_farsi, mac_greek,
mac_iceland, mac_latin2, mac_roman, mac_romanian, mac_turkish,
mbcs, palmos, ptcp154, punycode, quopri_codec, raw_unicode_escape,
rot_13, shift_jis, shift_jis_2004, shift_jisx0213, string_escape, tactis,
tis_620, undefined, unicode_escape, unicode_internal, utf_16, utf_16_be,
utf_16_le, utf_32, utf_32_be, utf_32_le, utf_7, utf_8, utf_8_sig, uu_codec,
zlib_codec
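How do I determine which one is used in this particular PDF? And what if even that one leaves a few characters behind? Or must there be one encoding scheme among these that leaves no unintelligible characters at all?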
Answer 0 (score: 0)
This is how I solved it. I used a removegarbage function I found online to replace all non-alphanumeric characters.
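import re

def removegarbage(text):
    # Replace each run of non-word characters (anything except letters, digits, underscore) with a space
    text = re.sub(r'\W+', ' ', text)
    return text.lower()

# doc holds bytes; decoding turns the \uXXXX escapes back into characters
doc = removegarbage(doc.decode('raw_unicode_escape'))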
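The decode call is what turns the \uXXXX sequences back into real characters. Here is a minimal standard-library sketch of that round trip. It assumes that textract's encoding argument only controls how the extracted text is encoded on output; the sample string is a stand-in built from the snippet in the question, not actual textract output.

```python
# Hypothetical stand-in for the extracted text, taken from the question's snippet.
sample = "友邦保險 \u201cAIA\u201d Asia-Paci\ufb01c"

# raw_unicode_escape keeps code points below 256 as single bytes and
# writes every other character as a literal \uXXXX escape.
raw = sample.encode("raw_unicode_escape")
print(raw)

# Decoding with the same codec restores the original characters.
restored = raw.decode("raw_unicode_escape")
print(restored == sample)
```

Seen this way, the escapes in the text file are probably not extraction errors but simply how this codec spells characters outside Latin-1. Requesting utf-8 output and opening the file as UTF-8 should show the same characters directly.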
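If you open the txt file in a basic text editor such as Notepad, you will still see those unintelligible characters. But if you print the result in the console (or perhaps open it in a more advanced text editor?), you will see that those characters are gone: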
>>> print(doc)
'aia group limited 友邦保險控股有限公司 real life real impact a nnua l rep ort
2015 stock code 1299 vision purpose our vision is to be the pre eminent life
insurance provider in the asia pacific region that is our service to our
customers and our shareholders our purpose is to play a leadership role in
driving economic and social development across the region that is our
service to societies and their people about aia aia group limited and its
subsidiaries collectively aia or the group comprise the largest independent
publicly listed pan asian life insurance group it has a presence in 18
markets in asia pacific wholly owned branches and subsidiaries in hong kong
thailand singapore malaysia china korea the philippines australia indonesia
taiwan ... ... ...
Yes, the punctuation and capitalization are gone too, but that is fine, because neither matters for what I intend to do with the extracted text.
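If punctuation and capitalization do matter, a gentler cleanup is possible. The following is only a sketch of an alternative, not what I actually ran: it decodes the bytes the same way, uses NFKC normalization from the standard library to expand ligatures such as U+FB01 into plain "fi", and maps a few typographic marks to ASCII. The names clean_text and PUNCT_MAP are made up for illustration, and the mapping covers only the characters seen in the snippet above.

```python
import unicodedata

# Hypothetical mapping for illustration: typographic punctuation -> ASCII.
PUNCT_MAP = str.maketrans({
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u2013": "-", "\u2014": "-",   # en dash and em dash
})

def clean_text(raw_bytes):
    # Turn the literal \uXXXX escapes back into characters.
    text = raw_bytes.decode("raw_unicode_escape")
    # NFKC replaces compatibility characters (e.g. the fi ligature) with plain letters.
    text = unicodedata.normalize("NFKC", text)
    # Swap the remaining typographic marks for ASCII equivalents.
    return text.translate(PUNCT_MAP)

# Sample bytes in the same form as the file contents shown in the question.
raw = b"Asia-Paci\\ufb01c \\u2013 (collectively \\u201cAIA\\u201d)"
cleaned = clean_text(raw)
print(cleaned)
```

Writing the cleaned string with open(path, "w", encoding="utf-8") instead of in binary mode stores it as UTF-8, which a basic editor is more likely to display correctly.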