I am writing a Python script that uses pypff to open Outlook PST files and extract useful information. I am following the code posted on this page.
I am trying to get the attachment name for each email, but the only methods on the 'attachment' type are get_size(), read_buffer() and seek_offset(), none of which help me.
The read_buffer method returns a long string like \x00\x11\x00\x02\x01\x02\x02\x01\x03\x04\x07\x05\...
How do I decode it?
Answer 0 (score: 0)
You can try decoding with ascii first.
print((msg.get_attachment(0).read_buffer(attach_size)).decode('ascii', errors="ignore"))
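I think Microsoft uses several encodings for different parts of an attachment, so no single codec will decode it perfectly. If ascii does not decode enough of the content, you can try all of them. For other Python versions, check here.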
# 98 encodings in python3.5/6/7
decode = ['ascii','big5','big5hkscs','cp037','cp273',
'cp424','cp437','cp500','cp720','cp737',
'cp775','cp850','cp852','cp855','cp856',
'cp857','cp858','cp860','cp861','cp862',
'cp863','cp864','cp865','cp866','cp869',
'cp874','cp875','cp932','cp949','cp950',
'cp1006','cp1026','cp1125','cp1140','cp1250',
'cp1251','cp1252','cp1253','cp1254','cp1255',
'cp1256','cp1257','cp1258','cp65001','euc_jp',
'euc_jis_2004','euc_jisx0213','euc_kr','gb2312','gbk',
'gb18030','hz','iso2022_jp','iso2022_jp_1','iso2022_jp_2',
'iso2022_jp_2004','iso2022_jp_3','iso2022_jp_ext','iso2022_kr','latin_1',
'iso8859_2','iso8859_3','iso8859_4','iso8859_5','iso8859_6',
'iso8859_7','iso8859_8','iso8859_9','iso8859_10','iso8859_11',
'iso8859_13','iso8859_14','iso8859_15','iso8859_16','johab',
'koi8_r','koi8_t','koi8_u','kz1048','mac_cyrillic',
'mac_greek','mac_iceland','mac_latin2','mac_roman','mac_turkish',
'ptcp154','shift_jis','shift_jis_2004','shift_jisx0213','utf_32',
'utf_32_be','utf_32_le','utf_16','utf_16_be','utf_16_le',
'utf_7','utf_8','utf_8_sig']
# Select the best decoder
# Read the raw bytes once, outside the loop.
attachment = msg.get_attachment(0)
attach_size = attachment.get_size()
raw = attachment.read_buffer(attach_size)
items = []
for item in decode:  # the list defined above is named 'decode'
    content = raw.decode(item, errors="ignore")
    # I know 'sample_content' is in the attachment, so it's easy to see which ones can decode it.
    if 'sample_content' in content:
        items.append(item)
print(items)
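The list above was written for Python 3.5 to 3.7, and I would not assume every name exists on every version or platform. As a minimal sketch (the helper name available_codecs and the placeholder no_such_codec are made up for illustration), you could filter the list with codecs.lookup, which raises LookupError for a name the interpreter does not know:

```python
import codecs


def available_codecs(names):
    """Return only the codec names this interpreter can look up."""
    usable = []
    for name in names:
        try:
            codecs.lookup(name)
        except LookupError:
            # Unknown on this Python version or platform; skip it.
            continue
        usable.append(name)
    return usable


# 'no_such_codec' is a deliberately invalid, hypothetical name.
candidates = ['ascii', 'utf_8', 'cp1252', 'no_such_codec']
print(available_codecs(candidates))
```

Passing the full list through a filter like this before the loop should keep one missing codec from stopping the whole run.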
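If you do not know what the content is, you can try a workaround. For example, in the loop you can look for the codec that leaves the fewest "\x" escapes, because before decoding the content looks like "\x93\x93\xfa\x8c\xd3\x1a\xc6".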
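Here is one rough way that workaround might look. It is only a sketch and differs from the loop above: it uses errors="replace" instead of errors="ignore", so bytes that fail to decode are counted rather than silently dropped. The names score and rank_codecs are hypothetical, and the sample bytes stand in for what read_buffer would return.

```python
def score(raw, codec):
    """Count characters that look like decoding failures (lower is better).

    Returns None if the codec name is unknown to this interpreter.
    """
    try:
        text = raw.decode(codec, errors="replace")
    except LookupError:
        return None
    bad = 0
    for ch in text:
        # U+FFFD marks a byte sequence the codec could not decode.
        if ch == "\ufffd":
            bad += 1
        # Control characters other than common whitespace are suspicious.
        elif not ch.isprintable() and ch not in "\r\n\t":
            bad += 1
    return bad


def rank_codecs(raw, names):
    """Return (bad_count, codec) pairs sorted from best to worst."""
    ranked = []
    for name in names:
        bad = score(raw, name)
        if bad is not None:
            ranked.append((bad, name))
    ranked.sort()
    return ranked


# Stand-in for attachment bytes; real data would come from read_buffer().
raw = "héllo wörld".encode("utf_8")
for bad, name in rank_codecs(raw, ['ascii', 'utf_8']):
    print(name, bad)
```

Treat a low score as a hint only. Single-byte codecs such as latin_1 map every possible byte to some character, so they can score zero on data they do not really match.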
If anyone knows a better way to decode attachments, please comment here. Thanks.