Question

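I have been using Python to parse emails, and learning how to handle character encodings has been a challenge. I am using the native email library to create a msg object.

I start by grabbing the important header values and storing them as variables. As far as I can tell, this part works fine. Here is an example for the subject; the other header values are obtained in a similar way.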

from email.header import decode_header

try:
    subject_parts = decode_header(subject_var)
    # Parts without a declared charset fall back to raw-unicode-escape
    fixedsubjectLine = ' '.join(abytes.decode('raw-unicode-escape' if enc is None else enc) for (abytes, enc) in subject_parts)
except Exception:
    fixedsubjectLine = "Subject Failed to Parse"
    print "ERROR"
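
For comparison, here is a minimal sketch of the same header-decoding step in Python 3 syntax. The function name `decode_subject` and the sample header are made up for illustration and are not from the original code. It assumes that `decode_header` yields bytes when the header contains encoded words and a plain str when it contains none, which is how the Python 3 standard library behaves as far as I know.

```python
from email.header import Header, decode_header


def decode_subject(raw_subject, fallback="Subject Failed to Parse"):
    """Decode an RFC 2047 header value into text (hypothetical helper)."""
    try:
        pieces = []
        for value, charset in decode_header(raw_subject):
            # decode_header yields bytes when the header contains encoded
            # words, and a plain str when it contains none.
            if isinstance(value, bytes):
                value = value.decode(charset or "ascii", errors="replace")
            pieces.append(value)
        return "".join(pieces)
    except (LookupError, UnicodeError):
        # Unknown charset name or undecodable bytes
        return fallback


# Build a sample encoded header so the example is self-contained.
sample = "Fwd: " + Header("的质量投诉", "utf-8").encode()
decoded = decode_subject(sample)
```

Catching only `LookupError` and `UnicodeError` keeps unrelated bugs visible instead of hiding them behind the fallback string.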

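I then walk through the parts of the email to get the body and attachments, and write each payload to disk. Some of the parsed emails contain Chinese characters and are encoded. To write the file with the correct characters, I take the filename and encode it with the encoding indicated by the payload's charset.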

# Get the character encoding of this message part
char_set = part.get_content_charset()

# Get the filename of the message part, if it exists
filename = part.get_filename()
raw_name, name_enc = decode_header(filename)[0]
if name_enc is not None:
    passed_filename = raw_name.decode(name_enc)


# If there is a character set for the current email part
if char_set:
    filename = passed_filename.encode(char_set)
else:
    char_set = 'utf-8'
    filename = passed_filename.encode('utf-8')

try:
    filename = filename.replace("/", "")
    logger.info(chardet.detect(filename))
    with open(os.path.join(tmpDir, filename), 'wb') as fp:
        fp.write(part.get_payload(decode=True))
except Exception:
    print "ERROR"
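
As a side note, the charset returned by `get_content_charset()` describes the payload of the part, not its filename. Below is a rough sketch, in Python 3 syntax, of writing an attachment while keeping only the final path component of the name. The helper `save_attachment` and the sample filename are hypothetical. It assumes the filename has already been decoded to text, so the operating system handles the on-disk encoding.

```python
import os
import tempfile


def save_attachment(directory, filename, payload):
    """Write payload bytes under directory, keeping only the final path component."""
    # Normalise Windows separators, then drop any directory part such as "../".
    safe_name = os.path.basename(filename.replace("\\", "/"))
    if safe_name in ("", ".", ".."):
        safe_name = "unnamed"
    path = os.path.join(directory, safe_name)
    with open(path, "wb") as fp:
        fp.write(payload)
    return path


tmp_dir = tempfile.mkdtemp()
saved_path = save_attachment(tmp_dir, "../reports/summary.xlsm", b"example bytes")
```

Taking the basename is a slightly stricter alternative to removing "/" characters, since it also discards any leading directory names.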

For each iteration of msg.walk(), I append a dict of the values to a list.

email_parts.append ({"msg" : (os.path.join(tmpDir, filename)), "subject" : fixedsubjectLine, "from" : fixedfromAdd, "to" : fixedtoAdd, "sent" : timestamp ,"encoding" : char_set})

Before appending to the list, I check the encoding of the filename and print it to a log file using the logging library.

{'confidence': 0.99, 'encoding': 'utf-8'}
 的质量投诉 客服审核中.xlsm

I then write the list to a file, as follows.

try:
    emailPartsFile = os.path.join(tmpDir, 'email_parts.txt')
    with open(emailPartsFile, 'w') as f:
        for item in email_parts:
            f.write(str({"msg" : item['msg'], "subject" : item['subject'], "from" : item['from'], "to" : item['to'], "sent" : item['sent'] ,"encoding" : item['encoding']}))
            f.write('\n')

    os.chmod(emailPartsFile, 0755)
    os.chown(emailPartsFile, 1000, 1000)
    return emailPartsFile
except Exception:
    print "ERROR"

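This gives me a file formatted as shown below. In this log you can see the two parts of the email that were created from the email content, followed by the file attachment. This is the filename from above (的质量投诉客服审核中.xlsm).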

{'from': u'john smith <johnsmith@gmail.com>', 'encoding': 'utf-8', 'to': 'johnsmith@mailserver.com', 'msg': 'part-001.html', 'sent': 'Thu, 09 May 2016 01:48:24 -0000', 'subject': u'Fwd: \u7684\u8d28\u91cf\u6295\u8bc9 \u5ba2\u670d\u5ba1\u6838\u4e2d'}
{'from': u'john smith <johnsmith@gmail.com>', 'encoding': 'utf-8', 'to': 'johnsmith@mailserver.com', 'msg': 'part-002.html', 'sent': 'Thu, 09 May 2016 01:48:24 -0000', 'subject': u'Fwd: \u7684\u8d28\u91cf\u6295\u8bc9 \u5ba2\u670d\u5ba1\u6838\u4e2d'}
{'from': u'john smith <johnsmith@gmail.com>', 'encoding': 'utf-8', 'to': 'johnsmith@mailserver.com', 'msg': '\xe7\x9a\x84\xe8\xb4\xa8\xe9\x87\x8f\xe6\x8a\x95\xe8\xaf\x89 \xe5\xae\xa2\xe6\x9c\x8d\xe5\xae\xa1\xe6\xa0\xb8\xe4\xb8\xad.xlsm', 'sent': 'Thu, 09 May 2016 01:48:24 -0000', 'subject': u'Fwd: \u7684\u8d28\u91cf\u6295\u8bc9 \u5ba2\u670d\u5ba1\u6838\u4e2d'}

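When the file is written to disk, the filename is decoded correctly and retains the right Chinese characters.

The problem, as you can see, is that in the text file the filename and the header values are still written in their escaped, encoded form.

To troubleshoot, I wrote these values to a file using the logging library: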

from logging.handlers import SysLogHandler
logger.info('|filename:'+tmpDir + '|Email_TIMESTAMP:'+sentTime + '|Created_TIMESTAMP:'+timestamp + '|TO:'+fixedtoAdd + '|FROM:'+fixedfromAdd + '|SUBJECT:' + fixedsubjectLine + '|'+ char_set+ '|')

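When written to the log file, the values are also decoded correctly. The problem seems to occur only when I write the list directly to a file. I also tried a simple print statement, and it prints correctly.

Ideally, I would like the Chinese characters written to the file. Any ideas?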
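
A likely explanation for the difference is that Python 2's `str()` on a dict applies `repr()` to every key and value, and `repr()` escapes non-ASCII characters, whereas logging or printing a string directly does not go through `repr()`. The sketch below illustrates this in Python 3 syntax, using the built-in `ascii()` to stand in for Python 2's `repr()`. The variable names are made up for illustration.

```python
record = {"subject": "Fwd: \u7684\u8d28\u91cf\u6295\u8bc9"}

# ascii() escapes non-ASCII characters the way Python 2's repr() did;
# Python 2's str(dict) applies repr() to each key and value.
as_container = ascii(record)

# Concatenating the string itself never goes through repr(),
# so the characters stay readable.
as_text = "SUBJECT:" + record["subject"]
```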

Answer 1

When writing to the file, use the following:

f = open(emailPartsFile,'wb')

...

f.write(str(<YOUR_DICT>).encode('utf8'))

When you open the file, the Chinese characters should be there.
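
One caveat worth checking: calling `str()` on a dict still goes through `repr()`, so the escape sequences may survive even after encoding. A possible alternative, sketched here in Python 3 syntax, is to serialize each record with `json.dumps(..., ensure_ascii=False)` and write it to a file opened as UTF-8. The function name `write_email_parts` and the sample record are assumptions for illustration. In Python 2, byte-string values such as the encoded filename would probably need to be decoded to unicode first.

```python
import io
import json
import os
import tempfile


def write_email_parts(path, email_parts):
    """Write one JSON object per line, keeping non-ASCII characters readable."""
    with io.open(path, "w", encoding="utf-8") as f:
        for item in email_parts:
            # ensure_ascii=False stops json from emitting \uXXXX escapes.
            f.write(json.dumps(item, ensure_ascii=False, sort_keys=True))
            f.write("\n")
    return path


# Hypothetical sample record, modelled on the output shown in the question.
parts = [{
    "msg": "的质量投诉 客服审核中.xlsm",
    "subject": "Fwd: 的质量投诉 客服审核中",
    "encoding": "utf-8",
}]
out_path = write_email_parts(os.path.join(tempfile.mkdtemp(), "email_parts.txt"), parts)

with io.open(out_path, encoding="utf-8") as f:
    written = f.read()
```

Writing one JSON object per line also makes the file easy to read back with `json.loads`, which the `str(dict)` format does not offer directly.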
