How do I get email.Header.decode_header to work with non-ASCII characters?

Asked: 2015-06-18 06:10:39

Tags: python unicode utf-8 character-encoding non-ascii-characters

I borrowed the following code to parse email headers, and I also need to add a header. Admittedly, I don't fully understand the reason for all the scaffolding around what should be straightforward use of the email.Header module.

Note that Header is not instantiated; instead, its decode_header function is called:

class DecodedHeader(object):
    def __init__(self, s, folder):
        self.msg=email.message_from_string(s[1])
        self.info=parseList(s[0])
        self.folder=folder

    def __getitem__(self,name):
        if name.lower()=='folder': return self.folder
        elif name.lower()=='uid': return self.info[1][3]
        elif name.lower()=='flags': return ','.join(self.info[1][1])
        elif name.lower()=='internal-date':
            ds= self.info[1][5]
            if Options.dateFormat:
                ds= time.strftime(Options.dateFormat,imaplib.Internaldate2tuple('INTERNALDATE "'+ds+'"'))
            return ds
        elif name.lower()=='size': return self.info[1][7]
        val= self.msg.__getitem__(name)
        if val==None: return None
        return self._convert(email.Header.decode_header(val),name)
    def get(self,key,default=None):
        return self.__getitem__(key)

    def _convert(self, list, name):
        l=[]
        for s, encoding in list:
            try:
                if (encoding!=None):
                    s=unicode(s,encoding, 'replace').encode(Options.encoding,'replace')
            except Exception, e:
                print >>sys.stderr, "Encoding error", e
            l.append(s)

        res= "".join(l)
        if Options.addr and name.lower() in ('from','to', 'cc', 'return-path','reply-to' ): res=self._modifyAddr(res)
        if Options.dateFormat and name.lower() in ('date'): res = self._formatDate(res)
        return res

The problem: when the header (val) contains non-ASCII characters such as Ä and ä, I get:

Traceback (most recent call last):
  File "v12.py", line 434, in <module>
    main()
  File "v12.py", line 396, in main
    writer.writerow(msg)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/csv.py", line 152, in writerow
    return self.writer.writerow(self._dict_to_list(rowdict))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/csv.py", line 149, in _dict_to_list
    return [rowdict.get(key, self.restval) for key in self.fieldnames]
  File "v12.py", line 198, in get
    return self.__getitem__(key)
  File "v12.py", line 196, in __getitem__
    return self._convert(email.Header.decode_header(val),name)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/email/header.py", line 76, in decode_header
    header = str(header)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 1: ordinal not in range(128)

where u'\xe4' is ä.

I have tried a few things:

  • Adding # -*- coding: utf-8 -*- to the top of header.py
  • Calling unicode() on val before passing it to decode_header()
  • Calling .encode('utf-8') on val before passing it to decode_header()
  • Calling .encode('ISO-8859-1') on val before passing it to decode_header()

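None of the above worked. What is the reason? Given that I want to keep using Header as shown above (without instantiating Header directly), how can we ensure that decode_header() successfully decodes non-ASCII characters?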

1 Answer:

Answer 0 (score: 1)

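A header has to be encoded correctly before it can be decoded. It looks like val comes from an already existing message, so that message may be malformed. The error indicates that val is a Unicode string, but at that point it should be a byte string. The example in the Python help for email.header is straightforward.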
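To illustrate the error itself: the last frame of the traceback in the question is str(header), which on Python 2 implicitly encodes a unicode object with the ASCII codec. The following is a minimal sketch, assuming Python 3 (where that encode has to be written out explicitly) and a made-up header value; it reproduces the same failure at the same position:

```python
# Hypothetical header value containing one non-ASCII character (ä, U+00E4).
text = "M\xe4rk"

error = None
try:
    # Python 2's str(unicode_obj) performs this ASCII encode implicitly.
    text.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc

# The exception records which codec failed and where in the string.
print(error.encoding, error.start)
```

This suggests the fix belongs upstream of decode_header: the value passed in needs to be an ASCII-only, properly encoded header rather than text that already contains ä.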

The following encodes two headers that do not even use the same encoding:

>>> import email.header
>>> h = email.header.Header(u'To: Märk'.encode('iso-8859-1'),'iso-8859-1')
>>> h.append(u'From: Jòhñ'.encode('utf8'),'utf8')
>>> h
<email.header.Header instance at 0x00559F58>
>>> s = h.encode()
>>> s
'=?iso-8859-1?q?To=3A_M=E4rk?= =?utf-8?b?RnJvbTogSsOyaMOx?='

Note that a correctly encoded header is a byte string with the encoding names embedded in it, and it contains no non-ASCII characters.
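To make the "embedded encoding name" concrete, here is a rough sketch, assuming Python 3, that takes the second encoded word from the output above apart by hand. It is only an illustration of the =?charset?encoding?payload?= layout and not how the email package parses headers internally:

```python
import base64

# Second encoded word from the encoded header shown above.
encoded_word = "=?utf-8?b?RnJvbTogSsOyaMOx?="

# Strip the "=?" and "?=" delimiters, then split into the three fields.
charset, transfer_encoding, payload = encoded_word[2:-2].split("?")

# "b" marks a Base64 payload; "q" would mark quoted-printable.
raw_bytes = base64.b64decode(payload)
text = raw_bytes.decode(charset)

# The wire form is pure ASCII even though the decoded text is not.
is_ascii_only = all(ord(ch) < 128 for ch in encoded_word)
```

In real code, decode_header does this work and also handles the quoted-printable form.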

This decodes them:

>>> email.header.decode_header(s)
[('To: M\xe4rk', 'iso-8859-1'), ('From: J\xc3\xb2h\xc3\xb1', 'utf-8')]
>>> d = email.header.decode_header(s)
>>> for s,e in d:
...  print s.decode(e)
...
To: Märk
From: Jòhñ
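The session above is Python 2. As a minimal sketch, assuming Python 3 (where decode_header can return either bytes or str chunks), the same decoding loop could be wrapped in a helper. The name decode_to_text is hypothetical and not part of the email package:

```python
from email.header import Header, decode_header


def decode_to_text(raw_header):
    """Decode an RFC 2047 header value into one text string (hypothetical helper)."""
    parts = []
    for chunk, charset in decode_header(raw_header):
        if isinstance(chunk, bytes):
            # Chunks without a declared charset are treated as ASCII (an assumption).
            parts.append(chunk.decode(charset or "ascii", "replace"))
        else:
            # A header with no encoded words comes back as str already.
            parts.append(chunk)
    return "".join(parts)


# Round trip: encode a non-ASCII value, then decode it again.
encoded = Header("M\xe4rk", "iso-8859-1").encode()
decoded = decode_to_text(encoded)
print(encoded)  # ASCII-only wire form
```

Decoding to text first, and converting to an output encoding only when writing the result, keeps byte strings and Unicode strings from mixing in between.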