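First off, I should say that I'm a complete beginner at Python. I have never learned the language; I just thought "how hard can it be" when Google turned up nothing but Python snippets to solve my problem. :)
I have a bunch of mailboxes in Maildir format (a backup from the mail server on my old web host), and I need to extract the emails from them. So far, the simplest way I've found is to convert them to the mbox format, which Thunderbird supports, and it seems Python has a few classes for reading and writing both formats. Seems perfect.
The Python docs even have this little code snippet that does exactly what I need: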
import mailbox
src = mailbox.Maildir('maildir', factory=None)
dest = mailbox.mbox('/tmp/mbox')
for msg in src: #1
    dest.add(msg) #2
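Except it doesn't work. And this is where my complete lack of Python knowledge comes in.
On some messages, I get a UnicodeDecodeError during the iteration (that is, when it tries to read msg from src, on line #1). On others, I get a UnicodeEncodeError when trying to add msg to dest (line #2).
Clearly it makes some wrong assumptions about the encoding used. But I have no clue how to specify an encoding on the mailbox (for that matter, I don't know what the encoding should be either, but I can probably figure that out once I find a way to actually specify one).
I get stack traces similar to the following: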
File "E:\Python30\lib\mailbox.py", line 102, in itervalues
value = self[key]
File "E:\Python30\lib\mailbox.py", line 74, in __getitem__
return self.get_message(key)
File "E:\Python30\lib\mailbox.py", line 317, in get_message
msg = MaildirMessage(f)
File "E:\Python30\lib\mailbox.py", line 1373, in __init__
Message.__init__(self, message)
File "E:\Python30\lib\mailbox.py", line 1345, in __init__
self._become_message(email.message_from_file(message))
File "E:\Python30\lib\email\__init__.py", line 46, in message_from_file
return Parser(*args, **kws).parse(fp)
File "E:\Python30\lib\email\parser.py", line 68, in parse
data = fp.read(8192)
File "E:\Python30\lib\io.py", line 1733, in read
eof = not self._read_chunk()
File "E:\Python30\lib\io.py", line 1562, in _read_chunk
self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
File "E:\Python30\lib\io.py", line 1295, in decode
output = self.decoder.decode(input, final=final)
File "E:\Python30\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 37: character maps to <undefined>
And on the UnicodeEncodeErrors:
File "E:\Python30\lib\email\message.py", line 121, in __str__
return self.as_string()
File "E:\Python30\lib\email\message.py", line 136, in as_string
g.flatten(self, unixfrom=unixfrom)
File "E:\Python30\lib\email\generator.py", line 76, in flatten
self._write(msg)
File "E:\Python30\lib\email\generator.py", line 108, in _write
self._write_headers(msg)
File "E:\Python30\lib\email\generator.py", line 141, in _write_headers
header_name=h, continuation_ws='\t')
File "E:\Python30\lib\email\header.py", line 189, in __init__
self.append(s, charset, errors)
File "E:\Python30\lib\email\header.py", line 262, in append
input_bytes = s.encode(input_charset, errors)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe5' in position 16: ordinal not in range(128)
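Can anyone help me out? (Suggestions for completely different solutions that don't involve Python are obviously welcome too. I just need a way to access and import the mails from these Maildir files.)
Update
sys.getdefaultencoding returns 'utf-8'
I've uploaded sample messages that cause both errors. This one throws the UnicodeEncodeError, and this one throws the UnicodeDecodeError.
I tried running the same script in Python 2.6 and got TypeErrors instead: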
File "c:\python26\lib\mailbox.py", line 529, in add
self._toc[self._next_key] = self._append_message(message)
File "c:\python26\lib\mailbox.py", line 665, in _append_message
offsets = self._install_message(message)
File "c:\python26\lib\mailbox.py", line 724, in _install_message
self._dump_message(message, self._file, self._mangle_from_)
File "c:\python26\lib\mailbox.py", line 220, in _dump_message
raise TypeError('Invalid message type: %s' % type(message))
TypeError: Invalid message type: <type 'instance'>
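Answer 0 (score: 4)
Try it in Python 2.5 or 2.6 instead of 3.0. 3.0 has completely different Unicode handling, and this module may not have been updated for 3.0 yet.
Answer 1 (score: 4)
Note
@Jimmy2Times may well be right that this module has not been updated for 3.0.
This is not so much an answer as a probable explanation of what is happening and how to reproduce it, so that other people can benefit from it. I am still working on completing this answer.
I have edited in everything I could find.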
=====
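Here is what I think is happening
Among the many other characters in your data, you have two characters, \x9d and \xe5, which were encoded in some encoding such as iso-8859-1.
When Python 3.0 comes across an encoded string, it first has to guess the string's encoding, and then uses the guessed encoding to decode it into unicode (which is how it keeps strings internally - Link).
I think the guessing part is where it goes wrong.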
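One way to see which encoding gets picked is to ask Python directly. This is a minimal sketch for a current Python 3, not something taken from the mailbox module; the helper name `report_default_encodings` is made up for illustration. The point it shows is that the value reported by `sys.getdefaultencoding()` is separate from the locale's preferred encoding, which is what text-mode files opened without an explicit `encoding=` have traditionally been decoded with.

```python
import locale
import sys


def report_default_encodings():
    """Return the encodings Python falls back on when none is named."""
    return {
        # used by str.encode() / bytes.decode() when no codec is given
        "sys_default": sys.getdefaultencoding(),
        # traditionally used for text-mode files opened without encoding=...
        "locale_preferred": locale.getpreferredencoding(False),
    }


print(report_default_encodings())
```

On a Western European Windows install the second value is commonly cp1252, which would be consistent with the cp1252.py frame in the first traceback even though sys.getdefaultencoding reports 'utf-8'.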
To show what is most likely happening -
Assume the real encoding is iso-8859-1 and the wrong guess is cp1252 (taken from the first traceback).
Decoding \x9d fails.
In [290]: unicode(u'\x9d'.encode('iso-8859-1'), 'cp1252')
---------------------------------------------------------------------------
<type 'exceptions.UnicodeDecodeError'> Traceback (most recent call last)
/home/jv/<ipython console> in <module>()
/usr/lib/python2.5/encodings/cp1252.py in decode(self, input, errors)
13
14 def decode(self,input,errors='strict'):
---> 15 return codecs.charmap_decode(input,errors,decoding_table)
16
17 class IncrementalEncoder(codecs.IncrementalEncoder):
<type 'exceptions.UnicodeDecodeError'>: 'charmap' codec can't decode byte 0x9d in position 0: character maps to <undefined>
Decoding \xe5 succeeds, but later, when the message is written back out somewhere inside Python, it tries to encode it as ascii, which fails.
In [291]: unicode(u'\xe5'.encode('iso-8859-1'), 'cp1252').encode('ascii')
---------------------------------------------------------------------------
<type 'exceptions.UnicodeEncodeError'> Traceback (most recent call last)
/home/jv/<ipython console> in <module>()
<type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode character u'\xe5' in position 0: ordinal not in range(128)
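For readers on Python 3, where bytes and str are separate types, the same two failures can be reproduced as below. This is only a sketch of the mechanism, not of what the mailbox module does internally, and `try_decode` and `try_encode` are hypothetical helper names.

```python
def try_decode(data, encoding):
    """Return the decoded text, or a short message if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "decode failed: %s" % exc.reason


def try_encode(text, encoding):
    """Return the encoded bytes, or a short message if encoding fails."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError as exc:
        return "encode failed: %s" % exc.reason


# byte 0x9d is valid in iso-8859-1 but has no mapping in cp1252
print(repr(try_decode(b"\x9d", "iso-8859-1")))
print(repr(try_decode(b"\x9d", "cp1252")))

# byte 0xe5 decodes fine under cp1252, but the result is not ASCII
text = try_decode(b"\xe5", "cp1252")
print(repr(text))
print(repr(try_encode(text, "ascii")))
```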
============
Edit:
Both of your problems are on line #2. It first decodes to unicode and then encodes to ascii.
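The second traceback in the question ends inside email/header.py, where a header value is being encoded as ascii. As a side note, mail headers normally carry non-ASCII text as RFC 2047 encoded-words, and one of the sample messages has a Subject header of exactly that form. A minimal sketch using the standard library's `email.header` in Python 3 follows; the two wrapper function names are hypothetical, and this is an illustration of the header format rather than a fix for the mailbox module.

```python
from email.header import Header, decode_header


def encode_header_value(text, charset="utf-8"):
    """Encode a possibly non-ASCII header value as ASCII encoded-words."""
    return Header(text, charset).encode()


def decode_header_value(value):
    """Decode a header value that may contain RFC 2047 encoded-words."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # undeclared charsets are treated as ascii here (an assumption)
            chunk = chunk.decode(charset or "ascii", errors="replace")
        parts.append(chunk)
    return "".join(parts)


encoded = encode_header_value("Kollegiernes Kontor i K\xf8benhavn")
print(encoded)
print(decode_header_value(encoded))
# Subject header taken from one of the sample messages
print(decode_header_value("=?UTF-8?B?WW5ncmVzYWdlbnMgTnloZWRzYnJldiAyMi44LjA4?="))
```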
First run easy_install chardet
The decode error:
In [75]: decd=open('jalf_decode_err','r').read()
In [76]: chardet.detect(decd)
Out[76]: {'confidence': 0.98999999999999999, 'encoding': 'utf-8'}
##this is what is tried at the back - my guess :)
In [77]: unicode(decd, 'cp1252')
---------------------------------------------------------------------------
<type 'exceptions.UnicodeDecodeError'> Traceback (most recent call last)
/home/jv/<ipython console> in <module>()
/usr/lib/python2.5/encodings/cp1252.py in decode(self, input, errors)
13
14 def decode(self,input,errors='strict'):
---> 15 return codecs.charmap_decode(input,errors,decoding_table)
16
17 class IncrementalEncoder(codecs.IncrementalEncoder):
<type 'exceptions.UnicodeDecodeError'>: 'charmap' codec can't decode byte 0x9d in position 2812: character maps to <undefined>'
##this is a FIX - this way all your messages are accepted
In [78]: unicode(decd, chardet.detect(decd)['encoding'])
Out[78]: u'Return-path: <root@apps2.servage.net>\nEnvelope-to: public@jalf.dk\nDelivery-date: Fri, 22 Aug 2008 16:49:53 -0400\nReceived: from [77.232.66.102] (helo=apps2.servage.net)\n\tby c1p.hostingzoom.com with esmtp (Exim 4.69)\n\t(envelope-from <root@apps2.servage.net>)\n\tid 1KWdZu-0003VX-HP\n\tfor public@jalf.dk; Fri, 22 Aug 2008 16:49:52 -0400\nReceived: from apps2.servage.net (apps2.servage.net [127.0.0.1])\n\tby apps2.servage.net (Postfix) with ESMTP id 4A87F980026\n\tfor <public@jalf.dk>; Fri, 22 Aug 2008 21:49:46 +0100 (BST)\nReceived: (from root@localhost)\n\tby apps2.servage.net (8.13.8/8.13.8/Submit) id m7MKnkrB006225;\n\tFri, 22 Aug 2008 21:49:46 +0100\nDate: Fri, 22 Aug 2008 21:49:46 +0100\nMessage-Id: <200808222049.m7MKnkrB006225@apps2.servage.net>\nTo: public@jalf.dk\nSubject: =?UTF-8?B?WW5ncmVzYWdlbnMgTnloZWRzYnJldiAyMi44LjA4?=\nFrom: Nyhedsbrev fra Yngresagen <info@yngresagen.dk>\nReply-To: info@yngresagen.dk\nContent-type: text/plain; charset=UTF-8\nX-Abuse: Servage.net Listid 16329\nMime-Version: 1.0\nX-mailer: Servage Maillist System\nX-Spam-Status: No, score=0.1\nX-Spam-Score: 1\nX-Spam-Bar: /\nX-Spam-Flag: NO\nX-ClamAntiVirus-Scanner: This mail is clean\n\n\nK\xe6re medlem\n\nH\xe5ber du har en god sommer og er klar p\xe5 at l\xe6se seneste nyt i Yngresagen. God forn\xf8jelse!\n\n\n::. KOM TIL YS-CAF\xc8 .::\nFlere og billigere ungdomsboliger, afskaf 24-\xe5rs-reglen eller hvad synes du? Yngresagen indbyder dig til en \xe5ben debat over kaffe og snacks. Yngresagens Kristian Lauta, Mette Marb\xe6k, og formand Steffen M\xf8ller fort\xe6ller om tidligere projekter og vil gerne diskutere, hvad Yngresagen skal bruge sin tid p\xe5 fremover. \nVil du diskutere et emne, du br\xe6nder for, eller vil du bare v\xe6re med p\xe5 en lytter?\nS\xe5 kom torsdag d. 28/8 kl. 17-19, Kulturhuset 44, 2200 KBH N \n \n::. 
VIND GAVEKORT & BLIV H\xd8RT .:: \nYngresagen har lavet et sp\xf8rgeskema, s\xe5 du har direkte mulighed for at sige din mening, og v\xe6re med til at forme Yngresagens arbejde. Brug 5 min. p\xe5 at dele dine holdninger om f.eks. uddannelse, arbejde og unges vilk\xe5r - og vind et gavekort til en musikbutik. Vi tr\xe6kker lod blandt alle svarene og finder tre heldige vindere. Sp\xf8rgeskemaet er her: www.yngresagen.dk\n\n::. YS SPARKER NORDJYLLAND I GANG .::\nNordjylland bliver Yngresagens sunde region. Her er regionsansvarlig Andreas M\xf8ller Stehr ved at starte tre projekter op: 1) L\xf8beklub, 2) F\xf8rstehj\xe6lpskursus, 3) Mad til unge-program.\nVi har brug for flere frivillige til at sparke projekterne i gang. Vi tilbyder gratis fede aktiviteter, gratis t-shirts og ture til K\xf8benhavn, hvor du kan m\xf8de andre unge i YS. Har det fanget din interesse, s\xe5 t\xf8v ikke med at kontakte os: nordjylland@yngresagen.dk tlf. 21935185. \n\n::. YNGRESAGEN I PRESSEN .::\nL\xe6s her et udsnit af sidste nyt om Yngresagen i medierne. L\xe6s og lyt mere p\xe5 hjemmesiden under \u201dYS i pressen\u201d.\n\n:: Radionyhederne: Unge skal informeres bedre om l\xe5n \nUnge ved for lidt om at l\xe5ne penge. Det udnytter banker og rejseselskaber til at give dem l\xe5n med t\xe5rnh\xf8je renter. S\xe5dan lyder det fra formand Steffen M\xf8ller fra landsforeningen Yngresagen. \n\n:: Danmarks Radio P1: Dansk Folkeparti - de \xe6ldres parti? \nHvorfor er det kun fattige \xe6ldre og ikke alle fattige, der kan s\xf8ge om at f\xe5 nedsat medielicens?\nDansk Folkepartis ungeordf\xf8rer, Karin N\xf8dgaard, og Yngresagens formand Steffen M\xf8ller debatterer medielicens, \xe6ldrecheck og indflydelse til unge \n\n:: Frederiksborg Amts Avis: Turen til Roskilde koster en holdning!\nFor at skabe et m\xf8de mellem politikere og unge fragter Yngresagen unge gratis til \xe5rets Roskilde Festival. 
Det sker med den s\xe5kaldte Yngrebussen, der kan l\xe6ses mere om p\xe5 www.yngrebussen.dk\n\n \n \nMed venlig hilsen \nYngresagen\n\nLandsforeningen Yngresagen\nKulturhuset Kapelvej 44\n2200 K\xf8benhavn N\n\ntlf. 29644960\ninfo@yngresagen.dk\nwww.yngresagen.dk\n\n\n-------------------------------------------------------\nUnsubscribe Link: \nhttp://apps.corecluster.net/apps/ml/r.php?l=16329&e=public%40jalf.dk%0D%0A&id=40830383\n-------------------------------------------------------\n\n'
Now it is unicode, so it should not give you any problems.
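If installing chardet is not an option, a cruder alternative is to try a fixed list of candidate encodings in order. To be clear, this is a different technique from chardet's statistical detection: it only reports the first candidate that decodes without error, which does not prove that the encoding is the right one. The function name and the candidate list are assumptions chosen for illustration, written for Python 3.

```python
def decode_with_fallback(data, candidates=("utf-8", "cp1252", "iso-8859-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # iso-8859-1 maps every byte, so this is only reached with a custom list
    return data.decode("ascii", errors="replace"), "ascii"


print(decode_with_fallback("K\xe6re medlem".encode("utf-8")))
print(decode_with_fallback(b"K\xf8benhavn"))
```

Putting utf-8 first is deliberate: it rejects most byte sequences that are not really utf-8, whereas the single-byte encodings accept almost anything.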
Now for the encode error: this one is more of a problem
In [129]: encd=open('jalf_encode_err','r').read()
In [130]: chardet.detect(encd)
Out[130]: {'confidence': 0.78187650822865284, 'encoding': 'ISO-8859-2'}
#even after the unicode conversion, encoding to ascii fails, because error handling is strict by default
In [131]: unicode(encd, chardet.detect(encd)['encoding']).encode('ascii')
---------------------------------------------------------------------------
<type 'exceptions.UnicodeEncodeError'> Traceback (most recent call last)
/home/jv/<ipython console> in <module>()
<type 'exceptions.UnicodeEncodeError'>: 'ascii' codec can't encode character u'\u0159' in position 557: ordinal not in range(128)'
##changing the criteria to ignore
In [132]: unicode(encd, chardet.detect(encd)['encoding']).encode('ascii', 'ignore')
Out[132]: 'Return-path: <info@kollegierneskontor.dk>\nEnvelope-to: alf@5elements.net\nDelivery-date: Tue, 21 Aug 2007 06:10:08 -0400\nReceived: from pfepc.post.tele.dk ([195.41.46.237]:52065)\n\tby c1p.hostingzoom.com with esmtp (Exim 4.66)\n\t(envelope-from <info@kollegierneskontor.dk>)\n\tid 1INQgX-0003fI-Un\n\tfor alf@5elements.net; Tue, 21 Aug 2007 06:10:08 -0400\nReceived: from local.com (ns2.datadan.dk [195.41.7.21])\n\tby pfepc.post.tele.dk (Postfix) with SMTP id ADF4C8A0086\n\tfor <alf@5elements.net>; Tue, 21 Aug 2007 12:10:04 +0200 (CEST)\nFrom: "Kollegiernes Kontor I Kbenhavn" <info@kollegierneskontor.dk>\nTo: "Jesper Alf Dam" <alf@5elements.net>\nSubject: Fornyelse af profil\nDate: Tue, 21 Aug 2007 12:10:03 +0200\nX-Mailer: Dundas Mailer Control 1.0\nMIME-Version: 1.0\nContent-Type: Multipart/Alternative;\n\tboundary="Gark=_20078211010346yhSD0hUCo"\nMessage-Id: <20070821101004.ADF4C8A0086@pfepc.post.tele.dk>\nX-Spam-Status: No, score=0.0\nX-Spam-Score: 0\nX-Spam-Bar: /\nX-Spam-Flag: NO\nX-ClamAntiVirus-Scanner: This mail is clean\n\n\n\n--Gark=_20078211010346yhSD0hUCo\nContent-Type: text/plain; charset=ISO-8859-1\nContent-Transfer-Encoding: Quoted-Printable\n\nHej Jesper Alf Dam=0D=0A=0D=0AHusk at forny din profil hos KKIK inden 28.=\n august 2007=0D=0ALog ind p=E5 din profil og benyt ikonet "forny".=0D=0A=0D=\n=0AVenlig hilsen=0D=0AKollegiernes Kontor i K=F8benhavn=0D=0A=0D=0Ahttp:/=\n/www.kollegierneskontor.dk/=0D=0A=0D=0A\n\n--Gark=_20078211010346yhSD0hUCo\nContent-Type: text/html; charset=ISO-8859-1\nContent-Transfer-Encoding: Quoted-Printable\n\n<html>=0D=0A<head>=0D=0A=0D=0A<style>=0D=0ABODY, TD {=0D=0Afont-family: v=\nerdana, arial, helvetica; font-size: 12px; color: #666666;=0D=0A}=0D=0A</=\nstyle>=0D=0A=0D=0A<title></title>=0D=0A=0D=0A</head>=0D=0A<body bgcolor=3D=\n#FFFFFF>=0D=0A<hr size=3D1 noshade>=0D=0A<table cellpadding=3D0 cellspaci=\nng=3D0 border=3D0 width=3D100%>=0D=0A<tr><td >=0D=0AHej Jesper Alf Dam<br=\n><br>Husk at forny din profil 
inden 28. august 2007<br>=0D=0ALog ind p=E5=\n din profil og benyt ikonet "forny".=0D=0A<br><br>=0D=0A<a href=3D"http:/=\n/www.kollegierneskontor.dk/">Klik her</a> for at logge ind.<br><br>Venlig=\n hilsen<br>Kollegiernes Kontor i K=F8benhavn=0D=0A</td></tr>=0D=0A</table=\n>=0D=0A<hr size=3D1 noshade>=0D=0A</body>=0D=0A</html>=0D=0A\n\n--Gark=_20078211010346yhSD0hUCo--\n\n'
In [133]: len(encd)
Out[133]: 2303
In [134]: len(unicode(encd, chardet.detect(encd)['encoding']).encode('ascii', 'ignore'))
Out[134]: 2302
Note: as you can see, this process can lose a small to moderate amount of data, so it is up to the user whether to use it.
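To make the trade-off concrete, here is a small Python 3 sketch comparing the standard error handlers that `str.encode` accepts. The helper names are made up for illustration. With 'ignore' the character simply disappears, which is how a sender name like "København" can end up as "Kbenhavn"; the other handlers leave a visible trace instead.

```python
def ascii_variants(text):
    """Encode text to ASCII under several standard error handlers."""
    handlers = ("ignore", "replace", "xmlcharrefreplace", "backslashreplace")
    return {handler: text.encode("ascii", handler) for handler in handlers}


def count_dropped(text):
    """Number of characters that 'ignore' would silently discard."""
    return len(text) - len(text.encode("ascii", "ignore"))


for handler, encoded in ascii_variants("K\xf8benhavn").items():
    print(handler, encoded)
print(count_dropped("K\xf8benhavn"))
```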
So the code would look something like
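import chardet
for msg in src:
    raw = str(msg)  # flatten the message to a byte string first (Python 2)
    # decode with the detected encoding, then drop anything that is not ASCII
    msg = unicode(raw, chardet.detect(raw)['encoding']).encode('ascii', 'ignore')
    dest.add(msg)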