UTF-8 quoted-printable multiline Subject for Thunderbird?

Time: 2015-07-04 14:54:16

Tags: utf-8 multiline thunderbird email-headers quoted-printable

Say I want to compose an email header with a UTF-8, quoted-printable encoded subject that reads "test — UNIX-утилита для проверки типа файла и сравнения значений". I can confirm the bytes of the characters with:

$ echo "UNIX-утилита ..." | perl utfinfo.pl
Got 16 uchars
Char: 'U' u: 85 [0x0055] b: 85 [0x55] n: LATIN CAPITAL LETTER U [Basic Latin]
Char: 'N' u: 78 [0x004E] b: 78 [0x4E] n: LATIN CAPITAL LETTER N [Basic Latin]
Char: 'I' u: 73 [0x0049] b: 73 [0x49] n: LATIN CAPITAL LETTER I [Basic Latin]
Char: 'X' u: 88 [0x0058] b: 88 [0x58] n: LATIN CAPITAL LETTER X [Basic Latin]
Char: '-' u: 45 [0x002D] b: 45 [0x2D] n: HYPHEN-MINUS [Basic Latin]
Char: 'у' u: 1091 [0x0443] b: 209,131 [0xD1,0x83] n: CYRILLIC SMALL LETTER U [Cyrillic]
Char: 'т' u: 1090 [0x0442] b: 209,130 [0xD1,0x82] n: CYRILLIC SMALL LETTER TE [Cyrillic]
Char: 'и' u: 1080 [0x0438] b: 208,184 [0xD0,0xB8] n: CYRILLIC SMALL LETTER I [Cyrillic]
...

So I tried to obtain the UTF-8 quoted-printable representation of it, for instance with Python's quopri:

$ python -c 'import quopri; a="test — UNIX-утилита для проверки типа файла и сравнения значений"; print(quopri.encodestring(a));'
test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=
=D1=8F =D0=BF=D1=80=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=
=D0=B0 =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0=D0=B2=D0=BD=
=D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9

...or with PHP's quoted_printable_encode, which gives exactly the same output:

$ php -r '$a="test — UNIX-утилита для проверки типа файла и сравнения значений"; echo quoted_printable_encode($a)."\n";'
test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=
=D1=8F =D0=BF=D1=80=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=
=D0=B0 =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0=D0=B2=D0=BD=
=D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9

So, as a test, I created a text file called test.eml and tried simply wrapping this output in =?UTF-8?Q? ... ?= markers on the Subject: line, making sure the line endings are CRLF (\r\n):

Message-Id: <4c428d27a41043e2b2b07e@example.com>
Subject: =?UTF-8?Q?test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=
=D1=8F =D0=BF=D1=80=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=
=D0=B0 =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0=D0=B2=D0=BD=
=D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9?=
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Hello world

...but if I open this in Thunderbird, I get a corrupt output:

test1

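I have read somewhere that multiple lines in long header fields are covered by RFC 822 "LONG HEADER FIELDS" and that, basically, the line ending should be followed by a space. So I indented the continuation lines with one space: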

Message-Id: <4c428d27a41043e2b2b07e@example.com>
Subject: =?UTF-8?Q?test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=
 =D1=8F =D0=BF=D1=80=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=
 =D0=B0 =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0=D0=B2=D0=BD=
 =D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9?=
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Hello world

...and I get a slightly different subject in Thunderbird, but it is still corrupt:

test2.png

Now, if I remove the =\r\n from the first three continuation lines, so that the subject is all on one line:

Message-Id: <4c428d27a41043e2b2b07e@example.com>
Subject: =?UTF-8?Q?test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=D1=8F =D0=BF=D1=80=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=D0=B0 =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0=D0=B2=D0=BD=D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9?=
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Hello world

...then Thunderbird actually displays the subject correctly:

test3.png

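...but then my header conflicts with the recommendation in RFC 2822, 2.1.1. Line Length Limits, which says "Each line of characters MUST be no more than 998 characters, and SHOULD be no more than 78 characters, excluding the CRLF."; in particular, it breaks the 78-character line limit.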

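So how can I obtain a correct multiline quoted-printable representation of a UTF-8 Subject header string, so that I can use it in an .eml file with lines of at most 78 characters and have Thunderbird read it correctly?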

2 Answers:

Answer 0 (score: 1)

When I ask Python to create an email with that subject, this is what it does:

$ python
Python 2.7.9 (default, Mar  1 2015, 18:22:53)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from email.message import Message
>>> from email.header import Header
>>> msg = Message()
>>> import quopri
>>> h = Header(quopri.decodestring('test =E2=80=94 UNIX-'
     '=D1=83=D1=82=D0=B8=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=D1=8F'
     '=D0=BF=D1=80=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8'
     '=D0=BF=D0=B0 =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8'
     '=D1=81=D1=80=D0=B0=D0=B2=D0=BD=D0=B5=D0=BD=D0=B8=D1=8F '
     '=D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9?='), 'UTF-8')
>>> msg['Subject'] = h
>>> print msg.as_string()
Subject: =?utf-8?b?dGVzdCDigJQgVU5JWC3Rg9GC0LjQu9C40YLQsCDQtNC70Y8g0L/RgNC+0LI=?=
 =?utf-8?b?0LXRgNC60Lgg0YLQuNC/0LAg0YTQsNC50LvQsCDQuCDRgdGA0LDQstC90LU=?=
 =?utf-8?b?0L3QuNGPINC30L3QsNGH0LXQvdC40Lk/?=


>>>

So it uses base64 encoding instead of quoted-printable, but based on this, my strong suspicion is that each line must begin and end its own escape (=?...?=).

Indeed:

>>> import email
>>> s = '''Subject: =?UTF-8?Q?test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8=D0?=
...  =?UTF-8?Q?=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=D1=8F =D0=BF=D1=80=D0?=
...  =?UTF-8?Q?=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=D0=B0?=
...  =?UTF-8?Q? =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0=D0?=
...  =?UTF-8?Q?=B2=D0=BD=D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0=D1?=
...  =?UTF-8?Q?=87=D0=B5=D0=BD=D0=B8=D0=B9?=
...
... Hello.
... '''
>>> e = email.message_from_string(s.replace('\n', '\r\n'))
>>> email.header.decode_header(e['Subject'])
[('test \xe2\x80\x94 UNIX-\xd1\x83\xd1\x82\xd0\xb8\xd0\xbb\xd0\xb8\xd1\x82\xd0\xb0 \xd0\xb4\xd0\xbb\xd1\x8f \xd0\xbf\xd1\x80\xd0\xbe\xd0\xb2\xd0\xb5\xd1\x80\xd0\xba\xd0\xb8 \xd1\x82\xd0\xb8\xd0\xbf\xd0\xb0 \xd1\x84\xd0\xb0\xd0\xb9\xd0\xbb\xd0\xb0 \xd0\xb8 \xd1\x81\xd1\x80\xd0\xb0\xd0\xb2\xd0\xbd\xd0\xb5\xd0\xbd\xd0\xb8\xd1\x8f \xd0\xb7\xd0\xbd\xd0\xb0\xd1\x87\xd0\xb5\xd0\xbd\xd0\xb8\xd0\xb9', 'utf-8')]
>>> decoded = email.header.decode_header(e['Subject'])
>>> print decoded[0][0].decode(decoded[0][1])
test — UNIX-утилита для проверки типа файла и сравнения значений

EDIT: However, even when the Subject above is put into an .eml file, Thunderbird fails again:

test4.png

...but this time it shows some of the characters correctly. In fact, the breakage occurs where the line is broken "in the middle of a character": if, for the character у (0xD1,0x83), =D1?= ends one line and Q?=83 starts the next, then Thunderbird cannot parse it. So, after manual rearrangement, the following snippet can be obtained:

Message-Id: <4c428d27a41043e2b2b07e@example.com>
Subject: =?UTF-8?Q?test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8?=
 =?UTF-8?Q?=D0=BB=D0=B8=D1=82=D0=B0 =D0=B4=D0=BB=D1=8F =D0=BF=D1=80?=
 =?UTF-8?Q?=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8 =D1=82=D0=B8=D0=BF=D0=B0?=
 =?UTF-8?Q? =D1=84=D0=B0=D0=B9=D0=BB=D0=B0 =D0=B8 =D1=81=D1=80=D0=B0?=
 =?UTF-8?Q?=D0=B2=D0=BD=D0=B5=D0=BD=D0=B8=D1=8F =D0=B7=D0=BD=D0=B0?=
 =?UTF-8?Q?=D1=87=D0=B5=D0=BD=D0=B8=D0=B9?=
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Hello world

...which opens correctly in Thunderbird as an .eml message (the same as the correctly displayed image in the question).
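The "middle of a character" failure can be reproduced outside Thunderbird. The following is a minimal sketch, assuming Python 3 and a reader that converts each encoded-word to text on its own (an inference from the symptoms described here, not a verified description of Thunderbird's parser). The helper names q_payload_to_bytes and decodes_alone are made up for illustration.

```python
import re


def q_payload_to_bytes(payload):
    """Turn the payload of one Q encoded-word into raw bytes (hypothetical helper)."""
    data = payload.replace("_", " ").encode("ascii")
    # Replace every =XX escape with the single byte it stands for.
    return re.sub(rb"=([0-9A-Fa-f]{2})",
                  lambda m: bytes([int(m.group(1), 16)]), data)


def decodes_alone(payload, charset="utf-8"):
    """True if the payload is valid text without borrowing bytes from a neighbor."""
    try:
        q_payload_to_bytes(payload).decode(charset)
        return True
    except UnicodeDecodeError:
        return False


# First encoded-word of the failing header: it ends on the lone lead byte 0xD0.
split_mid = "test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8=D0"
# First encoded-word after the manual rearrangement: it ends on a whole character.
split_clean = "test =E2=80=94 UNIX-=D1=83=D1=82=D0=B8"

print(decodes_alone(split_mid))    # False
print(decodes_alone(split_clean))  # True
```

RFC 2047 itself requires every encoded-word to contain a whole number of characters, so breaking only at character boundaries is the safe choice whatever the reader does.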

EDIT2: PHP also seems to get this right with mb_encode_mimeheader; the output of that call can be pasted directly into an .eml file.

Answer 1 (score: 1)

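The problem with test.eml is that your RFC 2047 encoding is broken. The Q encoding is based on quoted-printable, but it is not exactly the same. In particular, every space needs to be encoded as either =20 or _, and you cannot escape a line break with a trailing =.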

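Fundamentally, each =?...?= sequence needs to be a single, unambiguous RFC 822 token. You can either break your input up into multiple such tokens and leave the spaces between them unencoded, or encode the spaces. Note that whitespace between two such tokens is not significant, so encoding the spaces into the sequences makes more sense.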

Message-Id: <4c428d27a41043e2b2b07e@example.com>
Subject: =?UTF-8?Q?test_=E2=80=94_UNIX-=D1=83=D1=82=D0=B8=D0=BB?=
 =?UTF-8?Q?=D0=B8=D1=82=D0=B0_=D0=B4=D0=BB=D1=8F_=D0=BF=D1=80?=
 =?UTF-8?Q?=D0=BE=D0=B2=D0=B5=D1=80=D0=BA=D0=B8_=D1=82=D0=B8=D0=BF?=
 =?UTF-8?Q?=D0=B0_=D1=84=D0=B0=D0=B9=D0=BB=D0=B0_=D0=B8_=D1=81?=
 =?UTF-8?Q?=D1=80=D0=B0=D0=B2=D0=BD=D0=B5=D0=BD=D0=B8=D1=8F_?=
 =?UTF-8?Q?=D0=B7=D0=BD=D0=B0=D1=87=D0=B5=D0=BD=D0=B8=D0=B9?=
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Hello world
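This kind of folding can also be generated rather than arranged by hand. The following is a minimal sketch, assuming Python 3; fold_q_header and q_encode_char are hypothetical helpers, not library functions, and the set of characters left unencoded is a deliberately conservative choice. It encodes one character at a time, so no encoded-word ever splits a multi-byte character, and it starts a new encoded-word whenever the line would otherwise exceed 76 characters.

```python
import string
from email.header import decode_header, make_header

# Characters that may appear unencoded inside a Q encoded-word (conservative set).
SAFE = set(string.ascii_letters + string.digits + "!*+-/")


def q_encode_char(ch, charset="utf-8"):
    """Q-encode a single character; a space becomes an underscore."""
    if ch == " ":
        return "_"
    if ch in SAFE:
        return ch
    return "".join("=%02X" % b for b in ch.encode(charset))


def fold_q_header(name, text, charset="utf-8", max_line=76):
    """Build a folded header made of Q encoded-words (hypothetical helper)."""
    prefix, suffix = "=?%s?Q?" % charset.upper(), "?="
    lines = []
    lead = name + ": "   # continuation lines start with a single space instead
    current = ""
    for ch in text:
        piece = q_encode_char(ch, charset)
        used = len(lead) + len(prefix) + len(current) + len(suffix)
        if current and used + len(piece) > max_line:
            # Close this encoded-word and start a new one on the next line.
            lines.append(lead + prefix + current + suffix)
            lead, current = " ", ""
        current += piece
    lines.append(lead + prefix + current + suffix)
    return "\r\n".join(lines)


subject = "test \u2014 UNIX-утилита для проверки типа файла и сравнения значений"
folded = fold_q_header("Subject", subject)
print(folded)

# Round trip through the standard library's decoder as a sanity check.
value = folded[len("Subject: "):]
print(str(make_header(decode_header(value))) == subject)
```

Because the spaces are encoded inside the tokens, the whitespace that folding adds between encoded-words carries no meaning and is dropped on decoding.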

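Of course, as this demonstration shows, quoted-printable is not very readable at all, and it may take more space than base64, so in the end you may prefer to use the B encoding for everything.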

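Unless you are writing a MIME library yourself, the simple solution is not to care, and to let a library put these things together for you. PHP is more problematic (the standard library lacks this functionality, and the third-party libraries are somewhat uneven; find one you trust and stick with it), but in Python, just pass in a Unicode string and the library will encode it if necessary.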
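As a rough illustration of letting the library do the work, here is a minimal sketch assuming Python 3 and the standard email.header module. The library picks the encoding itself, so the result may well use B (base64) rather than Q.

```python
from email.header import Header, decode_header, make_header

subject = "test \u2014 UNIX-утилита для проверки типа файла и сравнения значений"

# header_name lets the library leave room for "Subject: " on the first line;
# linesep="\r\n" makes the folded lines end in CRLF as required in an .eml file.
encoded = Header(subject, charset="utf-8",
                 header_name="Subject").encode(linesep="\r\n")
print("Subject: " + encoded)

# Round trip: decoding the folded value gives back the original string.
decoded = str(make_header(decode_header(encoded)))
print(decoded == subject)
```

The printed Subject: line and its continuation lines can be pasted into an .eml file as they are.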