Regular expression for parsing email subjects with multiple encodings

Time: 2013-03-04 04:24:18

Tags: regex parsing email

Hi!

I want to match all inline encoded parts in a mail subject and build the subject string in UTF-8.

Some examples:

[Listname | Topic123] =?utf-8?Q?encodedtext?=
=?iso-8859-1?q?this=20is=20some=20text?=
Klartext-Betreff
[Listname | Topic123] =?utf-8?Q?encodedtext?= =?iso-8859-1?q?this=20is=20some=20text?=
=?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
    =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=

I have also received mails that use two different encodings in one subject (see the last example above).

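In emails, the subject can also be split across multiple lines, where every line except the first starts with at least one whitespace character.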
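The line folding described above can be undone before any decoding takes place. The following is a minimal sketch, assuming the raw header value is already available as a string; `unfold_header` is a hypothetical helper name, not something from the question.

```php
<?php
// Hypothetical helper: undo header folding by removing every line break
// that is directly followed by a space or tab. The whitespace that
// started the continuation line is kept.
function unfold_header(string $value): string
{
    return preg_replace('/\r?\n(?=[ \t])/', '', $value);
}

$folded = "=?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=\r\n"
        . "    =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=";

// The two encoded parts now sit on one line, separated by four spaces.
echo unfold_header($folded), "\n";
```

The lookahead leaves the leading whitespace of the continuation line in place, so only the line break itself disappears.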

So I am looking for a regular expression that parses:

Part+

Part is one of:

  • literal text, possibly containing spaces
  • =?charset?encoding?encoded text?=

I think it would look something like this:

ENC = (=\?)([A-Za-z0-9-]*)(\?)([A-Za-z0-9-]*)(\?)([Any Character])(\?=)
Part = ENC, or any run of characters that does not match ENC
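As a rough sketch, the Part+ grammar above might be written as a single PCRE pattern in PHP. It assumes that the encoding is limited to B or Q and that the encoded text contains neither whitespace nor `?`, which is what RFC 2047 specifies. `tokenize_subject` and the array keys are illustrative names only.

```php
<?php
// Hypothetical tokenizer: splits a subject into literal-text parts and
// encoded parts of the form =?charset?encoding?encoded text?=
function tokenize_subject(string $subject): array
{
    $enc = '=\?([^?\s]+)\?([BbQq])\?([^?\s]*)\?=';
    // Same shape without capture groups, used to stop the literal run.
    $encNoCapture = '=\?[^?\s]+\?[BbQq]\?[^?\s]*\?=';
    $pattern = '/' . $enc . '|((?:(?!' . $encNoCapture . ').)+)/s';

    preg_match_all($pattern, $subject, $matches, PREG_SET_ORDER);

    $parts = [];
    foreach ($matches as $m) {
        if (isset($m[4]) && $m[4] !== '') {
            // Group 4 only participates for a literal run.
            $parts[] = ['type' => 'text', 'text' => $m[4]];
        } else {
            $parts[] = [
                'type'     => 'encoded',
                'charset'  => $m[1],
                'encoding' => strtoupper($m[2]),
                'text'     => $m[3],
            ];
        }
    }
    return $parts;
}

print_r(tokenize_subject('[Listname | Topic123] =?utf-8?Q?encodedtext?='));
```

The second alternative consumes one character at a time for as long as no encoded part starts at the current position, which is one way to express "any run of characters that does not match ENC".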

1 answer:

Answer 0 (score: 0)

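// Decodes the encoded parts (=?charset?encoding?text?=) found in $string.
// $source_enc is the charset assumed for literal text outside encoded parts;
// $dest_enc is the charset of the returned string.
function decode ($string, $source_enc, $dest_enc)
{
    // With PREG_SPLIT_DELIM_CAPTURE the result repeats in groups of four:
    // literal text, charset, encoding, encoded text.
    $parts = preg_split (
        '/=\?([^?]+)\?([^?]+)\?([^?]+)\?=/',
        $string,
        -1, PREG_SPLIT_DELIM_CAPTURE);

    $result = "";

    for ($i = 0; $i < count ($parts); $i++)
    {
        $part = $parts [$i];

        if ($i % 4 == 0)
            // Literal text between encoded parts.
            $result .= iconv ($source_enc, $dest_enc, $part);
        else
        {
            // Consume the three captured groups of one encoded part.
            $charset = $parts [$i++];
            $encoding = strtoupper ($parts [$i++]);
            $text = $parts [$i];

            if ($encoding == 'Q')
                // In Q encoding an underscore stands for a space.
                $text = quoted_printable_decode (str_replace ('_', ' ', $text));
            else if ($encoding == 'B')
                $text = base64_decode ($text);

            $result .= iconv ($charset, $dest_enc, $text);
        }
    }

    return $result;
}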

echo (decode ("=?utf-8?Q?encodedtext?= =?iso-8859-1?q?this=20is=20some=20text?=
=?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
    =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=", 
    "ISO-8859-1", "ISO-8859-1"));

My output is:

encodedtext this is some text If you can read this yo u understand the example.
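The "yo u" in the output above comes from the line break and indentation between the last two encoded parts. RFC 2047 says that whitespace separating two adjacent encoded-words is to be ignored when displaying them. The following is a minimal sketch of a variant that applies this rule and converts everything to UTF-8, as the question asks. `decode_subject` is a hypothetical name, and the sketch assumes that literal text outside encoded parts needs no conversion.

```php
<?php
// Hypothetical variant of the decoder above.
function decode_subject(string $subject, string $destEnc = 'UTF-8'): string
{
    $word = '=\?[^?\s]+\?[BbQq]\?[^?\s]*\?=';

    // Drop whitespace that separates two adjacent encoded parts.
    $subject = preg_replace(
        '/(' . $word . ')\s+(?=' . $word . ')/',
        '$1',
        $subject
    );

    return preg_replace_callback(
        '/=\?([^?\s]+)\?([BbQq])\?([^?\s]*)\?=/',
        function (array $m) use ($destEnc): string {
            $text = strtoupper($m[2]) === 'B'
                ? base64_decode($m[3])
                // In Q encoding an underscore stands for a space.
                : quoted_printable_decode(str_replace('_', ' ', $m[3]));

            // Leave the encoded part untouched if the charset is unknown.
            $converted = @iconv($m[1], $destEnc, $text);
            return $converted === false ? $m[0] : $converted;
        },
        $subject
    );
}

echo decode_subject(
    "=?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=\n"
    . "    =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?="
), "\n";
// prints: If you can read this you understand the example.
```

PHP's iconv extension also provides `iconv_mime_decode()`, which may be worth trying before maintaining a hand-written decoder.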