Question

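I am writing a parser for Thunderbird mail files.

Input: I have a file containing a large number of emails. Most of the file is written in ANSI (Windows-1250), but the message content is in UTF-8 or ISO-8859-2, as declared by each mail's Content-Type header.

Output: A collection of message contents (bodies).

This is what I do: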

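Read the whole file into a byte[] variable (still ANSI).
Convert it to a String (UTF-16, but built from the ANSI bytes). I need a String at this point in order to do the next step (splitting the bundle of messages -> single messages).
Split the bundle into individual messages and add each message to a Collection (UTF-16).
Check each message's Content-Type.
Using the JavaMail API, I call mail.getContent (I think the result is UTF-16, but I am not sure about the internal encoding).
Here is my problem: I end up with what I guess is a UTF-16 String whose content is really, for example, ISO-8859-2. What should I do now?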
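
One way to sidestep the problem in the steps above is to never decode the file with a charset that can lose bytes. A minimal sketch, assuming the separator lines begin with "From - " as in the sample input and that the charset can be picked out of the Content-Type header with a simple regex: decode the file as ISO-8859-1 only for splitting (every byte maps to exactly one char), then turn each message back into its original bytes and decode those with the declared charset. The names MboxDecoder, splitRaw, declaredCharset and decode are hypothetical, and the windows-1250 fallback is an assumption based on the description of the file. The sketch ignores Content-Transfer-Encoding (quoted-printable, base64).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MboxDecoder {
    // Matches charset=utf-8 as well as charset="iso-8859-2".
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*\"?([^\";\\s]+)\"?", Pattern.CASE_INSENSITIVE);

    // Assumption: every message starts with a line beginning "From - ".
    // (?d) limits line terminators to '\n', so a raw 0x85 byte is not
    // mistaken for a line break after the ISO-8859-1 decode.
    private static final Pattern SEPARATOR = Pattern.compile("(?md)^From - ");

    // Splits the file into messages without losing any bytes:
    // ISO-8859-1 maps each byte 0x00..0xFF to the char with the same value.
    static List<String> splitRaw(byte[] fileBytes) {
        String raw = new String(fileBytes, StandardCharsets.ISO_8859_1);
        List<String> messages = new ArrayList<>();
        Matcher m = SEPARATOR.matcher(raw);
        int start = -1;
        while (m.find()) {
            if (start >= 0) {
                messages.add(raw.substring(start, m.start()));
            }
            start = m.start();
        }
        if (start >= 0) {
            messages.add(raw.substring(start));
        }
        return messages;
    }

    // Reads the charset from the first charset=... parameter found.
    static Charset declaredCharset(String rawMessage) {
        Matcher m = CHARSET.matcher(rawMessage);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Unknown or malformed charset name: fall through to the default.
            }
        }
        return Charset.forName("windows-1250"); // assumed default for this file
    }

    // Recovers the original bytes of one message and decodes them properly.
    static String decode(String rawMessage) {
        byte[] original = rawMessage.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, declaredCharset(rawMessage));
    }
}
```

With JavaMail, the recovered bytes of one message could instead be wrapped in a ByteArrayInputStream and passed to the MimeMessage(Session, InputStream) constructor, so that the library decodes the body itself; that variant is not shown here.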

I have been experimenting with Charset and new String(byte[], String charsetName), but nothing I tried has worked.

What I tried:

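Convert the final String from UTF-16 -> UTF-8 (because it uses the same number of bytes as 8859-2).
Get the bytes from UTF-8 and encode them as ANSI.
Decode the ANSI as UTF-8.
Encode the UTF-8 as ISO-8859-2 (or leave it as is, if it is already UTF-8).
Decode from ISO-8859-2. But it did not give me any good result.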
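
A likely reason a chain like the one above fails is that decoding bytes with the wrong charset is generally not reversible. The following illustration (RoundTripDemo and survivesRoundTrip are hypothetical names) takes ISO-8859-2 bytes, decodes them through an intermediate charset, and re-encodes them: UTF-8 replaces the malformed sequences with U+FFFD and the original bytes are gone, whereas ISO-8859-1 returns exactly the bytes that went in.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RoundTripDemo {
    // True if decoding with viaCharset and encoding again yields the same bytes.
    static boolean survivesRoundTrip(byte[] original, Charset viaCharset) {
        String decoded = new String(original, viaCharset);
        return Arrays.equals(original, decoded.getBytes(viaCharset));
    }

    public static void main(String[] args) {
        // "banalną uwagę" encoded as ISO-8859-2 (single bytes 0xB1 and 0xEA).
        byte[] latin2 = "banaln\u0105 uwag\u0119".getBytes(Charset.forName("ISO-8859-2"));

        // 0xB1 and a trailing 0xEA are not valid UTF-8, so they become U+FFFD.
        System.out.println(survivesRoundTrip(latin2, StandardCharsets.UTF_8));      // prints false

        // ISO-8859-1 maps every byte to one char and back.
        System.out.println(survivesRoundTrip(latin2, StandardCharsets.ISO_8859_1)); // prints true
    }
}
```

Once a wrong decode has destroyed bytes, no later sequence of encode and decode calls can bring them back, so the fix has to happen where the file is first turned into a String.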

How should I approach this? There is too much decoding going on and my head is spinning.

Input (this was saved as a cp1250 file, but I converted it to UTF-8):

  From - Thu Dec 08 15:06:14 2011
(some mail header stuff....)
Content-Type: text/html; charset="iso-8859-2"
<table border="0" cellspacing="0" width="600"><tbody><tr><th class="ffield2"><span class="cald-word">clich&eacute;d</span> </th><td class="field1"><br>
banal; <b>banalny<b>
<br>
She made a <span class="cald-word">clich&eacute;d remark about the importance of friendship.</span>
<br>
<b>Wygԯsiԡ jakѶ banalnѠuwagꡯ wadze przyjaݮi . <br>
<b>
<b> <b><br>
</td></tr></tbody></table>
From - Thu Dec 08 15:42:09 2011
Content-Type: text/html; charset=utf-8
(some mail header stuff....)
<table border="0" cellspacing="0" width="600"><tbody><tr><th class="ffield2">nosiness</th><td class="field1"><br>
<br>
interest in somebody else's business; <b>wścibstwo<b>
<br>
Nosiness is something I can't stand, so stop asking such questions.
<br>
<b>Nie znoszę wścibstwa, więc przestań zadawać takie pytania. <b><b> <br>
<b>
</td></tr></tbody></table>

Getting data from a multi-encoding file

0 answers: