Question

How can I read a UTF-8 encoded text file in Mathematica?

This is what I am doing now:

text = Import["charData.txt", "Text", CharacterEncoding -> "UTF8"];

But it tells me

$CharacterEncoding::utf8: "The byte sequence {240} could not be interpreted as a character in the UTF-8 character encoding"

and so on. I am not sure why. I believe the file is valid UTF-8.

Here is the file I am trying to read:

http://dl.dropbox.com/u/38623/charData.txt

Answer 1

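Short version: Mathematica's UTF-8 functionality does not work for character codes with more than 16 bits. Use UTF-16 encoding instead, if possible. But be aware that Mathematica's treatment of character codes of 17 bits or more is generally buggy. The long version follows...

As numerous commenters have noted, the problem appears to lie in Mathematica's support for Unicode characters whose codes are larger than 16 bits. The first such character in the referenced text file is U+20B9B (𠮛), which appears on line 10.

Some versions of the Mathematica front end (such as 8.0.1 on 64-bit Windows 7) can handle the character in question when it is entered directly: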

In[1]:= $c="𠮛";

But we run into trouble if we try to create the character from its Unicode code point:

In[2]:= 134043 // FromCharacterCode

During evaluation of In[2]:= FromCharacterCode::notunicode:
A character code, which should be a non-negative integer less
than 65536, is expected at position 1 in {134043}. >>
Out[2]= FromCharacterCode[134043]

One wonders, then, what Mathematica thinks the code for this character is:

In[3]:= $c // ToCharacterCode
        BaseForm[%, 16]
        BaseForm[%, 2]

Out[3]= {55362,57243}
Out[4]//BaseForm= {d842, df9b}
Out[5]//BaseForm= {1101100001000010, 1101111110011011}

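The two codes we get exactly match the character's UTF-16 representation (a surrogate pair), rather than the single Unicode code point one might expect. Mathematica can perform the inverse transformation as well: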

In[6]:= {55362,57243} // FromCharacterCode

Out[6]= 𠮛
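As a cross-check outside Mathematica, the surrogate pair in Out[3] can be reproduced with plain integer arithmetic. This is a minimal Python sketch; the function name `to_surrogate_pair` is made up for illustration and is not part of any library.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


pair = to_surrogate_pair(134043)         # 134043 == 0x20B9B
print(pair)                              # (55362, 57243)
print([hex(u) for u in pair])            # ['0xd842', '0xdf9b']
```

The result matches the two values that ToCharacterCode returned, which supports the reading that Mathematica is exposing UTF-16 code units rather than code points.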

So what is Mathematica's notion of the UTF-8 encoding of this character?

In[7]:= ExportString[$c, "Text", CharacterEncoding -> "UTF8"] // ToCharacterCode
        BaseForm[%, 16]
        BaseForm[%, 2]

Out[7]= {237,161,130,237,190,155}
Out[8]//BaseForm= {ed, a1, 82, ed, be, 9b}
Out[9]//BaseForm= {11101101, 10100001, 10000010, 11101101, 10111110, 10011011}

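The attentive reader will notice that this is the UTF-8 encoding of the character's UTF-16 encoding: each surrogate has been encoded separately as a three-byte sequence. Can Mathematica decode this, um, interesting encoding?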

In[10]:= ImportString[
           ExportString[{237,161,130,237,190,155}, "Byte"]
         , "Text"
         , CharacterEncoding -> "UTF8"
         ]

Out[10]= 𠮛

Yes, it can! But... so what?

What about the real UTF-8 encoding of this character:

In[11]:= ImportString[
           ExportString[{240, 160, 174, 155}, "Byte"]
         , "Text"
         , CharacterEncoding -> "UTF8"
         ]
Out[11]= $CharacterEncoding::utf8: The byte sequence {240} could not be
interpreted as a character in the UTF-8 character encoding. >>
$CharacterEncoding::utf8: The byte sequence {160} could not be
interpreted as a character in the UTF-8 character encoding. >>
$CharacterEncoding::utf8: The byte sequence {174} could not be
interpreted as a character in the UTF-8 character encoding. >>
General::stop: Further output of $CharacterEncoding::utf8 will be suppressed
during this calculation. >>
ð ®

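...but here we see the failure reported in the original question.

What about UTF-16? UTF-16 is not on the list of valid character encodings, but "Unicode" is. Since we have seen that Mathematica seems to use UTF-16 as its native format, let's give it a whirl (using big-endian UTF-16 with a byte-order mark):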
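For comparison, the two byte sequences can be generated independently of Mathematica. The following Python sketch produces both the standard four-byte UTF-8 form and the six-byte form from Out[7]; the latter layout is commonly known as CESU-8. The variable names are illustrative only.

```python
ch = "\U00020B9B"

# Standard UTF-8: a single four-byte sequence for the code point.
real_utf8 = list(ch.encode("utf-8"))

# The form seen in Out[7]: each UTF-16 surrogate encoded on its own as a
# three-byte sequence. The "surrogatepass" error handler lets Python emit
# these bytes, which its strict UTF-8 encoder would otherwise refuse.
units = [0xD842, 0xDF9B]
surrogate_bytes = [
    b for u in units for b in chr(u).encode("utf-8", "surrogatepass")
]

print(real_utf8)        # [240, 160, 174, 155]
print(surrogate_bytes)  # [237, 161, 130, 237, 190, 155]
```

A strict UTF-8 decoder rejects the six-byte form, so a file written this way would be readable by Mathematica but not necessarily by other tools.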

In[12]:= ImportString[
           ExportString[
             FromDigits[#, 16]& /@ {"fe", "ff", "d8", "42", "df", "9b"}
             , "Byte"
           ]
         , "Text"
         , CharacterEncoding -> "Unicode"
         ]
Out[12]= 𠮛

It works. As a more complete experiment, I re-encoded the text file cited in the question into UTF-16 and imported it successfully.
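The answer does not say which tool was used for the re-encoding. One possible way to do it, sketched in Python, is shown here; the function name and the output file name are assumptions, and the layout (big-endian with an FE FF byte-order mark) simply mirrors the experiment in In[12].

```python
import codecs
from pathlib import Path


def reencode_utf8_to_utf16be(src: str, dst: str) -> None:
    """Rewrite a UTF-8 text file as big-endian UTF-16 with a byte-order mark."""
    # "utf-8-sig" also accepts input that starts with a UTF-8 BOM.
    text = Path(src).read_bytes().decode("utf-8-sig")
    # Explicit FE FF mark followed by big-endian 16-bit code units.
    Path(dst).write_bytes(codecs.BOM_UTF16_BE + text.encode("utf-16-be"))


# Hypothetical usage with the file from the question:
# reencode_utf8_to_utf16be("charData.txt", "charData-utf16.txt")
```

The converted file would then be imported with CharacterEncoding -> "Unicode" rather than "UTF8". Working on raw bytes instead of text-mode file handles keeps line endings exactly as they were in the source file.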

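The Mathematica documentation is largely silent on this subject. It is notable that mentions of Unicode in Mathematica appear to be accompanied by the assumption that character codes contain 16 bits. See, for example, the references to Unicode in Raw Character Encodings.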

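The conclusion to draw from this is that Mathematica's support for UTF-8 transcoding is missing or buggy for codes longer than 16 bits. UTF-16, the apparent internal format of Mathematica, appears to work correctly. So that is a workaround, provided you are in a position to re-encode your files and you can accept that the resulting strings will actually be sequences of UTF-16 code units rather than true Unicode strings.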
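To make the last caveat concrete: a string held as UTF-16 code units has one element per code unit, so a character such as U+20B9B counts as two. The Python sketch below, with an illustrative helper name, shows how a list of code units like the one returned in Out[3] could be folded back into code points.

```python
def code_units_to_code_points(units):
    """Combine UTF-16 surrogate pairs in a list of 16-bit code units."""
    points = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            low = units[i + 1]
            points.append(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # BMP character or unpaired surrogate: passed through unchanged.
            points.append(u)
            i += 1
    return points


units = [97, 55362, 57243]               # "a" followed by the surrogate pair
points = code_units_to_code_points(units)
print(len(units), len(points))           # 3 2
print(points)                            # [97, 134043]
```

Anything that depends on character counts or positions would need this kind of adjustment when the text contains characters outside the 16-bit range.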

Postscript

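A while after writing this response, I tried to re-open the Mathematica notebook that contains it. Every occurrence of the problematic character in the notebook had been wiped out and replaced with gibberish. I guess there are yet more Unicode bugs to iron out, even in Mathematica 8.0.1 ;)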
