Question

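I'm using the FileUpload server control to upload an HTML document that was previously saved from MS Word (as "Web Page, Filtered"). The charset is windows-1252. The document contains smart (curly) quotes as well as regular quotes. It also has some blank spaces that, on closer inspection, turn out to be characters other than a normal TAB or SPACE.

When I capture the file contents in a StreamReader, those special characters are converted to question marks. I assume this is because the default encoding is UTF-8 and the file is Unicode.

I went ahead and created the StreamReader with the Unicode encoding, then replaced all the unwanted characters with the right ones (using code I actually found on Stack Overflow). This seems to work... except that I can't convert the string back to UTF-8 to display it in an asp:Literal. The code is there and it should work... but the output (of ConvertToASCII) is unreadable.

Please see below: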

    protected void btnUpload_Click(object sender, EventArgs e)
    {
        StreamReader sreader;
        if (uplSOWDoc.HasFile)
        {
            try
            {
                if (uplSOWDoc.PostedFile.ContentType == "text/html" || uplSOWDoc.PostedFile.ContentType == "text/plain")
                {
                    sreader = new StreamReader(uplSOWDoc.FileContent, Encoding.Unicode);
                    string sowText = sreader.ReadToEnd();
                    sowLiteral.Text = ConvertToASCII(sowText);
                    lblUploadResults.Text = "File loaded successfully.";
                }
                else
                    lblUploadResults.Text = "Upload failed. Just text or html files are allowed.";
            }
            catch(Exception ex)
            {
                lblUploadResults.Text = ex.Message;
            }
        }
    }

    private string ConvertToASCII(string source)
    {
        if (source.IndexOf('\u2013') > -1) source = source.Replace('\u2013', '-');
        if (source.IndexOf('\u2014') > -1) source = source.Replace('\u2014', '-');
        if (source.IndexOf('\u2015') > -1) source = source.Replace('\u2015', '-');
        if (source.IndexOf('\u2017') > -1) source = source.Replace('\u2017', '_');
        if (source.IndexOf('\u2018') > -1) source = source.Replace('\u2018', '\'');
        if (source.IndexOf('\u2019') > -1) source = source.Replace('\u2019', '\'');
        if (source.IndexOf('\u201a') > -1) source = source.Replace('\u201a', ',');
        if (source.IndexOf('\u201b') > -1) source = source.Replace('\u201b', '\'');
        if (source.IndexOf('\u201c') > -1) source = source.Replace('\u201c', '\"');
        if (source.IndexOf('\u201d') > -1) source = source.Replace('\u201d', '\"');
        if (source.IndexOf('\u201e') > -1) source = source.Replace('\u201e', '\"');
        if (source.IndexOf('\u2026') > -1) source = source.Replace("\u2026", "...");
        if (source.IndexOf('\u2032') > -1) source = source.Replace('\u2032', '\'');
        if (source.IndexOf('\u2033') > -1) source = source.Replace('\u2033', '\"');


        byte[] sourceBytes = Encoding.Unicode.GetBytes(source);
        byte[] targetBytes = Encoding.Convert(Encoding.Unicode, Encoding.ASCII, sourceBytes);
        char[] asciiChars = new char[Encoding.ASCII.GetCharCount(targetBytes, 0, targetBytes.Length)];
        Encoding.ASCII.GetChars(targetBytes, 0, targetBytes.Length, asciiChars, 0);

        string result = new string(asciiChars);

        return result;

    }

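Also, as I said before, there are a few more "transparent" characters that seem to correspond to the places where the Word document has numbered-list indentation, and I don't know how to capture their Unicode values so I can replace them... so if you have any tips, please let me know.

Many thanks in advance!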

Answer 1

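According to the StreamReader documentation on MSDN:

"The StreamReader object attempts to detect the encoding by looking at the first three bytes of the stream. It automatically recognizes UTF-8, little-endian Unicode, and big-endian Unicode text if the file starts with the appropriate byte order marks. Otherwise, the user-provided encoding is used."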

So if the uploaded file's charset is windows-1252, then your line:

    sreader = new StreamReader(uplSOWDoc.FileContent, Encoding.Unicode);

is incorrect, because the file content is not Unicode-encoded. Instead, use:

    sreader = new StreamReader(uplSOWDoc.FileContent,
                      Encoding.GetEncoding("Windows-1252"), true);

The final boolean parameter tells the reader to detect a BOM.
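As a rough sketch of the same call outside ASP.NET, the snippet below uses a MemoryStream in place of uplSOWDoc.FileContent. The class name Cp1252Demo, the ReadHtml helper and the sample bytes are made up for illustration. It also assumes a modern .NET runtime, where code page 1252 has to be registered through CodePagesEncodingProvider first; on the classic .NET Framework that Web Forms runs on, 1252 is built in and the static constructor can be dropped.

```csharp
using System;
using System.IO;
using System.Text;

public static class Cp1252Demo
{
    static Cp1252Demo()
    {
        // Assumption: running on .NET Core / .NET 5+, where legacy code pages
        // must be registered before Encoding.GetEncoding can find them.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Decodes the stream as Windows-1252 unless it starts with a Unicode BOM.
    public static string ReadHtml(Stream content)
    {
        using (var reader = new StreamReader(content, Encoding.GetEncoding("Windows-1252"), true))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // 0x93 and 0x94 are the curly double quotes in Windows-1252.
        byte[] fileBytes = { 0x93, 0x48, 0x69, 0x94 };
        string text = ReadHtml(new MemoryStream(fileBytes));

        // Print code points rather than glyphs so the console encoding does not matter.
        foreach (char c in text)
            Console.Write("U+{0:X4} ", (int)c);
        Console.WriteLine(); // U+201C U+0048 U+0069 U+201D
    }
}
```

The decoded string already holds the real curly-quote characters, so no replacement pass is needed afterwards.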

Answer 2

    sreader = new StreamReader(uplSOWDoc.FileContent, Encoding.Unicode);

Congratulations, you are the one-millionth coder to be bitten by "Encoding.Unicode".

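There is no such thing as "the Unicode encoding". Unicode is a character set, and it has many different encodings.

Encoding.Unicode is actually the specific encoding UTF-16LE, in which characters are encoded as UTF-16 "code units" and each 16-bit code unit is then written to bytes in little-endian order. This is the native in-memory Unicode string format of Windows NT, but you almost never want to use it for reading or writing files. As a 2-byte-per-unit encoding it is not ASCII-compatible, and it is not very efficient for storage or on the wire.

These days UTF-8 is a much more common encoding for Unicode text. But Microsoft's misnaming of UTF-16LE as "Unicode" continues to confuse and fool users who just want to "support Unicode". Because Encoding.Unicode is not ASCII-compatible, using it to read a file that is in an ASCII-superset encoding (such as UTF-8 or a Windows default code page like 1252, Western European) will make a complete mess of everything, not just the non-ASCII characters.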
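To make the difference concrete, here is a small sketch that decodes the same four single-byte characters with both wrong encodings. The class name WrongEncodingDemo, the CodePoints helper and the sample bytes are invented for illustration; only Encoding.Unicode and Encoding.UTF8 come from the framework.

```csharp
using System;
using System.Text;

public static class WrongEncodingDemo
{
    // Formats each UTF-16 code unit of a string as U+XXXX.
    public static string CodePoints(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s)
            sb.AppendFormat("U+{0:X4} ", (int)c);
        return sb.ToString().TrimEnd();
    }

    public static void Main()
    {
        // Single-byte file content: a curly quote (0x93 in Windows-1252), then "Hi!".
        byte[] fileBytes = { 0x93, 0x48, 0x69, 0x21 };

        // UTF-16LE glues every two bytes into one code unit,
        // so even the plain ASCII letters are lost.
        Console.WriteLine(CodePoints(Encoding.Unicode.GetString(fileBytes)));
        // U+4893 U+2169

        // UTF-8 keeps the ASCII bytes but cannot decode the lone 0x93,
        // so it substitutes the replacement character U+FFFD.
        Console.WriteLine(CodePoints(Encoding.UTF8.GetString(fileBytes)));
        // U+FFFD U+0048 U+0069 U+0021
    }
}
```

The U+FFFD in the UTF-8 result is probably what showed up as question marks in the question, while the UTF-16LE result has nothing recognizable left to repair.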

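In this case the stored file's encoding is Windows code page 1252. So read it with:

    sreader = new StreamReader(uplSOWDoc.FileContent, Encoding.GetEncoding(1252));

And I'd leave it at that. Don't bother trying to "convert to ASCII". Those smart quotes are perfectly good characters and should be supported like any other Unicode character; if you are having trouble displaying smart quotes, you are probably also mangling all other non-ASCII characters. Better to fix the problem that is causing that than to try to avoid it for just a few common cases.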
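For the "transparent" characters mentioned in the question, one way to find out what they are is to list every non-ASCII character in the correctly decoded string together with its code point. The sketch below is only an illustration: NonAsciiReport and Count are hypothetical names, and the sample text with no-break spaces (U+00A0) is a guess at what Word might emit for indentation, not something confirmed by the question.

```csharp
using System;
using System.Collections.Generic;

public static class NonAsciiReport
{
    // Returns each distinct non-ASCII character in the text with its occurrence count.
    public static SortedDictionary<char, int> Count(string text)
    {
        var counts = new SortedDictionary<char, int>();
        foreach (char c in text)
        {
            if (c <= 0x7F) continue;      // skip plain ASCII
            int n;
            counts.TryGetValue(c, out n); // n stays 0 when the key is new
            counts[c] = n + 1;
        }
        return counts;
    }

    public static void Main()
    {
        // Hypothetical sample: curly quotes around a word, then two no-break spaces.
        string sowText = "\u201CScope\u201D\u00A0\u00A0of work";
        foreach (KeyValuePair<char, int> entry in Count(sowText))
            Console.WriteLine("U+{0:X4} x{1}", (int)entry.Key, entry.Value);
    }
}
```

Running Count over sowText from the upload handler would show which code points are actually present, which is more reliable than guessing from how they render.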

FileUpload server control and Unicode characters
