Question

I am a bit confused about file encoding. I want to change it. Here is my code:

<!DOCTYPE html>
<html>
    <head>
        <script>
            window.onload = function() {
                document.getElementById('id').innerHTML = "test";
            };
        </script>

    </head>

    <body>
        <div id="id" >blub</div>
    </body>
</html>

In Program.cs I have this code:

public class ChangeFileEncoding
    {
        private const int BUFFER_SIZE = 15000;

        public static void ChangeEncoding(string source, Encoding destinationEncoding)
        {
            var currentEncoding = GetFileEncoding(source);
            string destination = Path.GetDirectoryName(source) +@"\"+ Guid.NewGuid().ToString() + Path.GetExtension(source);
            using (var reader = new StreamReader(source, currentEncoding))
            {
                using (var writer =new StreamWriter(File.OpenWrite(destination),destinationEncoding ))
                {
                    char[] buffer = new char[BUFFER_SIZE];
                    int charsRead;
                    while ((charsRead = reader.Read(buffer, 0, buffer.Length)) > 0)
                    {
                        writer.Write(buffer, 0, charsRead);                        
                    }
                }
            }
            File.Delete(source);
            File.Move(destination, source);
        }

        public static Encoding GetFileEncoding(string srcFile)
        {
            using (var reader = new StreamReader(srcFile))
            {
                reader.Peek();
                return reader.CurrentEncoding;
            }
        }
    }

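The text printed in my console is:

Unicode (UTF-8)

Unicode (UTF-8)

Why was the file's encoding not changed? Am I doing something wrong when changing the file's encoding?

Regards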

Answer 1

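The StreamReader class, when not passed an Encoding in its constructor, will try to detect the file's encoding automatically. It does this just fine when the file starts with a BOM (and you should write the preamble when changing a file's encoding, so that detection works the next time you want to read the file).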
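As a minimal sketch of the preamble advice (the class and method names here are made up for illustration, not taken from the question): a StreamWriter that creates a new file emits the encoding's preamble, provided the Encoding instance is configured to produce one. Encoding.UTF8 and Encoding.Unicode are; new UTF8Encoding(false) is not.

```csharp
using System.IO;
using System.Text;

public static class PreambleSketch
{
    // Writes text with the given encoding. StreamWriter emits the encoding's
    // preamble (BOM) at the start of a new file when the encoding defines one.
    public static void WriteWithPreamble(string path, string text, Encoding encoding)
    {
        using (var writer = new StreamWriter(path, false, encoding))
        {
            writer.Write(text);
        }
    }

    // Lets StreamReader look for a BOM. CurrentEncoding is only meaningful
    // after the first read, hence the Peek() call.
    public static Encoding DetectFromBom(string path)
    {
        using (var reader = new StreamReader(path, Encoding.UTF8, true))
        {
            reader.Peek();
            return reader.CurrentEncoding;
        }
    }
}
```

Calling WriteWithPreamble(path, "test", Encoding.Unicode) and then DetectFromBom(path) should report UTF-16, because the file now starts with the bytes FF FE.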

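Correctly detecting the encoding of a text file is a hard problem, especially for non-Unicode files or Unicode files without a BOM. The reader (whether StreamReader, Notepad++ or any other reader) has to guess which encoding the file uses.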

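See also How can I detect the encoding/codepage of a text file, emphasis mine:

You can't detect the codepage, you need to be told it. You can analyse the bytes and guess it, but that can give some bizarre (sometimes amusing) results.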

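Because ASCII (characters 0-127) is a subset of Unicode, it is safe to read an ASCII file as UTF-8, a Unicode encoding that stores each of those characters as the same single byte. StreamReader therefore uses that encoding.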

That is, as long as it is real ASCII. Any character above code point 127 will be ANSI, and then you are left with the fun of guessing the right code page.
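A small sketch can make the 0-127 boundary concrete. The class name is hypothetical, and ISO-8859-1 (Latin-1) in the usage note is just one example code page picked for illustration; the same byte would decode differently under other code pages.

```csharp
using System.Linq;
using System.Text;

public static class AsciiSubsetSketch
{
    // True when the text encodes to the same bytes in ASCII and UTF-8,
    // which holds for text made only of characters 0-127.
    public static bool SameBytesInAsciiAndUtf8(string text)
    {
        byte[] ascii = Encoding.ASCII.GetBytes(text);
        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        return ascii.SequenceEqual(utf8);
    }

    // Decodes a single byte under a caller-chosen assumption about the encoding.
    public static string DecodeAs(byte value, Encoding encoding)
    {
        return encoding.GetString(new[] { value });
    }
}
```

For "test" the two byte sequences match. The single byte 0xE9 decodes to "é" under Encoding.GetEncoding("iso-8859-1") but to the replacement character U+FFFD under Encoding.UTF8, because on its own it is not valid UTF-8.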

So to answer your question: you have changed the file's encoding. There is simply no foolproof way to "detect" it; you can only guess it.
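Since the source encoding has to be known rather than detected, one option is to have the caller state it. The following is only a sketch under that assumption: the Reencode helper is hypothetical, and it loads the whole file into memory, so it suits small files. Note that File.ReadAllText still honors a BOM if the file has one.

```csharp
using System.IO;
using System.Text;

public static class ReencodeSketch
{
    // Re-encodes a file in place. The caller states the source encoding
    // instead of relying on detection.
    public static void Reencode(string path, Encoding sourceEncoding, Encoding destinationEncoding)
    {
        string text = File.ReadAllText(path, sourceEncoding);
        // WriteAllText emits the destination encoding's preamble if it defines one.
        File.WriteAllText(path, text, destinationEncoding);
    }
}
```

Passing Encoding.UTF8 as the destination writes a BOM; passing new UTF8Encoding(false) does not, which leaves the next reader guessing again.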

Required reading: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) and Unicode, UTF, ASCII, ANSI format differences.

Answer 2

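Detecting the encoding with StreamReader.CurrentEncoding is a bit tricky, because it does not tell you which encoding the file uses, only which encoding the StreamReader needs in order to read it. Basically, if there is no BOM, there is no easy way to detect the encoding without reading the whole file (and analyzing what you find there, which is not trivial).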
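To give a flavor of what "analyzing what you find there" can mean, here is one possible heuristic, sketched with a hypothetical helper name: decode the whole file with a strict UTF-8 decoder and see whether it throws. This is a guess, not a detection. Pure ASCII passes as well, and a file in some other encoding can pass by coincidence.

```csharp
using System.IO;
using System.Text;

public static class Utf8GuessSketch
{
    // Returns true when every byte of the file decodes as well-formed UTF-8.
    public static bool LooksLikeUtf8(string path)
    {
        // The second argument (throwOnInvalidBytes) makes the decoder throw
        // instead of silently substituting U+FFFD for invalid sequences.
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            strictUtf8.GetString(File.ReadAllBytes(path));
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

A false result rules UTF-8 out; a true result only says UTF-8 is plausible.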

For files with a BOM, it is easy:

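// Requires: using System.IO; using System.Text;
public static Encoding GetFileEncoding(string srcFile)
{
   // Read up to the first four bytes; "read" guards against files shorter than a BOM.
   var bom = new byte[4];
   int read;
   using (var f = new FileStream(srcFile, FileMode.Open, FileAccess.Read))
     read = f.Read(bom, 0, 4);

   if (read >= 3 && bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76) return Encoding.UTF7;
   if (read >= 3 && bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) return Encoding.UTF8;
   // UTF-32 LE (FF FE 00 00) must be tested before UTF-16 LE (FF FE), which is its prefix.
   if (read == 4 && bom[0] == 0xff && bom[1] == 0xfe && bom[2] == 0 && bom[3] == 0) return Encoding.UTF32;
   if (read >= 2 && bom[0] == 0xff && bom[1] == 0xfe) return Encoding.Unicode;
   if (read >= 2 && bom[0] == 0xfe && bom[1] == 0xff) return Encoding.BigEndianUnicode;
   // 00 00 FE FF is the big-endian UTF-32 BOM; Encoding.UTF32 is little-endian.
   if (read == 4 && bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff) return new UTF32Encoding(true, true);
   // No BOM, so you choose what to return... the usual would be returning UTF8 or ASCII
   return Encoding.UTF8;
}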

C#: get and change file encoding
