查找任何文件编码的有效方法

时间:2010-09-29 20:02:43

标签: c# encoding

是的,这是一个最常见的问题,这个问题对我来说很模糊,因为我不太了解它。

但我想要一种非常精确的方法来查找文件编码。 如Notepad ++那么精确。

10 个答案:

答案 0 :(得分:126)

StreamReader.CurrentEncoding属性很少为我返回正确的文本文件编码。通过分析字节顺序标记(BOM),我在确定文件字节序方面取得了更大的成功:

/// <summary>
/// Determines a text file's encoding by analyzing its byte order mark (BOM).
/// Defaults to ASCII when detection of the text file's endianness fails.
/// </summary>
/// <param name="filename">The text file to analyze.</param>
/// <returns>The detected encoding.</returns>
public static Encoding GetEncoding(string filename)
{
    // Read the BOM
    var bom = new byte[4];
    using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
    {
        file.Read(bom, 0, 4);
    }

    // Analyze the BOM
    if (bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76) return Encoding.UTF7;
    if (bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) return Encoding.UTF8;
    if (bom[0] == 0xff && bom[1] == 0xfe) return Encoding.Unicode; //UTF-16LE
    if (bom[0] == 0xfe && bom[1] == 0xff) return Encoding.BigEndianUnicode; //UTF-16BE
    if (bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff) return Encoding.UTF32;
    return Encoding.ASCII;
}

作为旁注,您可能希望修改此方法的最后一行以返回Encoding.Default,因此默认情况下会返回操作系统当前ANSI代码页的编码。

答案 1 :(得分:39)

以下代码适用于我,使用StreamReader类:

  using (var reader = new StreamReader(fileName, defaultEncodingIfNoBom, true))
  {
      reader.Peek(); // you need this!
      var encoding = reader.CurrentEncoding;
  }

诀窍是使用Peek调用,否则,.NET没有做任何事情(并且它没有读取前导码,BOM)。当然,如果您在检查编码之前使用任何其他ReadXXX调用,它也会起作用。

如果文件没有BOM,则将使用defaultEncodingIfNoBom编码。还有一个不带此重载方法的StreamReader(在这种情况下,默认(ANSI)编码将用作defaultEncodingIfNoBom),但我建议您在上下文中定义您认为的默认编码。

我已经成功测试了具有UTF8,UTF16 / Unicode(LE&amp; BE)和UTF32(LE&amp; BE)的BOM的文件。它不适用于UTF7。

答案 2 :(得分:11)

我会尝试以下步骤:

1)检查是否有字节顺序标记

2)检查文件是否有效UTF8

3)使用本地“ANSI”代码页(Microsoft定义的ANSI)

步骤2有效,因为大多数非ASCII序列在其他UTF8无效UTF8的代码页中。

答案 3 :(得分:5)

检查一下。

UDE

这是Mozilla Universal Charset Detector的一个端口,您可以像这样使用它......

public static void Main(String[] args)
{
    string filename = args[0];
    using (FileStream fs = File.OpenRead(filename)) {
        Ude.CharsetDetector cdet = new Ude.CharsetDetector();
        cdet.Feed(fs);
        cdet.DataEnd();
        if (cdet.Charset != null) {
            Console.WriteLine("Charset: {0}, confidence: {1}", 
                 cdet.Charset, cdet.Confidence);
        } else {
            Console.WriteLine("Detection failed.");
        }
    }
}

答案 4 :(得分:1)

在这里查看c#

https://msdn.microsoft.com/en-us/library/system.io.streamreader.currentencoding%28v=vs.110%29.aspx

string path = @"path\to\your\file.ext";

using (StreamReader sr = new StreamReader(path, true))
{
    while (sr.Peek() >= 0)
    {
        Console.Write((char)sr.Read());
    }

    //Test for the encoding after reading, or at least
    //after the first read.
    Console.WriteLine("The encoding used was {0}.", sr.CurrentEncoding);
    Console.ReadLine();
    Console.WriteLine();
}

答案 5 :(得分:1)

以下代码是我的Powershell代码,用于确定某些cpp或h或ml文件是使用ISO-8859-1(Latin-1)编码还是使用没有BOM的UTF-8编码,如果两者都不认为它是GB18030。我是一名在法国工作的中国人,MSVC在法语计算机上保存为Latin-1,并在中文计算机上保存为GB,这样可以帮助我在系统和同事之间进行源文件交换时避免编码问题。

方法很简单,如果所有字符都在x00-x7E之间,ASCII,UTF-8和Latin-1都是相同的,但如果我用UTF-8读取非ASCII文件,我们会找到特殊字符 出现,所以尝试阅读Latin-1。在Latin-1中,\ x7F和\ xAF之间是空的,而GB在x00-xFF之间使用完全,所以如果我在两者之间有任何一个,那么它不是Latin-1

代码是用PowerShell编写的,但使用.net,因此很容易被翻译成C#或F#

$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding($False)
foreach($i in Get-ChildItem .\ -Recurse -include *.cpp,*.h, *.ml) {
    $openUTF = New-Object System.IO.StreamReader -ArgumentList ($i, [Text.Encoding]::UTF8)
    $contentUTF = $openUTF.ReadToEnd()
    [regex]$regex = '�'
    $c=$regex.Matches($contentUTF).count
    $openUTF.Close()
    if ($c -ne 0) {
        $openLatin1 = New-Object System.IO.StreamReader -ArgumentList ($i, [Text.Encoding]::GetEncoding('ISO-8859-1'))
        $contentLatin1 = $openLatin1.ReadToEnd()
        $openLatin1.Close()
        [regex]$regex = '[\x7F-\xAF]'
        $c=$regex.Matches($contentLatin1).count
        if ($c -eq 0) {
            [System.IO.File]::WriteAllLines($i, $contentLatin1, $Utf8NoBomEncoding)
            $i.FullName
        } 
        else {
            $openGB = New-Object System.IO.StreamReader -ArgumentList ($i, [Text.Encoding]::GetEncoding('GB18030'))
            $contentGB = $openGB.ReadToEnd()
            $openGB.Close()
            [System.IO.File]::WriteAllLines($i, $contentGB, $Utf8NoBomEncoding)
            $i.FullName
        }
    }
}
Write-Host -NoNewLine 'Press any key to continue...';
$null = $Host.UI.RawUI.ReadKey('NoEcho,IncludeKeyDown');

答案 6 :(得分:1)

为@CodesInChaos建议的步骤提供实施细节:

1)检查是否有字节顺序标记

2)检查文件是否为有效的UTF8

3)使用本地“ ANSI”代码页(Microsoft定义为ANSI)

第2步之所以起作用,是因为除了UTF8之外,代码页中的大多数非ASCII序列都不是有效的UTF8。 https://stackoverflow.com/a/4522251/867248详细说明了该策略。

number_format($k,0,'','');

答案 7 :(得分:0)

这可能有用

string path = @"address/to/the/file.extension";

using (StreamReader sr = new StreamReader(path))
{ 
    Console.WriteLine(sr.CurrentEncoding);                        
}

答案 8 :(得分:0)

.NET并不是很有帮助,但是您可以尝试以下算法:

  1. 尝试按BOM(字节顺序标记)查找编码...很可能找不到
  2. 尝试解析为不同的编码

这是电话:

var encoding = FileHelper.GetEncoding(filePath);
if (encoding == null)
    throw new Exception("The file encoding is not supported. Please choose one of the following encodings: UTF8/UTF7/iso-8859-1");

代码如下:

public class FileHelper
{
    /// <summary>
    /// Determines a text file's encoding by analyzing its byte order mark (BOM) and if not found try parsing into diferent encodings       
    /// Defaults to UTF8 when detection of the text file's endianness fails.
    /// </summary>
    /// <param name="filename">The text file to analyze.</param>
    /// <returns>The detected encoding or null.</returns>
    public static Encoding GetEncoding(string filename)
    {
        var encodingByBOM = GetEncodingByBOM(filename);
        if (encodingByBOM != null)
            return encodingByBOM;

        // BOM not found :(, so try to parse characters into several encodings
        var encodingByParsingUTF8 = GetEncodingByParsing(filename, Encoding.UTF8);
        if (encodingByParsingUTF8 != null)
            return encodingByParsingUTF8;

        var encodingByParsingLatin1 = GetEncodingByParsing(filename, Encoding.GetEncoding("iso-8859-1"));
        if (encodingByParsingLatin1 != null)
            return encodingByParsingLatin1;

        var encodingByParsingUTF7 = GetEncodingByParsing(filename, Encoding.UTF7);
        if (encodingByParsingUTF7 != null)
            return encodingByParsingUTF7;

        return null;   // no encoding found
    }

    /// <summary>
    /// Determines a text file's encoding by analyzing its byte order mark (BOM)  
    /// </summary>
    /// <param name="filename">The text file to analyze.</param>
    /// <returns>The detected encoding.</returns>
    private static Encoding GetEncodingByBOM(string filename)
    {
        // Read the BOM
        var byteOrderMark = new byte[4];
        using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
        {
            file.Read(byteOrderMark, 0, 4);
        }

        // Analyze the BOM
        if (byteOrderMark[0] == 0x2b && byteOrderMark[1] == 0x2f && byteOrderMark[2] == 0x76) return Encoding.UTF7;
        if (byteOrderMark[0] == 0xef && byteOrderMark[1] == 0xbb && byteOrderMark[2] == 0xbf) return Encoding.UTF8;
        if (byteOrderMark[0] == 0xff && byteOrderMark[1] == 0xfe) return Encoding.Unicode; //UTF-16LE
        if (byteOrderMark[0] == 0xfe && byteOrderMark[1] == 0xff) return Encoding.BigEndianUnicode; //UTF-16BE
        if (byteOrderMark[0] == 0 && byteOrderMark[1] == 0 && byteOrderMark[2] == 0xfe && byteOrderMark[3] == 0xff) return Encoding.UTF32;

        return null;    // no BOM found
    }

    private static Encoding GetEncodingByParsing(string filename, Encoding encoding)
    {            
        var encodingVerifier = Encoding.GetEncoding(encoding.BodyName, new EncoderExceptionFallback(), new DecoderExceptionFallback());

        try
        {
            using (var textReader = new StreamReader(filename, encodingVerifier, detectEncodingFromByteOrderMarks: true))
            {
                while (!textReader.EndOfStream)
                {                        
                    textReader.ReadLine();   // in order to increment the stream position
                }

                // all text parsed ok
                return textReader.CurrentEncoding;
            }
        }
        catch (Exception ex) { }

        return null;    // 
    }
}

答案 9 :(得分:0)

这似乎运作良好。

首先创建一个辅助方法:

  private static Encoding TestCodePage(Encoding testCode, byte[] byteArray)
    {
      try
      {
        var encoding = Encoding.GetEncoding(testCode.CodePage, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        var a = encoding.GetCharCount(byteArray);
        return testCode;
      }
      catch (Exception e)
      {
        return null;
      }
    }

然后创建代码来测试源代码。在这种情况下,我有一个字节数组,我需要获取以下编码:

 public static Encoding DetectCodePage(byte[] contents)
    {
      if (contents == null || contents.Length == 0)
      {
        return Encoding.Default;
      }

      return TestCodePage(Encoding.UTF8, contents)
             ?? TestCodePage(Encoding.Unicode, contents)
             ?? TestCodePage(Encoding.BigEndianUnicode, contents)
             ?? TestCodePage(Encoding.GetEncoding(1252), contents) // Western European
             ?? TestCodePage(Encoding.GetEncoding(28591), contents) // ISO Western European
             ?? TestCodePage(Encoding.ASCII, contents)
             ?? TestCodePage(Encoding.Default, contents); // likely Unicode
    }