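Yes, this is a very common question, and the topic is still vague to me because I don't know much about it.
But I would like a very precise way to find a file's encoding, as precise as Notepad++ is.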
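Answer 0 (score: 126)
The StreamReader.CurrentEncoding property rarely returns the correct text file encoding for me. I've had greater success determining a file's endianness by analyzing its byte order mark (BOM):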
    /// <summary>
    /// Determines a text file's encoding by analyzing its byte order mark (BOM).
    /// Defaults to ASCII when no BOM is found.
    /// </summary>
    /// <param name="filename">The text file to analyze.</param>
    /// <returns>The detected encoding.</returns>
    public static Encoding GetEncoding(string filename)
    {
        // Read the BOM (the file may be shorter than 4 bytes)
        var bom = new byte[4];
        int bytesRead;
        using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
        {
            bytesRead = file.Read(bom, 0, 4);
        }

        // Analyze the BOM
        if (bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76) return Encoding.UTF7; // obsolete in .NET 5+
        if (bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) return Encoding.UTF8;
        // UTF-32LE starts with the UTF-16LE BOM, so it must be tested first
        if (bytesRead >= 4 && bom[0] == 0xff && bom[1] == 0xfe && bom[2] == 0 && bom[3] == 0) return Encoding.UTF32; //UTF-32LE
        if (bom[0] == 0xff && bom[1] == 0xfe) return Encoding.Unicode; //UTF-16LE
        if (bom[0] == 0xfe && bom[1] == 0xff) return Encoding.BigEndianUnicode; //UTF-16BE
        if (bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff) return new UTF32Encoding(true, true); //UTF-32BE
        return Encoding.ASCII;
    }
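As a side note, you may want to change the last line of this method to return Encoding.Default instead, so that the encoding of the OS's current ANSI code page is returned by default. (That is the behavior on .NET Framework; on .NET Core and later, Encoding.Default is always UTF-8.)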
Answer 1 (score: 39)
The following code works fine for me, using the StreamReader class:
    using (var reader = new StreamReader(fileName, defaultEncodingIfNoBom, true))
    {
        reader.Peek(); // you need this!
        var encoding = reader.CurrentEncoding;
    }
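The trick is the Peek call. Without it, .NET has not done anything yet (it has not read the preamble, i.e. the BOM). Of course, any other ReadXXX call made before checking the encoding works too.
If the file has no BOM, the defaultEncodingIfNoBom encoding is used. There are also StreamReader constructor overloads without this parameter (in which case UTF-8, not the ANSI code page, serves as the fallback), but I recommend defining what you consider the default encoding in your context.
I have tested this successfully with files that have a BOM for UTF8, UTF16/Unicode (LE & BE) and UTF32 (LE & BE). It does not work for UTF7.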
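To make the Peek behavior concrete, here is a minimal sketch that writes a small UTF-16LE file with a BOM and then asks StreamReader what it detected. The names PeekDemo and DetectWithStreamReader are made up for this illustration, and ASCII is an arbitrary fallback chosen only so the detected result differs from it.

```csharp
using System;
using System.IO;
using System.Text;

public static class PeekDemo
{
    // Hypothetical helper: returns the encoding StreamReader settles on
    // once it has been forced to look at the start of the file.
    public static Encoding DetectWithStreamReader(string path, Encoding fallback)
    {
        using (var reader = new StreamReader(path, fallback, true))
        {
            reader.Peek(); // fills the buffer, which is when the BOM is inspected
            return reader.CurrentEncoding;
        }
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();
        try
        {
            // Encoding.Unicode (UTF-16LE) writes its BOM at the start of the file.
            File.WriteAllText(path, "hello", Encoding.Unicode);
            Encoding detected = DetectWithStreamReader(path, Encoding.ASCII);
            Console.WriteLine(detected.WebName);
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

If you remove the Peek call from the helper, CurrentEncoding still reports the fallback you passed in, because nothing has been read yet.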
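Answer 2 (score: 11)
I'd try the following steps:
1) Check whether there is a byte order mark
2) Check whether the file is valid UTF8
3) Use the local "ANSI" code page (ANSI as Microsoft defines it)
Step 2 works because most non-ASCII sequences in code pages other than UTF8 are not valid UTF8.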
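As a rough illustration of why step 2 works, the sketch below decodes the same word from two byte sequences with a strict UTF-8 decoder. The class and method names (Utf8Check, IsValidUtf8) are invented for this example; the only library feature it relies on is the UTF8Encoding constructor's throwOnInvalidBytes flag.

```csharp
using System;
using System.Text;

public static class Utf8Check
{
    // Hypothetical helper: true when the bytes form well-formed UTF-8.
    public static bool IsValidUtf8(byte[] bytes)
    {
        // With throwOnInvalidBytes set, malformed input raises an exception
        // instead of being silently replaced with U+FFFD.
        var strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetCharCount(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        byte[] utf8 = { 0x63, 0x61, 0x66, 0xC3, 0xA9 }; // "café" encoded as UTF-8
        byte[] latin1 = { 0x63, 0x61, 0x66, 0xE9 };     // "café" encoded as ISO-8859-1
        Console.WriteLine(IsValidUtf8(utf8));   // True
        Console.WriteLine(IsValidUtf8(latin1)); // False
    }
}
```

The lone 0xE9 byte announces a three-byte UTF-8 sequence that never arrives, which is typical of text saved in a single-byte code page. Pure ASCII input passes the check as well, so this test alone cannot tell ASCII from UTF-8.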
Answer 3 (score: 5)
Check this out.
This is a port of the Mozilla Universal Charset Detector (the Ude library), and you can use it like this:
    public static void Main(String[] args)
    {
        string filename = args[0];
        using (FileStream fs = File.OpenRead(filename)) {
            Ude.CharsetDetector cdet = new Ude.CharsetDetector();
            cdet.Feed(fs);
            cdet.DataEnd();
            if (cdet.Charset != null) {
                Console.WriteLine("Charset: {0}, confidence: {1}",
                    cdet.Charset, cdet.Confidence);
            } else {
                Console.WriteLine("Detection failed.");
            }
        }
    }
Answer 4 (score: 1)
See here for C#:
https://msdn.microsoft.com/en-us/library/system.io.streamreader.currentencoding%28v=vs.110%29.aspx
    string path = @"path\to\your\file.ext";
    using (StreamReader sr = new StreamReader(path, true))
    {
        while (sr.Peek() >= 0)
        {
            Console.Write((char)sr.Read());
        }
        //Test for the encoding after reading, or at least
        //after the first read.
        Console.WriteLine("The encoding used was {0}.", sr.CurrentEncoding);
        Console.ReadLine();
        Console.WriteLine();
    }
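Answer 5 (score: 1)
The following is my PowerShell code for determining whether some cpp, h, or ml files are encoded in ISO-8859-1 (Latin-1) or in UTF-8 without a BOM; if it is neither, the file is assumed to be GB18030. I am a Chinese developer working in France. MSVC saves files as Latin-1 on a French computer and as GB on a Chinese computer, so this helps me avoid encoding problems when exchanging source files between my systems and my colleagues.
The approach is simple. If all characters are in the range x00-x7E, then ASCII, UTF-8 and Latin-1 are identical. If I read a non-ASCII file as UTF-8 and it is not UTF-8, the special character � shows up, so I then try reading it as Latin-1. In Latin-1, the range between \x7F and \xAF is empty, whereas GB uses the full range x00-xFF, so if any character falls in that range, the file is not Latin-1.
The code is written in PowerShell but uses .NET, so it is easy to translate to C# or F#.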
    $Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding($False)
    foreach($i in Get-ChildItem .\ -Recurse -include *.cpp,*.h, *.ml) {
        $openUTF = New-Object System.IO.StreamReader -ArgumentList ($i, [Text.Encoding]::UTF8)
        $contentUTF = $openUTF.ReadToEnd()
        [regex]$regex = '�'
        $c=$regex.Matches($contentUTF).count
        $openUTF.Close()
        if ($c -ne 0) {
            $openLatin1 = New-Object System.IO.StreamReader -ArgumentList ($i, [Text.Encoding]::GetEncoding('ISO-8859-1'))
            $contentLatin1 = $openLatin1.ReadToEnd()
            $openLatin1.Close()
            [regex]$regex = '[\x7F-\xAF]'
            $c=$regex.Matches($contentLatin1).count
            if ($c -eq 0) {
                [System.IO.File]::WriteAllLines($i, $contentLatin1, $Utf8NoBomEncoding)
                $i.FullName
            }
            else {
                $openGB = New-Object System.IO.StreamReader -ArgumentList ($i, [Text.Encoding]::GetEncoding('GB18030'))
                $contentGB = $openGB.ReadToEnd()
                $openGB.Close()
                [System.IO.File]::WriteAllLines($i, $contentGB, $Utf8NoBomEncoding)
                $i.FullName
            }
        }
    }
    Write-Host -NoNewLine 'Press any key to continue...';
    $null = $Host.UI.RawUI.ReadKey('NoEcho,IncludeKeyDown');
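Answer 6 (score: 1)
Providing implementation details for the steps proposed by @CodesInChaos:
1) Check whether there is a byte order mark
2) Check whether the file is valid UTF8
3) Use the local "ANSI" code page (ANSI as Microsoft defines it)
Step 2 works because most non-ASCII sequences in code pages other than UTF8 are not valid UTF8. https://stackoverflow.com/a/4522251/867248 explains the strategy in more detail.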
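The three steps can be sketched roughly as follows. This is not the code originally posted with this answer; EncodingSniffer, Detect, StartsWith and the ansiFallback parameter are names invented for the illustration, and ISO-8859-1 stands in for the local "ANSI" code page only because it is available on every .NET runtime.

```csharp
using System;
using System.IO;
using System.Text;

public static class EncodingSniffer
{
    // Hypothetical helper: BOM first, then strict UTF-8, then the caller's fallback.
    public static Encoding Detect(byte[] data, Encoding ansiFallback)
    {
        // Step 1: byte order mark. The UTF-32LE BOM begins with the
        // UTF-16LE BOM, so the 4-byte patterns are tested first.
        if (StartsWith(data, 0xEF, 0xBB, 0xBF)) return Encoding.UTF8;
        if (StartsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return Encoding.UTF32;
        if (StartsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return new UTF32Encoding(true, true);
        if (StartsWith(data, 0xFF, 0xFE)) return Encoding.Unicode;
        if (StartsWith(data, 0xFE, 0xFF)) return Encoding.BigEndianUnicode;

        // Step 2: does the whole buffer decode as well-formed UTF-8?
        try
        {
            new UTF8Encoding(false, true).GetCharCount(data);
            return new UTF8Encoding(false); // UTF-8 without BOM
        }
        catch (DecoderFallbackException)
        {
            // Not UTF-8; fall through to step 3.
        }

        // Step 3: assume the caller's "ANSI" code page.
        return ansiFallback;
    }

    private static bool StartsWith(byte[] data, params byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    public static void Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.WriteLine("usage: pass the path of the file to inspect");
            return;
        }
        // Assumption for this sketch: ISO-8859-1 plays the role of the ANSI code page.
        Encoding fallback = Encoding.GetEncoding("iso-8859-1");
        Console.WriteLine(Detect(File.ReadAllBytes(args[0]), fallback).WebName);
    }
}
```

Reading the whole file into memory keeps the sketch short; for large files you would probably validate a leading chunk instead. Note that a pure ASCII file is reported as UTF-8, which is harmless because the two encodings agree on those bytes.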
Answer 7 (score: 0)
This may be useful:
    string path = @"address/to/the/file.extension";
    using (StreamReader sr = new StreamReader(path))
    {
        sr.Peek(); // read something first, otherwise the BOM has not been examined yet
        Console.WriteLine(sr.CurrentEncoding);
    }
Answer 8 (score: 0)
.NET is not very helpful here, but you can try the following algorithm.
Here is the call:
    var encoding = FileHelper.GetEncoding(filePath);
    if (encoding == null)
        throw new Exception("The file encoding is not supported. Please choose one of the following encodings: UTF8/UTF7/iso-8859-1");
Here is the code:
    public class FileHelper
    {
        /// <summary>
        /// Determines a text file's encoding by analyzing its byte order mark (BOM) and, if none is found, by trying to parse it with different encodings.
        /// Returns null when no candidate encoding decodes the file cleanly.
        /// </summary>
        /// <param name="filename">The text file to analyze.</param>
        /// <returns>The detected encoding or null.</returns>
        public static Encoding GetEncoding(string filename)
        {
            var encodingByBOM = GetEncodingByBOM(filename);
            if (encodingByBOM != null)
                return encodingByBOM;

            // BOM not found :(, so try to parse characters into several encodings
            var encodingByParsingUTF8 = GetEncodingByParsing(filename, Encoding.UTF8);
            if (encodingByParsingUTF8 != null)
                return encodingByParsingUTF8;

            var encodingByParsingLatin1 = GetEncodingByParsing(filename, Encoding.GetEncoding("iso-8859-1"));
            if (encodingByParsingLatin1 != null)
                return encodingByParsingLatin1;

            var encodingByParsingUTF7 = GetEncodingByParsing(filename, Encoding.UTF7);
            if (encodingByParsingUTF7 != null)
                return encodingByParsingUTF7;

            return null; // no encoding found
        }

        /// <summary>
        /// Determines a text file's encoding by analyzing its byte order mark (BOM)
        /// </summary>
        /// <param name="filename">The text file to analyze.</param>
        /// <returns>The detected encoding, or null when there is no BOM.</returns>
        private static Encoding GetEncodingByBOM(string filename)
        {
            // Read the BOM (the file may be shorter than 4 bytes)
            var byteOrderMark = new byte[4];
            int bytesRead;
            using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
            {
                bytesRead = file.Read(byteOrderMark, 0, 4);
            }

            // Analyze the BOM
            if (byteOrderMark[0] == 0x2b && byteOrderMark[1] == 0x2f && byteOrderMark[2] == 0x76) return Encoding.UTF7;
            if (byteOrderMark[0] == 0xef && byteOrderMark[1] == 0xbb && byteOrderMark[2] == 0xbf) return Encoding.UTF8;
            // UTF-32LE starts with the UTF-16LE BOM, so it must be tested first
            if (bytesRead >= 4 && byteOrderMark[0] == 0xff && byteOrderMark[1] == 0xfe && byteOrderMark[2] == 0 && byteOrderMark[3] == 0) return Encoding.UTF32; //UTF-32LE
            if (byteOrderMark[0] == 0xff && byteOrderMark[1] == 0xfe) return Encoding.Unicode; //UTF-16LE
            if (byteOrderMark[0] == 0xfe && byteOrderMark[1] == 0xff) return Encoding.BigEndianUnicode; //UTF-16BE
            if (byteOrderMark[0] == 0 && byteOrderMark[1] == 0 && byteOrderMark[2] == 0xfe && byteOrderMark[3] == 0xff) return new UTF32Encoding(true, true); //UTF-32BE

            return null; // no BOM found
        }

        private static Encoding GetEncodingByParsing(string filename, Encoding encoding)
        {
            // Same encoding, but configured to throw on invalid bytes instead of replacing them
            var encodingVerifier = Encoding.GetEncoding(encoding.BodyName, new EncoderExceptionFallback(), new DecoderExceptionFallback());
            try
            {
                using (var textReader = new StreamReader(filename, encodingVerifier, detectEncodingFromByteOrderMarks: true))
                {
                    while (!textReader.EndOfStream)
                    {
                        textReader.ReadLine(); // in order to increment the stream position
                    }
                    // all text parsed ok
                    return textReader.CurrentEncoding;
                }
            }
            catch (DecoderFallbackException)
            {
                // the file contains bytes that are invalid in this encoding
            }
            return null;
        }
    }
Answer 9 (score: 0)
This seems to work well.
First create a helper method:
    private static Encoding TestCodePage(Encoding testCode, byte[] byteArray)
    {
        try
        {
            // Same code page, but configured to throw on invalid bytes instead of replacing them
            var encoding = Encoding.GetEncoding(testCode.CodePage, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
            encoding.GetCharCount(byteArray); // throws if the bytes are not valid in this encoding
            return testCode;
        }
        catch (Exception)
        {
            return null;
        }
    }
Then create the code that tests the source. In this case, I had a byte array whose encoding I needed to determine:
    public static Encoding DetectCodePage(byte[] contents)
    {
        if (contents == null || contents.Length == 0)
        {
            return Encoding.Default;
        }
        return TestCodePage(Encoding.UTF8, contents)
            ?? TestCodePage(Encoding.Unicode, contents)
            ?? TestCodePage(Encoding.BigEndianUnicode, contents)
            ?? TestCodePage(Encoding.GetEncoding(1252), contents) // Western European
            ?? TestCodePage(Encoding.GetEncoding(28591), contents) // ISO Western European
            ?? TestCodePage(Encoding.ASCII, contents)
            ?? TestCodePage(Encoding.Default, contents); // likely Unicode
    }