How to convert values stored in a database as ANSI (Windows-1252) to UTF-8

Asked: 2015-10-09 08:31:10

Tags: c# sqlite encoding utf-8

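When I open an old database in a SQLite browser, the text is displayed incorrectly. The only encodings I can select are UTF-8 and UTF-16. [Screenshot: SQLite browser with umlaut]

When I query the database, the encoding is already wrong in Visual Studio. [Screenshot: Visual Studio locals]

I assume the text is encoded as ANSI (Windows-1252) (confirmed in the comments). I tried converting it to UTF-8: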

        var encoding = Encoding.GetEncoding(1252);
        byte[] encBytes = encoding.GetBytes(result);
        byte[] utf8Bytes = Encoding.Convert(encoding, Encoding.UTF8, encBytes);
        return Encoding.UTF8.GetString(utf8Bytes);

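But now the question-mark symbol has simply become a plain question mark. [Screenshot: still wrong]

Somehow an external legacy application displays the text correctly, so there seems to be a way. But I'm not sure what I can try next.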
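For reference, a minimal sketch of why the attempt above cannot repair anything: encoding a string with a code page, converting those bytes to UTF-8, and decoding them as UTF-8 hands back the same string. `RoundTripDemo` and `RoundTrip` are hypothetical names, and the sketch uses ISO-8859-1 (28591) instead of 1252 only because that encoding is available on every .NET runtime without extra setup.

```csharp
using System;
using System.Text;

public static class RoundTripDemo
{
    // Mirrors the attempt above: string -> legacy bytes -> UTF-8 bytes -> string.
    public static string RoundTrip(string text, Encoding legacy)
    {
        byte[] legacyBytes = legacy.GetBytes(text);
        byte[] utf8Bytes = Encoding.Convert(legacy, Encoding.UTF8, legacyBytes);
        return Encoding.UTF8.GetString(utf8Bytes);
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding(28591);

        // An already damaged string comes back exactly as it went in.
        string alreadyBroken = "Alu-DreieckstÃ¼tze";
        Console.WriteLine(RoundTrip(alreadyBroken, latin1) == alreadyBroken); // True

        // A character the legacy code page cannot represent, such as the
        // replacement character U+FFFD, is turned into a plain '?'.
        Console.WriteLine(RoundTrip("\uFFFD", latin1)); // ?
    }
}
```

The second call is a likely explanation for the plain question mark: the string read from the database already contained U+FFFD, and the legacy encoder replaced it with `?`.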

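3 Answers:

Answer 0: (score: 3)

I once had the same problem.

Jon Skeet answered it here

Basically, take the string, get its bytes using the encoding it was wrongly decoded with, and then decode those bytes with the encoding the text is really in: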

string broken = "Brokers MÃ©xico, Intermediario de Aseguro,S.A."; // Get text from database
byte[] encoded = Encoding.GetEncoding(28591).GetBytes(broken);
string corrected = Encoding.UTF8.GetString(encoded);

So in your case it should simply be

string broken = "Whatever";
byte[] encoded = Encoding.GetEncoding(1252).GetBytes(broken);
string corrected = Encoding.UTF8.GetString(encoded);
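To make the idea concrete, here is a self-contained sketch that first produces the damage (UTF-8 bytes decoded as ISO-8859-1) and then undoes it. `MojibakeDemo` and `Repair` are hypothetical names. The sketch assumes ISO-8859-1 (28591) as the wrongly applied encoding because it maps every byte to a character and so loses nothing. Windows-1252 leaves a few byte values undefined, so the same round trip is not guaranteed to be lossless there.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Re-encode with the encoding that was wrongly applied,
    // then decode with the encoding the bytes are really in.
    public static string Repair(string broken, Encoding wronglyApplied, Encoding actual)
    {
        byte[] raw = wronglyApplied.GetBytes(broken);
        return actual.GetString(raw);
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding(28591);

        // Simulate the damage: UTF-8 bytes read as if they were ISO-8859-1.
        string original = "Alu-Dreieckstütze";
        string broken = latin1.GetString(Encoding.UTF8.GetBytes(original));

        string repaired = Repair(broken, latin1, Encoding.UTF8);
        Console.WriteLine(repaired == original); // True
    }
}
```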

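Since you know that the legacy program converts the text back correctly, I would experiment with the encodings listed here:
https://msdn.microsoft.com/en-us/library/system.text.encodinginfo.getencoding(v=vs.110).aspx
(Just write a program that tests every combination listed there and see which pair produces a match...)

If you know the source text, you can even automate the check: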

using System;
using System.Text;
using System.Windows.Forms;

public partial class Form1 : Form
{
    public System.Data.DataTable dt;

    public Form1()
    {
        InitializeComponent();
    }




    private void btnTest_Click(object sender, EventArgs e)
    {
        dt = new System.Data.DataTable();

        string correct = "Brokers México, Intermediario de Aseguro,S.A.";

        string broken = "Brokers MÃ©xico, Intermediario de Aseguro,S.A."; // Get text from database

        dt.Columns.Add("SourceEncoding", typeof(string));
        dt.Columns.Add("TargetEncoding", typeof(string));
        dt.Columns.Add("Result", typeof(string));
        dt.Columns.Add("SourceEncodingName", typeof(string));
        dt.Columns.Add("TargetEncodingName", typeof(string));

        // For reference
        // https://msdn.microsoft.com/en-us/library/system.text.encodinginfo.getencoding(v=vs.110).aspx
        int[] encs = new int[] { 
             20127 // US-ASCII
            ,28591 // iso-8859-1 Western European (ISO)       
            ,28592 // iso-8859-2 Central European (ISO)       
            ,28593 // iso-8859-3 Latin 3 (ISO)
            ,28594 // iso-8859-4 Baltic (ISO)
            ,28595 // iso-8859-5 Cyrillic (ISO)
            ,28596 // iso-8859-6 Arabic (ISO)
            ,28597 // iso-8859-7 Greek (ISO)
            ,28598 // iso-8859-8 Hebrew (ISO-Visual)          
            ,28599 // iso-8859-9 Turkish (ISO)
            ,28603 // iso-8859-13 Estonian (ISO)
            ,28605 // iso-8859-15 Latin 9 (ISO)   

            ,1250 // windows-1250 Central European (Windows)      
            ,1251 // windows-1251 Cyrillic (Windows)             
            ,1252 // Windows-1252 Western European (Windows)      
            ,1253 // windows-1253 Greek (Windows)                
            ,1254 // windows-1254 Turkish (Windows)              
            ,1255 // windows-1255 Hebrew (Windows)               
            ,1256 // windows-1256 Arabic (Windows)               
            ,1257 // windows-1257 Baltic (Windows)               
            ,1258 // windows-1258 Vietnamese (Windows)

            ,20866 // Cyrillic (KOI8-R)
            ,21866 // Cyrillic (KOI8-U)  

            ,65000 // UTF-7
            ,65001 // UTF-8
            ,1200 // UTF-16
            ,1201 // Unicode (Big-Endian)    

            ,12000 // UTF-32
            ,12001 // UTF-32BE (UTF-32 Big-Endian) 
        };


        for (int i = 0; i < encs.Length; ++i)
        {

            for (int j = 0; j < encs.Length; ++j)
            {
                System.Data.DataRow dr = dt.NewRow();

                dr["SourceEncoding"] = encs[i];
                dr["TargetEncoding"] = encs[j];


                System.Text.Encoding enci = Encoding.GetEncoding(encs[i]);
                System.Text.Encoding encj = Encoding.GetEncoding(encs[j]);

                byte[] encoded = enci.GetBytes(broken);
                string corrected = encj.GetString(encoded);

                dr["Result"] = corrected;

                dr["SourceEncodingName"] = enci.BodyName;
                dr["TargetEncodingName"] = encj.BodyName;


                if (StringComparer.InvariantCultureIgnoreCase.Equals(correct, corrected))
                    dt.Rows.Add(dr);
            }

        }

        this.dataGridView1.DataSource = dt;
    }
}

Or, to be even more thorough, simply test all encodings:

private void btnTestAll_Click(object sender, EventArgs e)
{
    dt = new System.Data.DataTable();

    string correct = "Brokers México, Intermediario de Aseguro,S.A.";

    string broken = "Brokers MÃ©xico, Intermediario de Aseguro,S.A."; // Get text from database

    dt.Columns.Add("SourceEncoding", typeof(string));
    dt.Columns.Add("TargetEncoding", typeof(string));
    dt.Columns.Add("Result", typeof(string));
    dt.Columns.Add("SourceEncodingName", typeof(string));
    dt.Columns.Add("TargetEncodingName", typeof(string));



    System.Text.EncodingInfo[] encs = System.Text.Encoding.GetEncodings();

    for (int i = 0; i < encs.Length; ++i)
    {

        for (int j = 0; j < encs.Length; ++j)
        {
            System.Data.DataRow dr = dt.NewRow();

            dr["SourceEncoding"] = encs[i].CodePage;
            dr["TargetEncoding"] = encs[j].CodePage;


            System.Text.Encoding enci = System.Text.Encoding.GetEncoding(encs[i].CodePage);
            System.Text.Encoding encj = System.Text.Encoding.GetEncoding(encs[j].CodePage);

            byte[] encoded = enci.GetBytes(broken);
            string corrected = encj.GetString(encoded);

            dr["Result"] = corrected;

            dr["SourceEncodingName"] = enci.BodyName;
            dr["TargetEncodingName"] = encj.BodyName;


            if (StringComparer.InvariantCultureIgnoreCase.Equals(correct, corrected))
                dt.Rows.Add(dr);
        }

    }

    this.dataGridView1.DataSource = dt;
}

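You can download the results here

Strangely, it looks like you can get from German/ANSI (or ISO-8859-1) to ASCII, but there is no way to convert it back (information is lost)...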
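The same search can be sketched without a form or a DataTable. `EncodingPairSearch` and `FindPairs` are hypothetical names, and the candidate list is limited to code pages that every .NET runtime ships with, which is an assumption made so the sketch runs anywhere. It compares with `StringComparison.Ordinal`, so only an exact match counts as repaired.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class EncodingPairSearch
{
    // Returns every (source, target) code page pair that turns `broken` into `correct`.
    public static List<(int Source, int Target)> FindPairs(
        string broken, string correct, IEnumerable<int> codePages)
    {
        var encodings = new List<Encoding>();
        foreach (int codePage in codePages)
            encodings.Add(Encoding.GetEncoding(codePage));

        var matches = new List<(int Source, int Target)>();
        foreach (Encoding source in encodings)
        {
            // Bytes the broken string would have had under the source encoding.
            byte[] raw = source.GetBytes(broken);
            foreach (Encoding target in encodings)
            {
                if (string.Equals(target.GetString(raw), correct, StringComparison.Ordinal))
                    matches.Add((source.CodePage, target.CodePage));
            }
        }
        return matches;
    }

    public static void Main()
    {
        // US-ASCII, ISO-8859-1, UTF-8, UTF-16; extend the list as needed.
        int[] candidates = { 20127, 28591, 65001, 1200 };
        foreach (var pair in FindPairs("Brokers MÃ©xico", "Brokers México", candidates))
            Console.WriteLine($"{pair.Source} -> {pair.Target}");
    }
}
```

With these four candidates the only pair that matches is 28591 -> 65001, which agrees with the fix shown at the top of this answer.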

public static string lol()
{
    string source = "Alu-Dreieckstütze";

    // System.Text.Encoding encSource = System.Text.Encoding.Default;
    System.Text.Encoding encSource = System.Text.Encoding.GetEncoding(28591);
    System.Text.Encoding encTarget = System.Text.Encoding.ASCII;

    byte[] encoded = encSource.GetBytes(source);
    string broken = encTarget.GetString(encoded);

    return broken;
}

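Interestingly, since the legacy application displays the text correctly, no information can have been lost.

So are you sure you didn't specify the wrong encoding (or none at all) in the SQLite connection string?

e.g.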

  "Data Source=C:\\Users\\USERNAME\\Desktop\\location.db; Version=3; UseUTF16Encoding=True;Synchronous=Normal;New=False"; // set up the connection string

https://www.sqlite.org/c3ref/c_any.html

It seems you can check the encoding with pragma encoding.
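A minimal sketch of that check, assuming an already opened connection from any ADO.NET SQLite provider (both System.Data.SQLite and Microsoft.Data.Sqlite connections derive from `DbConnection`). `SqliteEncodingCheck`, `ReadDeclaredEncoding`, and `ToDotNetEncoding` are hypothetical names. Keep in mind that the pragma only reports the encoding the database declares. SQLite does not validate text, so the stored bytes may still be in some other encoding.

```csharp
using System;
using System.Data.Common;
using System.Text;

public static class SqliteEncodingCheck
{
    // Runs PRAGMA encoding on an already opened connection.
    public static string ReadDeclaredEncoding(DbConnection openConnection)
    {
        using (DbCommand command = openConnection.CreateCommand())
        {
            command.CommandText = "PRAGMA encoding;";
            return Convert.ToString(command.ExecuteScalar());
        }
    }

    // Maps the pragma result ("UTF-8", "UTF-16le" or "UTF-16be") to a .NET encoding.
    public static Encoding ToDotNetEncoding(string pragmaValue)
    {
        switch (pragmaValue.ToUpperInvariant())
        {
            case "UTF-8": return Encoding.UTF8;
            case "UTF-16LE": return Encoding.Unicode;
            case "UTF-16BE": return Encoding.BigEndianUnicode;
            default:
                throw new ArgumentException("Unexpected PRAGMA encoding value: " + pragmaValue);
        }
    }

    public static void Main()
    {
        // No database is opened here; this only exercises the mapping.
        Console.WriteLine(ToDotNetEncoding("UTF-16le").CodePage); // 1200
    }
}
```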

Answer 1: (score: 0)

Two steps:
First, read the value from the database as a byte array.
Second, decode the 1252-encoded byte array into a string.
Something like this:

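// Assumes the query returns the column as a BLOB, e.g. SELECT CAST(columnName AS BLOB) AS columnName ...
byte[] buffer = (byte[])dataReader["columnName"];
var encoding = Encoding.GetEncoding(1252); // Windows-1252, as described above
string s = encoding.GetString(buffer);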
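Once the raw bytes are available, it is also possible to guard against rows that are already valid UTF-8. The following is a sketch under that assumption, not part of the original answer: `RawTextDecoder` and `Decode` are hypothetical names, and ISO-8859-1 (28591) stands in for the legacy code page because it is available on every .NET runtime. A strict `UTF8Encoding` throws on invalid input, which is used here as the signal to fall back.

```csharp
using System;
using System.Text;

public static class RawTextDecoder
{
    // Strict UTF-8 decoder: throws instead of silently inserting U+FFFD.
    private static readonly Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Tries UTF-8 first and falls back to the legacy code page on invalid bytes.
    public static string Decode(byte[] raw, Encoding legacyFallback)
    {
        try
        {
            return StrictUtf8.GetString(raw);
        }
        catch (DecoderFallbackException)
        {
            return legacyFallback.GetString(raw);
        }
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding(28591);
        byte[] legacyBytes = { 0x53, 0x74, 0xFC, 0x74, 0x7A, 0x65 };     // "Stütze" as ISO-8859-1
        byte[] utf8Bytes = { 0x53, 0x74, 0xC3, 0xBC, 0x74, 0x7A, 0x65 }; // "Stütze" as UTF-8
        Console.WriteLine(Decode(legacyBytes, latin1) == Decode(utf8Bytes, latin1)); // True
    }
}
```

Short legacy strings can occasionally happen to be valid UTF-8, so this heuristic is best checked against a sample of real rows.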

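Answer 2: (score: 0)

I also import data from a source that contains wrongly encoded strings. With the Microsoft.Data.Sqlite library, though, it is very easy to inject a user-defined function that fixes the encoding. In this example I also use Dapper.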

using (var cnn = new SqliteConnection($"Data Source={databasePath}")) {
    cnn.CreateFunction("fixencoding", (byte[] value) =>
        Encoding.GetEncoding(1252).GetString(value), isDeterministic: true);
    cnn.Open();
    return cnn.Query<Board>(Properties.Resources.GetBoards);
}

For this class:

public class Board
{
    public string Code { get; set; }
    public string Description { get; set; }
    public decimal Length { get; set; }
    public decimal Width { get; set; }
    public decimal Thickness { get; set; }
    public int Quantity { get; set; }
}

And this query (Properties.Resources.GetBoards):

SELECT
  fixencoding(CODE) AS Code,
  fixencoding(DESC) AS Description,
  LNGT AS Length,
  WIDT AS Width,
  THCK AS Thickness,
  QNTY AS Quantity
FROM
  BOARDS

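If the source uses the same system locale, you can simply use Encoding.Default.GetString(value) instead of Encoding.GetEncoding(1252).GetString(value). Note that this only holds on .NET Framework; on .NET Core and later, Encoding.Default is always UTF-8.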
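One practical note for newer runtimes: on .NET Core and .NET 5+, `Encoding.GetEncoding(1252)` throws until the code pages provider has been registered. The sketch below assumes such a runtime, where `CodePagesEncodingProvider` is available (on older targets it comes from the System.Text.Encoding.CodePages NuGet package). `LegacyCodePages` and `GetWindows1252` are hypothetical names.

```csharp
using System;
using System.Text;

public static class LegacyCodePages
{
    // Register the provider once, before the first lookup of a legacy code page.
    static LegacyCodePages()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    public static Encoding GetWindows1252()
    {
        return Encoding.GetEncoding(1252);
    }

    public static void Main()
    {
        Encoding cp1252 = GetWindows1252();
        // 0x80 is the euro sign and 0xFC is u-umlaut in Windows-1252.
        string decoded = cp1252.GetString(new byte[] { 0x80, 0xFC });
        Console.WriteLine(decoded == "\u20AC\u00FC"); // True
    }
}
```

The `fixencoding` function above could then call `LegacyCodePages.GetWindows1252().GetString(value)` instead of looking up the encoding directly.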