Question

I am trying to write a comparison function in C that takes UTF-8 encoded Unicode strings and uses the Windows CompareStringEx() function, and I want it to behave just like .NET's CultureInfo.CompareInfo.Compare().

The function I wrote in C works some of the time, but not in all cases, and I am trying to figure out why. Here is a failing case (shown in C#, not C):

CultureInfo cultureInfo = new CultureInfo("en-US");
CompareOptions compareOptions = CompareOptions.IgnoreCase | CompareOptions.IgnoreKanaType | CompareOptions.IgnoreWidth;

string stringA = "คนอ้วน ๆ";
string stringB = "はじめまして";
//Result is -1 which is expected
int result = cultureInfo.CompareInfo.Compare(stringA, stringB, compareOptions);

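Here is what I have written in C. Keep in mind that it is supposed to take UTF-8 encoded strings and use the Windows CompareStringEx() function, so a conversion is necessary.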

// Compare flags for the string comparison
#define COMPARE_STRING_FLAGS (NORM_IGNORECASE | NORM_IGNOREKANATYPE | NORM_IGNOREWIDTH)

int CompareStrings(int lenA, const void *strA, int lenB, const void *strB) 
{
    LCID ENGLISH_LCID = MAKELCID(MAKELANGID(LANG_ENGLISH, SUBLANG_ENGLISH_US), SORT_DEFAULT);
    int compareValue = -1;

    // Get the size of the strings as UTF-16 encoded Unicode strings. 
    // Note: Passing 0 as the last parameter forces the MultiByteToWideChar function
    // to give us the required buffer size to convert the given string to UTF-16
    int strAWStrBufferSize = MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)strA, lenA, NULL, 0);
    int strBWStrBufferSize = MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)strB, lenB, NULL, 0);

    // Malloc the strings to store the converted UTF-16 values
    LPWSTR utf16StrA = (LPWSTR) GlobalAlloc(GMEM_FIXED, strAWStrBufferSize * sizeof(WCHAR));
    LPWSTR utf16StrB = (LPWSTR) GlobalAlloc(GMEM_FIXED, strBWStrBufferSize * sizeof(WCHAR));

    // Convert the UTF-8 strings (SQLite will pass them as UTF-8 to us) to standard  
    // windows WCHAR (UTF-16\UCS-2) encoding for Unicode so they can be used in the 
    // Windows CompareStringEx() function.
    if(strAWStrBufferSize != 0)
    {
        MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)strA, lenA, utf16StrA, strAWStrBufferSize);
    }
    if(strBWStrBufferSize != 0)
    {
        MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)strB, lenB, utf16StrB, strBWStrBufferSize);
    }

    // Compare the strings using the windows compare function.
    // Note: lenA and lenB exclude the terminating null, so the sizes returned by
    // MultiByteToWideChar exclude it as well and are passed through unchanged.
    if(NULL != utf16StrA && NULL != utf16StrB)
    {
        compareValue = CompareStringEx(L"en-US", COMPARE_STRING_FLAGS, utf16StrA, strAWStrBufferSize, utf16StrB, strBWStrBufferSize, NULL, NULL, 0);
    }

    // In the Windows CompareStringEx() function, 0 indicates an error, 1 indicates less than, 
    // 2 indicates equal to, 3 indicates greater than so subtract 2 to maintain C convention
    if(compareValue > 0)
    {
        compareValue -= 2;
    }

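    // Release the temporary UTF-16 buffers.
    if(NULL != utf16StrA) GlobalFree(utf16StrA);
    if(NULL != utf16StrB) GlobalFree(utf16StrB);

    return compareValue;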
}
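
One detail in the function above: a return value of 0 from CompareStringEx() means the call failed, yet it is passed through as 0, which a caller would read as "equal". A minimal sketch of keeping the two apart follows. The helper name map_cstr_result is hypothetical (it is not part of the Windows API), and it relies only on the 0/1/2/3 convention described in the comments above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: converts a CompareStringEx-style result
   (0 = failure, 1 = less than, 2 = equal, 3 = greater than)
   into the C convention (-1, 0, 1).
   Returns 1 on success. Returns 0 when the input does not encode a
   valid comparison result, in which case *out is left unchanged. */
static int map_cstr_result(int cstr, int *out)
{
    assert(out != NULL);

    if (cstr < 1 || cstr > 3) {
        return 0;   /* failure: do not report it as "equal" */
    }
    *out = cstr - 2;
    return 1;
}
```

A caller could then fall back to some other ordering (a plain byte comparison, for instance) when the helper returns 0, instead of treating a failed call as a match.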

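Now, if I run the following code, I expect the result to be -1 based on the .NET implementation (see above), but I get 1, meaning the first string is greater: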

char strA[50] = "คนอ้วน ๆ";
char strB[50] = "はじめまして";

// Will be 1 when we expect it to be -1
int result = CompareStrings(strlen(strA), strA, strlen(strB), strB);

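Any ideas as to why I am getting different results? I am using the same LCID/cultureInfo and compareOptions in both implementations, and as far as I can tell the conversions succeed.

FYI: this function will be used as a custom collation in SQLite. That is not relevant to the question, but it explains the function signature in case anyone is wondering.

UPDATE: I have also determined that when the same code runs under .NET 4, I see the same behavior as in the native code. So there is a discrepancy between versions of .NET. See my answer below for the reasons behind this.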

Answer 1

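Well, your code performs several steps here, and it is not clear whether the comparison step is the one that is failing.

As a first step, I would write out, in both the .NET code and the C code, the exact UTF-16 code units you have in utf16StrA, utf16StrB, stringA and stringB. I would not be at all surprised to find a problem with the input data you are using in the C code.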
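
A minimal sketch of such a dump on the C side follows. The helper name format_utf16_units is hypothetical, and it takes uint16_t rather than WCHAR so that it stays portable; on Windows, where WCHAR is 16 bits wide, the converted buffer could be passed with a cast.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: writes the UTF-16 code units units[0..count) to
   out as space-separated four-digit hex values, e.g. "0E04 0E19".
   Returns the number of characters written (excluding the terminator),
   or -1 if the output buffer is too small. */
static int format_utf16_units(const uint16_t *units, size_t count,
                              char *out, size_t out_size)
{
    size_t used = 0;
    size_t i;

    assert(out != NULL && out_size > 0);
    out[0] = '\0';

    for (i = 0; i < count; ++i) {
        /* No separator before the first unit, one space before the rest. */
        int n = snprintf(out + used, out_size - used,
                         i == 0 ? "%04X" : " %04X", (unsigned)units[i]);
        if (n < 0 || (size_t)n >= out_size - used) {
            return -1;   /* output would have been truncated */
        }
        used += (size_t)n;
    }
    return (int)used;
}
```

Printing the same values from C# (each char of the string cast to int and formatted as hex) gives two lists that can be compared unit by unit.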

Answer 2

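What you are hoping for here is that your text editor saves the source code file as UTF-8, and that the compiler then somehow does not interpret the source code as UTF-8. That is too much to hope for, at least on my compiler: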

warning C4566: character represented by universal-character-name '\u0E04' cannot be represented in the current code page (1252)

Fix:

const wchar_t* strA = L"คนอ้วน ๆ";
const wchar_t* strB = L"はじめまして";

And remove the conversion code.
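
If the UTF-8 input path has to stay (for example because the caller hands over UTF-8), a different option from the wide literals above is to spell the test bytes out with explicit escapes, so the result no longer depends on how the editor saved the file or how the compiler reads it. The sketch below is an illustration only: THAI_PREFIX_UTF8 and decode_utf8_first are hypothetical names, and the decoder is deliberately minimal (it does not reject overlong forms or surrogates).

```c
#include <assert.h>
#include <stddef.h>

/* "คน" (U+0E04 U+0E19) written as explicit UTF-8 bytes, independent of
   the encoding of the source file. */
static const char THAI_PREFIX_UTF8[] = "\xE0\xB8\x84\xE0\xB8\x99";

/* Hypothetical helper: decodes the first code point of a UTF-8 byte
   sequence into *cp. Returns the number of bytes consumed, or 0 if the
   sequence is malformed or truncated (*cp is then left unchanged). */
static size_t decode_utf8_first(const unsigned char *s, size_t len,
                                unsigned long *cp)
{
    unsigned long value;
    size_t need, i;

    assert(s != NULL && cp != NULL);
    if (len == 0) {
        return 0;
    }

    if (s[0] < 0x80)                { value = s[0];        need = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { value = s[0] & 0x1F; need = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { value = s[0] & 0x0F; need = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { value = s[0] & 0x07; need = 4; }
    else                            { return 0; }

    if (len < need) {
        return 0;
    }
    for (i = 1; i < need; ++i) {
        if ((s[i] & 0xC0) != 0x80) {
            return 0;   /* not a continuation byte */
        }
        value = (value << 6) | (unsigned long)(s[i] & 0x3F);
    }
    *cp = value;
    return need;
}
```

Decoding the first code point of the buffer and checking that it is U+0E04, the character named in the warning above, is a quick way to confirm that the bytes reaching the comparison function are what was intended.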

Answer 3

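So I ended up figuring out the issue after contacting Microsoft support. Here is what they had to say about it:

The reason for the issue you are seeing, namely running CompareInfo.Compare on the same strings with the same compare options but getting different return values under different versions of the .NET Framework, is that the sorting rules are tied to the Unicode specification, which evolves over time. Historically, .NET has snapped its data for side-by-side releases to correspond to the latest version of Windows and the corresponding version of Unicode implemented at that time, so 2.0, 3.0 and 3.5 correspond to the version in Windows XP or Server 2003, whereas v4.0 matches the Vista sorting rules. As a result, the sorting rules for the various versions of the .NET Framework have changed over time.

This also means that when I ran the native code, I was calling sorting methods that follow the Vista sorting rules, and when I ran under .NET 3.5, I was calling sorting methods that use the Windows XP sorting rules. It seems odd to me that the Unicode specification would change in a way that causes such a drastic difference, but apparently that is the case here. In my opinion, changing the Unicode specification so dramatically is a great way to break backward compatibility.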

Comparing Unicode Strings in C Returns Different Values Than C#
