Question

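I have a C# method that needs to take the first character of a string and check whether it exists in a HashSet containing specific Unicode characters (all the right-to-left characters).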

So I'm doing

var c = str[0];

and then checking the HashSet.

The problem is that this code doesn't work for strings whose first character has a code point greater than 65535.

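To test this, I wrote a loop that goes over every number from 0 to 70,000 (the highest RTL code point is around 68,000, so I rounded up). I create a byte array from each number and use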

Encoding.UTF32.GetString(BitConverter.GetBytes(intValue));

to create a string containing that character. Then I pass it to the method that searches the HashSet, and that method fails, because when it gets

str[0]

the value is never what it should be.

What am I doing wrong?

Answer 1

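A String is a sequence of UTF-16 code units, one or two of which encode a Unicode code point. If you want to get code points out of a string, you have to iterate over the code points in the string. A "character" is, in addition, a base code point followed by a sequence of zero or more combining code points ("combining characters").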

// Use a HashSet<String>

var itor = StringInfo.GetTextElementEnumerator(s);
while (itor.MoveNext()) {
    var character = itor.GetTextElement();
    // find character in your HashSet
}

If you don't need to consider combining code points, you can strip them out. (But they are very significant in some languages.)
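If only code points matter, and not whole text elements, the iteration can also be done by hand. The following is a minimal sketch, not part of the original answer: `CodePoints` is a hypothetical helper name, and it assumes that an unpaired surrogate should simply be passed through as its raw value.

```csharp
using System;
using System.Collections.Generic;

public static class CodePointEnumeration
{
    // Hypothetical helper: yields each Unicode code point in s as an int.
    public static IEnumerable<int> CodePoints(string s)
    {
        for (var i = 0; i < s.Length; i++)
        {
            if (char.IsHighSurrogate(s[i]) && i + 1 < s.Length && char.IsLowSurrogate(s[i + 1]))
            {
                // A well-formed surrogate pair encodes one code point above U+FFFF.
                yield return char.ConvertToUtf32(s[i], s[i + 1]);
                i++; // skip the low surrogate that was just consumed
            }
            else
            {
                // A BMP char, or an unpaired surrogate passed through as-is (assumption).
                yield return s[i];
            }
        }
    }

    public static void Main()
    {
        // "a", then U+10800 written as its surrogate pair, then "b"
        foreach (var cp in CodePoints("a\uD802\uDC00b"))
            Console.WriteLine($"U+{cp:X4}");
    }
}
```

With a HashSet<int> of code points, the first value yielded by such a helper is what would be looked up, rather than `str[0]`.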

Answer 2

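For anyone who finds this question in the future and is interested in the solution I ended up with: this is my method, which decides whether a string should be displayed RTL or LTR based on its first character. It takes UTF-16 surrogate pairs into account.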

Thanks to Tom Blodget for pointing me in the right direction.

if (string.IsNullOrEmpty(str)) return null;

var firstChar = str[0];
if (firstChar >= 0xd800 && firstChar <= 0xdbff)
{
    // if the first char is in the range 0xD800-0xDBFF, it is the high (leading)
    // half of a UTF-16 surrogate pair. there MUST be one more char after this one,
    // in the range 0xDC00-0xDFFF.
    // for the very unlikely case that this is a corrupt UTF-16 string and the
    // second char is missing or out of range, fall back to left-to-right
    if (str.Length == 1 || str[1] < 0xdc00 || str[1] > 0xdfff)
        return FlowDirection.LeftToRight;

    // convert surrogate pair to a 32 bit number, and check the codepoint table
    var highSurrogate = firstChar - 0xd800;
    var lowSurrogate = str[1] - 0xdc00;
    var codepoint = (highSurrogate << 10) + (lowSurrogate) + 0x10000;

    return _codePoints.Contains(codepoint)
        ? FlowDirection.RightToLeft
        : FlowDirection.LeftToRight;
}
return _codePoints.Contains(firstChar)
    ? FlowDirection.RightToLeft
    : FlowDirection.LeftToRight;
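As a sanity check on the surrogate arithmetic in the method above, the sketch below decodes one pair by hand and compares the result with the built-in `char.ConvertToUtf32`. The class and the `Decode` helper are hypothetical names introduced only for this illustration; U+10800 (decimal 67584) is picked because it falls in the range the question mentions.

```csharp
using System;

public static class SurrogateMath
{
    // Same arithmetic as the method above, isolated in a hypothetical helper.
    public static int Decode(char high, char low)
    {
        return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
    }

    public static void Main()
    {
        const char high = '\uD802'; // 0xD802 - 0xD800 = 2, and 2 << 10 = 0x800
        const char low = '\uDC00';  // 0xDC00 - 0xDC00 = 0

        int manual = Decode(high, low);               // 0x800 + 0 + 0x10000
        int builtIn = char.ConvertToUtf32(high, low); // framework equivalent

        Console.WriteLine($"manual = 0x{manual:X}, built-in = 0x{builtIn:X}");
    }
}
```

The built-in call throws an ArgumentOutOfRangeException when the two chars are not a valid high/low pair, so the explicit range checks are still needed if a corrupt string should be tolerated.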

Answer 3

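I'm not sure I understand your question; a small snippet of code might help. When you have a line like 'var c = str[0]', assuming 'str' is a string, then c will be a char, which is a UTF-16 code unit. Because of this, c can never be greater than (2^16 - 1). Unicode characters can be larger than that, but when they are, they are encoded so as to span more than one 'character' position. In the case of UTF-16, the first 'character' may occupy one or two 16-bit values.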
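A short sketch of this, assuming the string is built with `char.ConvertFromUtf32` (the class name is made up for the example): a code point above 65535 yields a string of length 2, and `str[0]` is only the first half of it.

```csharp
using System;

public static class FirstCharDemo
{
    public static void Main()
    {
        // 0x10800 (67584) is above 65535, so it needs two UTF-16 code units.
        string s = char.ConvertFromUtf32(0x10800);

        // Length counts chars (UTF-16 code units), not code points.
        Console.WriteLine(s.Length);

        // s[0] is the high surrogate, not the code point itself.
        Console.WriteLine(((int)s[0]).ToString("X4"));

        // Reading both code units together recovers the full code point.
        Console.WriteLine(char.ConvertToUtf32(s, 0).ToString("X"));
    }
}
```

Note that `char.ConvertFromUtf32` throws an ArgumentOutOfRangeException for values in the surrogate range 0xD800 to 0xDFFF, so a loop from 0 to 70,000 would have to skip those.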

C#: reading the first character of a string when the char's Unicode value is > 65535
