Question

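I save some strings from a third party to my database (Postgres). Sometimes these strings are too long and need to be truncated to fit the column in my table.

On some random occasions I accidentally truncate the string in the middle of a Unicode character, which gives me a "broken" string that I cannot save to the database. I get the following error: Unable to translate Unicode character \uD83D at index XXX to specified code page.

I've created a minimal example to show what I mean. Here I have a string that contains a Unicode character ("Small Blue Diamond", U+1F539). Depending on where I truncate, I either get a valid string or not.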

var myString = @"This is a string before an emoji:🔹This is after the emoji.";

var brokenString = myString.Substring(0, 34);
// Gives: "This is a string before an emoji:☐" (ends with a lone high surrogate)

var test3 = myString.Substring(0, 35);
// Gives: "This is a string before an emoji:🔹"

Is there a way for me to truncate the string without accidentally breaking any Unicode characters?

Answer 1

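A Unicode character can be represented by more than one char, and that is the problem you are running into with string.Substring.

You can convert your string to a StringInfo object and then use the SubstringByTextElements() method to get a substring based on the number of text elements (user-perceived characters) rather than the number of char values.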

See the C# demo:

Console.WriteLine("🔹".Length); // => 2
Console.WriteLine(new StringInfo("🔹").LengthInTextElements); // => 1

var myString = @"This is a string before an emoji:🔹This is after the emoji.";
var teMyString = new StringInfo(myString);
Console.WriteLine(teMyString.SubstringByTextElements(0, 33));
// => This is a string before an emoji:
Console.WriteLine(teMyString.SubstringByTextElements(0, 34));
// => This is a string before an emoji:🔹
Console.WriteLine(teMyString.SubstringByTextElements(0, 35));
// => This is a string before an emoji:🔹T
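
A text element is not limited to surrogate pairs: a base letter followed by a combining mark is also counted as one element. The following is a minimal sketch, assuming a hypothetical helper class named TextElementDemo (it is not part of the original answer), that puts the two ways of counting side by side:

```csharp
using System.Globalization;

// Hypothetical helper, for illustration only.
public static class TextElementDemo
{
    // Number of UTF-16 code units, which is what string.Length reports.
    public static int CharCount(string s) => s.Length;

    // Number of text elements, which is what SubstringByTextElements counts.
    public static int TextElementCount(string s) => new StringInfo(s).LengthInTextElements;
}
```

For "e\u0301" (the letter e followed by a combining acute accent) CharCount returns 2 and TextElementCount returns 1. The same holds for "\uD83D\uDD39", the surrogate pair that encodes U+1F539.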

Answer 2

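I ended up using a modification of xanatos' answer here. The difference is that this version drops the last grapheme if adding it would produce a string longer than length.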

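using System;
using System.Globalization;
using System.Text;

// The class name is illustrative; the original snippet omitted the class declaration.
public static class StringExtensions
{
    public static string UnicodeSafeSubstring(this string str, int startIndex, int length)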
    {
        if (str == null)
        {
            throw new ArgumentNullException(nameof(str));
        }

        if (startIndex < 0 || startIndex > str.Length)
        {
            throw new ArgumentOutOfRangeException(nameof(startIndex));
        }

        if (length < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(length));
        }

        if (startIndex + length > str.Length)
        {
            throw new ArgumentOutOfRangeException(nameof(length));
        }

        if (length == 0)
        {
            return string.Empty;
        }

        var stringBuilder = new StringBuilder(length);

        var enumerator = StringInfo.GetTextElementEnumerator(str, startIndex);

        while (enumerator.MoveNext())
        {
            var grapheme = enumerator.GetTextElement();
            startIndex += grapheme.Length;

            if (startIndex > str.Length)
            {
                break;
            }

            // Skip initial Low Surrogates/Combining Marks
            if (stringBuilder.Length == 0)
            {
                if (char.IsLowSurrogate(grapheme[0]))
                {
                    continue;
                }

                var cat = char.GetUnicodeCategory(grapheme, 0);

                if (cat == UnicodeCategory.NonSpacingMark || cat == UnicodeCategory.SpacingCombiningMark || cat == UnicodeCategory.EnclosingMark)
                {
                    continue;
                }
            }

            // Stop before a grapheme that would push the result past the required length.
            // Breaking here (rather than continuing) prevents a later, shorter grapheme
            // from being appended after a skipped one.
            if (stringBuilder.Length + grapheme.Length > length)
            {
                break;
            }

            stringBuilder.Append(grapheme);

            if (stringBuilder.Length >= length)
            {
                break;
            }
        }

        return stringBuilder.ToString();
    }
}

Answer 3

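Here is an example for truncation (startIndex = 0). It only guards against splitting a surrogate pair, not other multi-char text elements such as combining sequences: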

string truncatedStr = (str.Length > maxLength)
    ? str.Substring(0, maxLength - (char.IsLowSurrogate(str[maxLength]) ? 1 : 0))
    : str;
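
The same check can be wrapped in a helper with argument validation. This is a minimal sketch rather than a definitive implementation: the class and method names (SurrogateSafeTruncation, Truncate) are hypothetical, and the extra maxLength > 0 and IsHighSurrogate guards are additions for malformed input that the one-liner above does not include.

```csharp
using System;

// Hypothetical wrapper around the surrogate check, for illustration only.
public static class SurrogateSafeTruncation
{
    public static string Truncate(string str, int maxLength)
    {
        if (str == null) throw new ArgumentNullException(nameof(str));
        if (maxLength < 0) throw new ArgumentOutOfRangeException(nameof(maxLength));
        if (str.Length <= maxLength) return str;

        // If the cut would land between a high and a low surrogate, cut one char earlier.
        // The maxLength > 0 guard covers a malformed string that starts with a lone low surrogate.
        var splitsPair = maxLength > 0
            && char.IsLowSurrogate(str[maxLength])
            && char.IsHighSurrogate(str[maxLength - 1]);

        return str.Substring(0, splitsPair ? maxLength - 1 : maxLength);
    }
}
```

With the string from the question, Truncate(myString, 34) returns the 33 chars before the emoji, and Truncate(myString, 35) returns 35 chars ending with the complete emoji.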

Answer 4

It is better to truncate by byte count rather than by string length:

    public static string TruncateByBytes(this string text, int maxBytes)
    {
        if (string.IsNullOrEmpty(text) || Encoding.UTF8.GetByteCount(text) <= maxBytes)
        {
            return text;
        }

        var enumerator = StringInfo.GetTextElementEnumerator(text);
        var builder = new StringBuilder();
        var byteCount = 0;

        // Append whole text elements until the next one would exceed the byte budget.
        while (enumerator.MoveNext())
        {
            var element = enumerator.GetTextElement();
            byteCount += Encoding.UTF8.GetByteCount(element);

            if (byteCount > maxBytes)
            {
                break;
            }

            builder.Append(element);
        }

        return builder.ToString();
    }
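
To see why a byte limit and a char limit can disagree, it may help to compare string.Length with the UTF-8 size of a few strings. The sketch below assumes a hypothetical helper class named Utf8SizeDemo; it only wraps Encoding.UTF8.GetByteCount, the same call the method above relies on.

```csharp
using System.Text;

// Hypothetical helper, for illustration only.
public static class Utf8SizeDemo
{
    // Bytes needed to store the string as UTF-8.
    public static int ByteCount(string s) => Encoding.UTF8.GetByteCount(s);
}
```

ByteCount("abc") is 3, ByteCount("\u00E9") is 2 although Length is 1, and ByteCount("\uD83D\uDD39") is 4 although Length is 2.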

Accidentally splitting Unicode characters when truncating a string
