截断字符串时意外拆分unicode字符

时间:2017-09-29 08:11:54

标签: c# string postgresql unicode

我将第三方的一些字符串保存到我的数据库中(postgres)。有时这些字符串太长,需要截断以适应我表中的列。

在一些随机的场合,我不小心将字符串截断到有Unicode字符的地方,这给了我一个"破坏"我无法保存到数据库中的字符串。我收到以下错误:Unable to translate Unicode character \uD83D at index XXX to specified code page

我已经创建了一个最小的例子来向您展示我的意思。这里我有一个包含Unicode字符的字符串("小蓝钻" U + 1F539)。根据我截断的位置,它会给我一个有效的字符串。

var myString = @"This is a string before an emoji: This is after the emoji.";

var brokenString = myString.Substring(0, 34);
// Gives: "This is a string before an emoji:☐"

var test3 = myString.Substring(0, 35);
// Gives: "This is a string before an emoji:"

有没有办法让我截断字符串而不会意外地破坏任何Unicode字符?

4 个答案:

答案 0 :(得分:4)

Unicode字符可以用多个char表示,这就是您遇到的string.Substring问题。

您可以将string转换为StringInfo对象,然后使用SubstringByTextElements() method根据Unicode字符数获取子字符串,而不是char计数。

查看C# demo

Console.WriteLine("".Length); // => 2
Console.WriteLine(new StringInfo("").LengthInTextElements); // => 1

var myString = @"This is a string before an emoji:This is after the emoji.";
var teMyString = new StringInfo(myString);
Console.WriteLine(teMyString.SubstringByTextElements(0, 33));
// => "This is a string before an emoji:"
Console.WriteLine(teMyString.SubstringByTextElements(0, 34));
// => This is a string before an emoji:
Console.WriteLine(teMyString.SubstringByTextElements(0, 35));
// => This is a string before an emoji:T

答案 1 :(得分:0)

我最终使用xanatos answer here的修改。不同之处在于此版本将删除最后一个字形,如果添加它会产生长于length的字符串。

    public static string UnicodeSafeSubstring(this string str, int startIndex, int length)
    {
        if (str == null)
        {
            throw new ArgumentNullException(nameof(str));
        }

        if (startIndex < 0 || startIndex > str.Length)
        {
            throw new ArgumentOutOfRangeException(nameof(startIndex));
        }

        if (length < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(length));
        }

        if (startIndex + length > str.Length)
        {
            throw new ArgumentOutOfRangeException(nameof(length));
        }

        if (length == 0)
        {
            return string.Empty;
        }

        var stringBuilder = new StringBuilder(length);

        var enumerator = StringInfo.GetTextElementEnumerator(str, startIndex);

        while (enumerator.MoveNext())
        {
            var grapheme = enumerator.GetTextElement();
            startIndex += grapheme.Length;

            if (startIndex > str.Length)
            {
                break;
            }

            // Skip initial Low Surrogates/Combining Marks
            if (stringBuilder.Length == 0)
            {
                if (char.IsLowSurrogate(grapheme[0]))
                {
                    continue;
                }

                var cat = char.GetUnicodeCategory(grapheme, 0);

                if (cat == UnicodeCategory.NonSpacingMark || cat == UnicodeCategory.SpacingCombiningMark || cat == UnicodeCategory.EnclosingMark)
                {
                    continue;
                }
            }

            // Do not append the grapheme if the resulting string would be longer than the required length
            if (stringBuilder.Length + grapheme.Length <= length)
            {
                stringBuilder.Append(grapheme);
            }

            if (stringBuilder.Length >= length)
            {
                break;
            }
        }

        return stringBuilder.ToString();
    }
}

答案 2 :(得分:0)

以下是截断(startIndex = 0)的示例:

string truncatedStr = (str.Length > maxLength)
    ? str.Substring(0, maxLength - (char.IsLowSurrogate(str[maxLength]) ? 1 : 0))
    : str;

答案 3 :(得分:0)

更好的截断字节数而不是字符串长度

   public static string TruncateByBytes(this string text, int maxBytes)
    {
        if (string.IsNullOrEmpty(text) || Encoding.UTF8.GetByteCount(text) <= maxBytes)
        {
            return text;
        }
        var enumerator = StringInfo.GetTextElementEnumerator(text);
        var newStr = string.Empty;
        do
        {
            enumerator.MoveNext();
            if (Encoding.UTF8.GetByteCount(newStr + enumerator.Current) <= maxBytes)
            {
                newStr += enumerator.Current;
            }
            else
            {
                break;
            }
        } while (true);
        return newStr;
    }