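I save strings from a third party to my database (Postgres). Sometimes these strings are too long and have to be truncated to fit the column in my table.
Occasionally I accidentally truncate a string in the middle of a Unicode character, which leaves me with a "broken" string that I cannot save to the database. I get the following error: Unable to translate Unicode character \uD83D at index XXX to specified code page.
I have created a minimal example to show what I mean. Here I have a string containing a Unicode character ("small blue diamond", U+1F539). Depending on where I truncate, I get either a valid string or a broken one.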
var myString = @"This is a string before an emoji:🔹This is after the emoji.";
var brokenString = myString.Substring(0, 34);
// Gives: "This is a string before an emoji:☐" (ends with a lone high surrogate, \uD83D)
var test3 = myString.Substring(0, 35);
// Gives: "This is a string before an emoji:🔹"
Is there a way to truncate the string without accidentally breaking any Unicode characters?
Answer 0: (score: 4)
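A Unicode character can be represented by more than one char, and that is the problem you are running into with string.Substring.
You can convert the string to a StringInfo object and then use the SubstringByTextElements() method to take a substring based on the number of Unicode characters (text elements) rather than the char count.
See the C# demo: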
Console.WriteLine("🔹".Length); // => 2
Console.WriteLine(new StringInfo("🔹").LengthInTextElements); // => 1
var myString = @"This is a string before an emoji:🔹This is after the emoji.";
var teMyString = new StringInfo(myString);
Console.WriteLine(teMyString.SubstringByTextElements(0, 33));
// => This is a string before an emoji:
Console.WriteLine(teMyString.SubstringByTextElements(0, 34));
// => This is a string before an emoji:🔹
Console.WriteLine(teMyString.SubstringByTextElements(0, 35));
// => This is a string before an emoji:🔹T
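Note that SubstringByTextElements() counts text elements, so the result can still contain more char values than a fixed budget allows. If the limit is expressed in char values, one possible approach is to look up the text element boundaries with StringInfo.ParseCombiningCharacters and cut at the last boundary that fits. The following is only a sketch; the class and method names are made up for illustration.

```csharp
using System;
using System.Globalization;

public static class TextElementTruncation
{
    // Hypothetical helper (not part of .NET): returns the longest prefix of text
    // that is at most maxChars UTF-16 code units long and ends on a text element boundary.
    public static string TruncateAtTextElement(string text, int maxChars)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (maxChars < 0) throw new ArgumentOutOfRangeException(nameof(maxChars));
        if (text.Length <= maxChars) return text;

        // Start index of every text element, in ascending order.
        int[] starts = StringInfo.ParseCombiningCharacters(text);

        // Keep the largest boundary that does not exceed the budget.
        int cut = 0;
        foreach (int start in starts)
        {
            if (start > maxChars) break;
            cut = start;
        }
        return text.Substring(0, cut);
    }
}
```

With the string from the question, a budget of 34 drops the emoji entirely instead of splitting it, while a budget of 35 keeps it.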
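Answer 1: (score: 0)
I ended up using a modified version of xanatos's answer. The difference is that this version drops the last grapheme if adding it would produce a string longer than length.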
using System;
using System.Globalization;
using System.Text;

public static class StringExtensions
{
    public static string UnicodeSafeSubstring(this string str, int startIndex, int length)
    {
        if (str == null)
        {
            throw new ArgumentNullException(nameof(str));
        }
        if (startIndex < 0 || startIndex > str.Length)
        {
            throw new ArgumentOutOfRangeException(nameof(startIndex));
        }
        if (length < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(length));
        }
        if (startIndex + length > str.Length)
        {
            throw new ArgumentOutOfRangeException(nameof(length));
        }
        if (length == 0)
        {
            return string.Empty;
        }
        var stringBuilder = new StringBuilder(length);
        var enumerator = StringInfo.GetTextElementEnumerator(str, startIndex);
        while (enumerator.MoveNext())
        {
            var grapheme = enumerator.GetTextElement();
            startIndex += grapheme.Length;
            if (startIndex > str.Length)
            {
                break;
            }
            // Skip initial Low Surrogates/Combining Marks
            if (stringBuilder.Length == 0)
            {
                if (char.IsLowSurrogate(grapheme[0]))
                {
                    continue;
                }
                var cat = char.GetUnicodeCategory(grapheme, 0);
                if (cat == UnicodeCategory.NonSpacingMark || cat == UnicodeCategory.SpacingCombiningMark || cat == UnicodeCategory.EnclosingMark)
                {
                    continue;
                }
            }
            // Stop at the first grapheme that would make the result longer than the required length.
            // Continuing past it could append a later, shorter grapheme and yield a non-contiguous result.
            if (stringBuilder.Length + grapheme.Length > length)
            {
                break;
            }
            stringBuilder.Append(grapheme);
            if (stringBuilder.Length >= length)
            {
                break;
            }
        }
        return stringBuilder.ToString();
    }
}
Answer 2: (score: 0)
Here is an example for truncation (startIndex = 0):
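// If the first dropped char is a low surrogate, the cut would split a surrogate pair,
// so the preceding high surrogate is dropped as well.
string truncatedStr = (str.Length > maxLength)
    ? str.Substring(0, maxLength - (char.IsLowSurrogate(str[maxLength]) ? 1 : 0))
    : str;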
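This expression only protects surrogate pairs; combining marks that follow the cut are still separated from their base character. As a rough sketch, the same check can be wrapped in a helper that also validates its arguments and handles maxLength = 0. The class and method names here are hypothetical.

```csharp
using System;

public static class SurrogateSafeTruncation
{
    // Hypothetical helper: truncates to at most maxLength chars without
    // cutting between the two halves of a surrogate pair.
    public static string TruncateAtSurrogateBoundary(string str, int maxLength)
    {
        if (str == null) throw new ArgumentNullException(nameof(str));
        if (maxLength < 0) throw new ArgumentOutOfRangeException(nameof(maxLength));
        if (str.Length <= maxLength) return str;

        int cut = maxLength;
        // Back up one char only when the cut would fall inside a surrogate pair.
        if (cut > 0 && char.IsHighSurrogate(str[cut - 1]) && char.IsLowSurrogate(str[cut]))
        {
            cut--;
        }
        return str.Substring(0, cut);
    }
}
```

Checking both halves of the pair means a stray low surrogate at index 0 does not push the length below zero.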
Answer 3: (score: 0)
It is better to truncate by byte count rather than by string length:
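// Requires: using System.Globalization; using System.Text;
// This is an extension method, so it must be declared inside a static class.
public static string TruncateByBytes(this string text, int maxBytes)
{
    if (string.IsNullOrEmpty(text) || Encoding.UTF8.GetByteCount(text) <= maxBytes)
    {
        return text;
    }
    var enumerator = StringInfo.GetTextElementEnumerator(text);
    var builder = new StringBuilder();
    var byteCount = 0;
    // Append whole text elements until the next one would exceed the byte budget.
    while (enumerator.MoveNext())
    {
        var element = enumerator.GetTextElement();
        byteCount += Encoding.UTF8.GetByteCount(element);
        if (byteCount > maxBytes)
        {
            break;
        }
        builder.Append(element);
    }
    return builder.ToString();
}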