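Some APIs I use require the input string to be a valid UTF-8 string with a maximum length of 4096 bytes.
I have the following function to trim the excess characters: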
private static string GetTelegramMessage(string message)
{
    const int telegramMessageMaxLength = 4096; // https://core.telegram.org/method/messages.sendMessage#return-errors
    const string tooLongMessageSuffix = "...";
    if (message == null || message.Length <= 4096)
    {
        return message;
    }
    return message.Remove(telegramMessageMaxLength - tooLongMessageSuffix.Length) + tooLongMessageSuffix;
}
It doesn't work well, because characters != bytes, and UTF-16 chars != UTF-8 chars.
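To make that mismatch concrete, here is a minimal Rust sketch that counts the same text three ways. The helper name `lengths` and the sample strings are only illustrative. Note that C#'s `string.Length` corresponds to the UTF-16 count here, while the 4096 limit applies to the UTF-8 byte count.

```rust
/// Illustrative helper: returns (UTF-8 bytes, UTF-16 code units, Unicode scalar values).
fn lengths(s: &str) -> (usize, usize, usize) {
    (s.len(), s.encode_utf16().count(), s.chars().count())
}

fn main() {
    // ASCII, a 2-byte letter, a 3-byte rune, and a 4-byte emoji (a surrogate pair in UTF-16).
    for s in ["abc", "þ", "ᚠ", "😀"] {
        let (utf8, utf16, scalars) = lengths(s);
        println!(
            "{:?}: {} UTF-8 bytes, {} UTF-16 units, {} code points",
            s, utf8, utf16, scalars
        );
    }
}
```

The three counts only agree for pure ASCII input, which is why a length check on the UTF-16 string cannot enforce a byte limit.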
So basically I need to convert a C# UTF-16 string into a UTF-8 string with a fixed maximum length. I can do
var bytes = Encoding.UTF8.GetBytes(myString);
// now I need to get first N characters with overall bytes size less than 4096 bytes
I can express what I need in Rust (working example below):
fn main() {
    let foo = format!("{}{}", "ᚠᛇᚻ᛫ᛒᛦᚦ᛫ᚠᚱᚩᚠᚢᚱ᛫ᚠᛁᚱᚪ᛫ᚷᛖᚻᚹᛦᛚᚳᚢᛗ Uppen Sevarne staþe, sel þar him þuhte", (1..5000).map(|_| '1').collect::<String>());
    println!("{}", foo.len());
    let message = get_telegram_message(&foo);
    println!("{}", message);
    println!("{}", message.chars().count()); // 4035
    println!("{}", message.len()); // 4096
}

pub fn get_telegram_message(foo: &str) -> String {
    const PERIOD: &'static str = "...";
    const MAX_LENGTH: usize = 4096;
    let message_length = MAX_LENGTH - PERIOD.len();
    foo.chars()
        .map(|c| (c, c.len_utf8())) // UTF-8 length of every char
        .scan((0, '\0'), |(s, _), (c, size)| {
            *s += size; // running total of bytes for all characters seen so far
            Some((*s, c))
        })
        .take_while(|(len, _)| len <= &message_length) // keep chars while the running total fits in the byte budget
        .map(|(_, c)| c)
        .chain(PERIOD.chars()) // add trailing ellipsis
        .collect() // build the string
}
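The problem here is that I don't have a chars() iterator in C#, so I can't treat a byte sequence as UTF-8 characters.
I've played with Encoding.UTF8, but I couldn't find a suitable API for this task.
The linked post is somewhat related to my question, but its first answer is very bad, and the second one reimplements a UTF-8 iterator (that is what I mean by IEnumerable<long> below). Since I already know how to implement that myself, my question is specifically about a built-in function for this task.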
Answer 0 (score: 1)
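I think Encoder.Convert is probably the method you're after.
I interpret the question as meaning:
I have a string that will be converted to UTF-8 bytes. I want to trim it so that its UTF-8 encoding is at most 4096 bytes, but I want to make sure I don't cut it in the middle of an encoded code point.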
// Requires: using System.Text;
private static string GetTelegramMessage(string message)
{
    const int telegramMessageMaxLength = 4096; // https://core.telegram.org/method/messages.sendMessage#return-errors
    const string tooLongMessageSuffix = "...";
    if (string.IsNullOrEmpty(message) || Encoding.UTF8.GetByteCount(message) <= telegramMessageMaxLength)
    {
        return message;
    }
    var encoder = Encoding.UTF8.GetEncoder();
    // Leave room for the suffix inside the byte budget
    byte[] buffer = new byte[telegramMessageMaxLength - Encoding.UTF8.GetByteCount(tooLongMessageSuffix)];
    char[] messageChars = message.ToCharArray();
    // Encode as many whole characters as fit into the buffer
    encoder.Convert(
        chars: messageChars,
        charIndex: 0,
        charCount: messageChars.Length,
        bytes: buffer,
        byteIndex: 0,
        byteCount: buffer.Length,
        flush: false,
        charsUsed: out int charsUsed,
        bytesUsed: out int bytesUsed,
        completed: out bool completed);
    // I don't think we can return message.Substring(0, charsUsed)
    // as that's the number of UTF-16 chars, not the number of codepoints
    // (think about surrogate pairs). Therefore I think we need to
    // actually convert bytes back into a new string
    return Encoding.UTF8.GetString(buffer, 0, bytesUsed) + tooLongMessageSuffix;
}
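For comparison with the Rust version in the question, the same idea (fill a byte budget, never split an encoded code point) can be sketched without a per-character scan. This is only an illustrative sketch: the name `truncate_utf8` is hypothetical, and it assumes `max_bytes` is at least as large as the 3-byte suffix. Unlike the question's version, it returns short messages unchanged.

```rust
/// Hypothetical helper: trims `message` so that its UTF-8 encoding, including
/// the "..." suffix, is at most `max_bytes` long, without splitting a code point.
/// Assumes `max_bytes >= 3` (the suffix length).
fn truncate_utf8(message: &str, max_bytes: usize) -> String {
    const SUFFIX: &str = "...";
    // `str::len` is already the UTF-8 byte length.
    if message.len() <= max_bytes {
        return message.to_string();
    }
    // Byte budget left for the kept prefix.
    let mut end = max_bytes.saturating_sub(SUFFIX.len());
    // Step back until `end` falls on a character boundary
    // (at most three steps, since a UTF-8 sequence is at most 4 bytes).
    while !message.is_char_boundary(end) {
        end -= 1;
    }
    format!("{}{}", &message[..end], SUFFIX)
}

fn main() {
    let long = "þ".repeat(3000); // 6000 bytes of 2-byte characters
    let out = truncate_utf8(&long, 4096);
    println!("{} bytes, {} chars", out.len(), out.chars().count());
}
```

Because the budget of 4093 bytes lands in the middle of a 2-byte character here, the cut moves back one byte, so the result is slightly shorter than the limit rather than invalid.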