Question

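I have looked at the GetByteCount method of ASCIIEncoding. It does a long calculation instead of just returning String.Length. That doesn't make any sense to me. Do you have any idea why?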

Answer 1

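EDIT: I've just tried to reproduce this, and I currently can't force ASCIIEncoding to use a different replacement. Instead, I have to use Encoding.GetEncoding to get a mutable instance. So for ASCIIEncoding itself, I agree... but for other implementations where IsSingleByte returns true, you would still have the potential problem described below.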

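Consider trying to get the byte count of a string which doesn't contain only ASCII characters. The encoding has to take the EncoderFallback into account... and that could do anything, including increasing the count by an indeterminate amount.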
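
To make that concrete, here is a minimal sketch of how the replacement string alone changes the count. FallbackCountDemo and CountWithReplacement are hypothetical names used only for illustration; the sketch assumes the "us-ascii" encoding is available through Encoding.GetEncoding.

```csharp
using System.Text;

static class FallbackCountDemo
{
    // Hypothetical helper (not part of .NET): byte count of `text` when
    // encoded as us-ascii with the given replacement string as the fallback.
    public static int CountWithReplacement(string text, string replacement)
    {
        Encoding enc = Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback(replacement),
            new DecoderReplacementFallback("?"));
        return enc.GetByteCount(text);
    }
}
```

For the 3-character string "a\u00E9b", CountWithReplacement returns 3 with "?", 7 with "[lol]" and 2 with an empty replacement string, so the count can end up above or below String.Length.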

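It could be optimized for the case where the encoder fallback is the "default" one, which just replaces non-ASCII characters with "?", though.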
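
A rough sketch of what such a fast path might look like. AsciiFastPathSketch and CountAsciiBytes are hypothetical names, not anything in the framework, and the shortcut relies on the assumption that the default "?" fallback emits exactly one byte per UTF-16 code unit, which is what the surrogate pair experiment in this answer suggests.

```csharp
using System.Text;

static class AsciiFastPathSketch
{
    // Hypothetical helper, not part of .NET: skip the scan when the encoding
    // is us-ascii (code page 20127) with the default "?" replacement.
    public static int CountAsciiBytes(Encoding encoding, string text)
    {
        var replacement = encoding.EncoderFallback as EncoderReplacementFallback;
        bool isDefaultAscii = encoding.CodePage == 20127
            && replacement != null
            && replacement.DefaultString == "?";

        // Assumption: one "?" per char under the default fallback, so the
        // count equals the string length. Otherwise ask the encoding itself.
        return isDefaultAscii ? text.Length : encoding.GetByteCount(text);
    }
}
```

With any other fallback (an exception fallback, or a longer or empty replacement string) the sketch simply delegates to GetByteCount, so the shortcut never changes the result for those cases.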

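Further edit: I've just tried to confuse it with a surrogate pair, hoping that the pair would be represented by a single question mark. Unfortunately not: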

string text = "x\ud800\udc00y";
Console.WriteLine(text.Length); // Prints 4
Console.WriteLine(Encoding.ASCII.GetByteCount(text)); // Still prints 4!

Answer 2

Interestingly, the Mono runtime doesn't seem to include that behaviour:

// Get the number of bytes needed to encode a character buffer.
public override int GetByteCount (char[] chars, int index, int count)
{
    if (chars == null) {
        throw new ArgumentNullException ("chars");
    }
    if (index < 0 || index > chars.Length) {
        throw new ArgumentOutOfRangeException ("index", _("ArgRange_Array"));
    }
    if (count < 0 || count > (chars.Length - index)) {
        throw new ArgumentOutOfRangeException ("count", _("ArgRange_Array"));
    }
    return count;
}

// Convenience wrappers for "GetByteCount".
public override int GetByteCount (String chars)
{
    if (chars == null) {
        throw new ArgumentNullException ("chars");
    }
    return chars.Length;
}

and further down:

[CLSCompliantAttribute(false)]
[ComVisible (false)]
public unsafe override int GetByteCount (char *chars, int count)
{
    return count;
}

Answer 3

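For multi-byte character encodings like UTF-8 this method makes sense, because a character is stored in 1 to 4 bytes. I guess the method is also there for fixed-size encodings like ASCII, where each character needs only 7 bits. In actual implementations, however, "aaaaaaaa" takes 8 bytes, because each ASCII character is stored in 1 byte (8 bits), so the length hack works only in the best case.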

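Previous versions of the .NET Framework allowed spoofing by ignoring the 8th bit. The current version has been changed so that non-ASCII code points fall back during the decoding of bytes.
Source: MSDN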
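
A small sketch of what that decoding fallback looks like in current versions. AsciiDecodeDemo and DecodeAscii are hypothetical names used only for illustration; the input byte 0xC1 is 'A' (0x41) with the 8th bit set.

```csharp
using System.Text;

static class AsciiDecodeDemo
{
    // Hypothetical helper: decode raw bytes with the built-in ASCII encoding.
    // Bytes above 0x7F are not valid ASCII and go through the decoder
    // fallback, which by default substitutes "?".
    public static string DecodeAscii(byte[] bytes)
    {
        return Encoding.ASCII.GetString(bytes);
    }
}
```

DecodeAscii(new byte[] { 0x41, 0xC1 }) gives "A?" rather than "AA", because the second byte is replaced instead of having its high bit stripped.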

I understood your question as: does a worst-case scenario exist for the length hack?

Encoding ae = Encoding.GetEncoding(
    "us-ascii",
    new EncoderReplacementFallback("[lol]"),
    new DecoderReplacementFallback("[you broke Me]"));

Console.WriteLine(ae.GetByteCount("õäöü"));

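This will return 20, because the string "õäöü" contains 4 characters that are all outside the limits of the "us-ascii" character set (U+0000 to U+007F), so after encoding the text will be "[lol][lol][lol][lol]".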

Why does GetByteCount of an IsSingleByte Encoding do calculations?