Question

I came across this code in a JavaScript open-source project.

validator.isLength = function (str, min, max) {
    // match surrogate pairs in the string, or fall back to an empty array if none are found
    var surrogatePairs = str.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g) || [];
    // each surrogate pair takes two code units, so subtract one per pair from the string length
    var len = str.length - surrogatePairs.length;
    // compare the length with min and max; max is optional, so skip the upper bound when it is undefined
    return len >= min && (typeof max === 'undefined' || len <= max);
};

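As far as I can tell, the code above checks the length of a string while counting each surrogate pair as a single character. So:

Is my understanding of the code correct?
What is a surrogate pair?

So far, all I know is that it has something to do with encoding.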

Answer 1

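Yes, your understanding is correct. The function returns the length of the string in Unicode code points.
JavaScript encodes its strings as UTF-16. This means that one 16-bit code unit (two bytes) represents each character in the Basic Multilingual Plane (code points up to 0xFFFF).

Characters with higher code points, such as emoji, do not fit in 2 bytes (16 bits), so they have to be encoded as two UTF-16 code units (4 bytes). These two-unit sequences are called surrogate pairs.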

Try this:

var len = "😀".length // There is an emoji in the string (if you don't see it)

VS

var str = "😀" // the same single emoji as in the first example
var surrogatePairs = str.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g) || [];
var len = str.length - surrogatePairs.length;

In the first example, len will be 2, because the emoji consists of two UTF-16 code units. In the second example, len will be 1.
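For comparison, here is a minimal sketch of another way to get the same count, assuming an ES2015 or later runtime. It relies on the string iterator, which steps through a string one code point at a time. The helper names codePointLength and regexLength are made up for this illustration and are not part of the validator library.

```javascript
// Count Unicode code points rather than UTF-16 code units.
// Spreading a string uses its iterator, which yields one element
// per code point, so a surrogate pair becomes a single element.
function codePointLength(str) {
  return [...str].length;
}

// The regex-based count from the question, kept for comparison.
function regexLength(str) {
  var surrogatePairs = str.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g) || [];
  return str.length - surrogatePairs.length;
}

var sample = "a\uD83D\uDE00b"; // "a", U+1F600 (an emoji), "b"
console.log(sample.length);           // 4 (UTF-16 code units)
console.log(codePointLength(sample)); // 3 (code points)
console.log(regexLength(sample));     // 3
```

Both approaches count code points, not what a user would perceive as characters: an emoji built from several code points (for example with a skin-tone modifier) still counts as more than one.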

You may also want to read "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" by Joel Spolsky.

Answer 2

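For your second question, see: What is a "surrogate pair" in Java? The term "surrogate pair" refers to a means of encoding Unicode characters with high code points in the UTF-16 encoding scheme.

In the Unicode character encoding, characters are mapped to values between 0x0 and 0x10FFFF.

Internally, Java uses the UTF-16 encoding scheme to store strings of Unicode text. In UTF-16, 16-bit (two-byte) code units are used. Since 16 bits can only cover the range of characters from 0x0 to 0xFFFF, some additional complexity is needed to store values above this range (0x10000 to 0x10FFFF). This is done using pairs of code units known as surrogates.

The surrogate code units fall into two ranges, known as "high surrogates" (0xD800 to 0xDBFF) and "low surrogates" (0xDC00 to 0xDFFF), depending on whether they are allowed at the start or at the end of the two-code-unit sequence.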
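The arithmetic behind the pairing can be sketched in JavaScript, which uses the same UTF-16 scheme. This is only an illustration of the standard UTF-16 calculation; toSurrogatePair and fromSurrogatePair are hypothetical helper names, not functions from any library.

```javascript
// Split a code point in the range 0x10000..0x10FFFF into a UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a supplementary code point");
  }
  var offset = codePoint - 0x10000;    // a 20-bit value
  var high = 0xD800 + (offset >> 10);  // top 10 bits go into the high surrogate
  var low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits go into the low surrogate
  return [high, low];
}

// Combine a high and a low surrogate back into the original code point.
function fromSurrogatePair(high, low) {
  return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

var pair = toSurrogatePair(0x1F600);
console.log(pair.map(function (unit) { return unit.toString(16); })); // [ 'd83d', 'de00' ]
console.log(fromSurrogatePair(pair[0], pair[1]).toString(16));        // 1f600
```

The 20 bits left after subtracting 0x10000 are split evenly, which is why each surrogate range holds exactly 1024 (2^10) values.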

https://msdn.microsoft.com/en-us/library/windows/desktop/dd374069%28v=vs.85%29.aspx?f=255&MSPPError=-2147217396

Hope this helps.

Answer 3

Have you tried Google?

The best description is at http://unicodebook.readthedocs.io/unicode_encodings.html#surrogates

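In UTF-16, some characters are stored in 16 bits and others in 32 bits.

A surrogate pair is a character representation made of two 16-bit code units (32 bits in total). Some code unit values are reserved for use as the first code unit of such a pair.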
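To see the reserved ranges in practice, a small sketch like the following labels each UTF-16 code unit of a string. classifyCodeUnits is a made-up helper name, and the ranges are the same ones used in the regular expression from the question.

```javascript
// Label every UTF-16 code unit in a string as an ordinary unit,
// a high (leading) surrogate, or a low (trailing) surrogate.
function classifyCodeUnits(str) {
  var result = [];
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    var kind = "ordinary";
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      kind = "high surrogate";
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      kind = "low surrogate";
    }
    result.push(unit.toString(16) + " " + kind);
  }
  return result;
}

console.log(classifyCodeUnits("A\uD83D\uDE00"));
// [ '41 ordinary', 'd83d high surrogate', 'de00 low surrogate' ]
```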
