Question

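I'm using this character, the double sharp '𝄪', whose Unicode code point is 0x1D12A.
If I use it in a string, I can't get the correct string length:

str = "F𝄪"
str.length // returns 3, even though there are 2 characters!

How can I get a function that returns the correct answer, whether or not the string contains special Unicode characters?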

Answer 1

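// Requires the grapheme-splitter package from npm.
const GraphemeSplitter = require('grapheme-splitter');
const splitter = new GraphemeSplitter();

// Number of Unicode code points (the string iterator steps by code point).
String.prototype.codes = function() { return [...this].length };
// Number of user-perceived characters (grapheme clusters).
String.prototype.chars = function() {
    return splitter.countGraphemes(String(this));
}

console.log("F𝄪".codes());     // 2
console.log("👩‍❤️‍💋‍👩".codes());     // 8
console.log("❤️".codes());      // 2

console.log("F𝄪".chars());     // 2
console.log("👩‍❤️‍💋‍👩".chars());     // 1
console.log("❤️".chars());      // 1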
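If adding a dependency isn't an option, the built-in Intl.Segmenter can do a similar grapheme count. This is only a minimal sketch: it assumes a runtime that ships Intl.Segmenter (recent Node.js and browsers), and the name countGraphemes is illustrative rather than part of the answer above.

```javascript
// Assumes the runtime provides Intl.Segmenter (recent Node.js and browsers).
const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

// Count user-perceived characters by iterating over grapheme segments.
function countGraphemes(str) {
  let count = 0;
  for (const _segment of segmenter.segment(str)) {
    count += 1;
  }
  return count;
}

console.log(countGraphemes("F\u{1D12A}"));   // 2: "F" followed by U+1D12A
console.log(countGraphemes("\u2764\uFE0F")); // 1: heart plus variation selector
```

Results for long emoji sequences can depend on the Unicode data bundled with the runtime, so it's worth checking them on the engines you target.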

Answer 2

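To summarize my comments:

That's just the length of that string.

Some characters are made up of several other characters, even though they look like a single character. "̉mủt̉ả̉̉̉t̉ẻd̉W̉ỏ̉r̉̉d̉̉".length == 24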
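To make that concrete, here is a small sketch of how one visible letter can be either a single code point or a base letter plus combining marks. The helper countIgnoringMarks is hypothetical and only a rough approximation: it drops every combining mark (\p{M}), which is not the same as proper grapheme segmentation.

```javascript
const precomposed = "\u1EE7";  // "ủ" as a single code point
const decomposed = "u\u0309";  // "u" followed by COMBINING HOOK ABOVE

console.log(precomposed.length); // 1
console.log(decomposed.length);  // 2

// NFC composes base + mark pairs where a precomposed form exists.
console.log(decomposed.normalize("NFC") === precomposed); // true

// Rough approximation: drop all combining marks, then count code points.
function countIgnoringMarks(str) {
  return [...str.replace(/\p{M}/gu, "")].length;
}

console.log(countIgnoringMarks("u\u0309\u0309\u0309")); // 1
```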

This (great) blog post has a function that will return the correct length:

function fancyCount(str){
  const joiner = "\u{200D}";
  const split = str.split(joiner);
  let count = 0;
    
  for(const s of split){
    //removing the variation selectors
    const num = Array.from(s.split(/[\ufe00-\ufe0f]/).join("")).length;
    count += num;
  }
    
  //assuming the joiners are used appropriately
  return count / split.length;
}

console.log(fancyCount("F𝄪") == 2) // true

Answer 3

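JavaScript (and Java) strings use UTF-16 encoding.

The Unicode code point U+0046 (F) is encoded in UTF-16 using 1 code unit: 0x0046

The Unicode code point U+1D12A (𝄪) is encoded in UTF-16 using 2 code units (known as a "surrogate pair"): 0xD834 0xDD2A

That is why you get a length of 3 and not 2. length counts the number of UTF-16 code units, not the number of Unicode code points.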
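The encoding described above can be checked directly. A minimal sketch, where codeUnitsHex is an illustrative helper name and not a built-in:

```javascript
// List each UTF-16 code unit of a string as a hex literal.
function codeUnitsHex(str) {
  const units = [];
  for (let i = 0; i < str.length; i++) {
    const hex = str.charCodeAt(i).toString(16).toUpperCase().padStart(4, "0");
    units.push("0x" + hex);
  }
  return units;
}

const sample = "F\u{1D12A}"; // "F" followed by U+1D12A

console.log(codeUnitsHex(sample).join(" "));    // 0x0046 0xD834 0xDD2A
console.log(sample.length);                     // 3 (code units)
console.log([...sample].length);                // 2 (code points)
console.log(sample.codePointAt(1).toString(16)); // 1d12a
```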

Answer 4

Here is a function I wrote to get the length of a string in code points

function nbUnicodeLength(string){
    var stringIndex = 0;
    var unicodeIndex = 0;
    var length = string.length;
    var second;
    var first;
    while (stringIndex < length) {

        first = string.charCodeAt(stringIndex);  // returns an integer between 0 and 65535 representing the UTF-16 code unit at the given index.
        if (first >= 0xD800 && first <= 0xDBFF && string.length > stringIndex + 1) {
            second = string.charCodeAt(stringIndex + 1);
            if (second >= 0xDC00 && second <= 0xDFFF) {
                stringIndex += 2;
            } else {
                stringIndex += 1;
            }
        } else {
            stringIndex += 1;
        }

        unicodeIndex += 1;
    }
    return unicodeIndex;
}
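In engines that support ES2015, the string iterator does the same surrogate pairing, so the loop can be shorter. A sketch of that alternative, with countCodePoints as an illustrative name; like the function above, it counts an unpaired surrogate as one unit:

```javascript
// Count code points by letting the string iterator pair up surrogates.
function countCodePoints(str) {
  let count = 0;
  for (const _codePoint of str) {
    count += 1;
  }
  return count;
}

console.log(countCodePoints("F\u{1D12A}")); // 2
console.log(countCodePoints("\uD834"));     // 1 (lone high surrogate)
```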

Getting the length of a string containing Unicode characters above 0xFFFF
