Question

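I am trying to find a way to compress/decompress a string in JavaScript. By compress I mean making the string look shorter (fewer chars). That is my goal.

Here is an example of how it should work: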

// The string that I want to make shorter
// It will only contain [a-zA-Z0-9] chars and some punctuation like ()[]{}.,;'"!
var string = "I like bananas !";

// The compressed string, maybe something like "䐓㐛꯱字",
// which is shorter than the original
var shortString = compress(string);  

// The original string, "I like bananas !"
var originalString = decompress(shortString);

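Here is my first idea (maybe there are better ways to reach my goal; if so, I am interested).

I know my original string will be UTF-8. So I am thinking of using UTF-32 for the encoding, which should divide the length of the string by 4.

But I don't know how to construct a new string with a different encoding. Here is my code so far, which doesn't work...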

function compress(string) {
    string = unescape(encodeURIComponent(string));
    var newString = '';

    for (var i = 0; i < string.length; i++) {
        var char = string.charCodeAt(i);
        newString += parseInt(char, 8).toString(32);
    }

    return newString;
}

Answer 1

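Since you are using a set of fewer than 100 characters, and JavaScript strings are encoded in UTF-16 (which means each code unit has 65536 possible values), you can concatenate character codes so that you get one "compressed" character for every two basic characters. This lets you compress strings to half their length.

Like this: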

document.getElementById('compressBtn').addEventListener('click', function() {
  var stringToCompress = document.getElementById('tocompress').value;
  var compressedString = compress(stringToCompress);
  var decompressedString = decompress(compressedString);

  if (stringToCompress === decompressedString) {
    document.getElementById('display').innerHTML = stringToCompress + ", length of " + stringToCompress.length  + " characters compressed to " + compressedString + ", length of " + compressedString.length + " characters back to " + decompressedString;
  } else {
    document.getElementById('display').innerHTML = "This string cannot be compressed"
  }

})


function compress(string) {
  string = unescape(encodeURIComponent(string));
  var newString = '',
    char, nextChar, combinedCharCode;

  for (var i = 0; i < string.length; i += 2) {
    char = string.charCodeAt(i);

    if ((i + 1) < string.length) {

      // The second char code must fit in 2 digits, otherwise the combined code could exceed 65535.
      // Subtracting 31 maps the printable ASCII range (32-126) to 1-95; the first 32 codes (control
      // characters) are not in your character set. Limitation: any charCode below 32 will break this.
      nextChar = string.charCodeAt(i + 1) - 31;

      // this is to pad the result, because you could have a code that is single digit, which would make 
      // decompression a bit harder
      combinedCharCode = char + "" + nextChar.toLocaleString('en', {
        minimumIntegerDigits: 2
      });

      // Take the concatenated code string, convert it back to a number, then to a character
      newString += String.fromCharCode(parseInt(combinedCharCode, 10));

    } else {

      // The string length is not always even, so the last character may be left unpaired
      newString += string.charAt(i);
    }
  }
  return newString;
}

function decompress(string) {

  var newString = '',
    char, codeStr, firstCharCode, lastCharCode;

  for (var i = 0; i < string.length; i++) {
    char = string.charCodeAt(i);
    if (char > 132) {
      codeStr = char.toString(10);

      // You take the first part of the compressed char code, it's your first letter
      firstCharCode = parseInt(codeStr.substring(0, codeStr.length - 2), 10);

      // For the second one you need to add 31 back.
      lastCharCode = parseInt(codeStr.substring(codeStr.length - 2, codeStr.length), 10) + 31;

      // You put back the 2 characters you had originally
      newString += String.fromCharCode(firstCharCode) + String.fromCharCode(lastCharCode);
    } else {
      newString += string.charAt(i);
    }
  }
  return newString;
}

var stringToCompress = 'I like bananas!';
var compressedString = compress(stringToCompress);
var decompressedString = decompress(compressedString);

document.getElementById('display').innerHTML = stringToCompress + ", length of " + stringToCompress.length  + " characters compressed to " + compressedString + ", length of " + compressedString.length + " characters back to " + decompressedString;

body {
  padding: 10px;
}

#tocompress {
  width: 200px;
}

<input id="tocompress" placeholder="enter string to compress" />
<button id="compressBtn">
  Compress input
</button>
<div id="display">

</div>

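As for possibly using UTF-32 to compress further, I am not sure it is possible. I might be wrong, but from my understanding it is not feasible. Here is why:

The approach above basically concatenates two 1-byte values into one 2-byte value. This is possible because JavaScript strings are encoded in 2-byte (16-bit) units. (Note that, from what I understand, the engine may decide to store strings differently, which would make this compression pointless from a pure memory standpoint. That said, in the end one character is still considered to be 16 bits.) A cleaner way to do the compression above would be to work with binary numbers instead of decimal ones, which makes much more sense. For example: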

document.getElementById('compressBtn').addEventListener('click', function() {
  var stringToCompress = document.getElementById('tocompress').value;
  var compressedString = compress(stringToCompress);
  var decompressedString = decompress(compressedString);

  if (stringToCompress === decompressedString) {
    document.getElementById('display').innerHTML = stringToCompress + ", length of " + stringToCompress.length  + " characters compressed to " + compressedString + ", length of " + compressedString.length + " characters back to " + decompressedString;
  } else {
    document.getElementById('display').innerHTML = "This string cannot be compressed"
  }

})


function compress(string) {
  string = unescape(encodeURIComponent(string));
  var newString = '',
    char, nextChar, combinedCharCode;

  for (var i = 0; i < string.length; i += 2) {
  
  // convert to binary instead of keeping the decimal
    char = string.charCodeAt(i).toString(2);

    if ((i + 1) < string.length) {

     
      nextChar = string.charCodeAt(i + 1).toString(2) ;
     

      // you still need padding, see this answer https://stackoverflow.com/questions/27641812/way-to-add-leading-zeroes-to-binary-string-in-javascript
      combinedCharCode = "0000000".substr(char.length) + char + "" + "0000000".substr(nextChar.length) + nextChar;

      // Take the concatenated binary string, parse it as a binary number, then convert it to a character
      newString += String.fromCharCode(parseInt(combinedCharCode, 2));

    } else {

      // The string length is not always even, so the last character may be left unpaired
      newString += string.charAt(i);
    }
  }
  return newString;
}

function decompress(string) {

  var newString = '',
    char, codeStr, firstCharCode, lastCharCode;

  for (var i = 0; i < string.length; i++) {
    char = string.charCodeAt(i);
    if (char > 132) {
      codeStr = char.toString(2);

      // Everything above the low 7 bits of the compressed char code is your first letter
      firstCharCode = parseInt(codeStr.substring(0, codeStr.length - 7), 2);

      // then the low 7 bits are the second letter
      lastCharCode = parseInt(codeStr.substring(codeStr.length - 7, codeStr.length), 2);

      // You put back the 2 characters you had originally
      newString += String.fromCharCode(firstCharCode) + String.fromCharCode(lastCharCode);
    } else {
      newString += string.charAt(i);
    }
  }
  return newString;
}

var stringToCompress = 'I like bananas!';
var compressedString = compress(stringToCompress);
var decompressedString = decompress(compressedString);

document.getElementById('display').innerHTML = stringToCompress + ", length of " + stringToCompress.length  + " characters compressed to " + compressedString + ", length of " + compressedString.length + " characters back to " + decompressedString;

<input id="tocompress" placeholder="enter string to compress" />
<button id="compressBtn">
  Compress input
</button>
<div id="display">

</div>
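The same 7-bit packing can also be written with bitwise operators instead of binary strings. The following is only a sketch under stated assumptions, not the code from this answer: `packPairs` and `unpackPairs` are hypothetical names, the input is assumed to be pure ASCII (codes 0 to 127), and bit 14 is used as a marker so that a packed pair can never be confused with a trailing unpaired character.

```javascript
// Sketch: pack two 7-bit ASCII codes into one 16-bit code unit.
// Assumption: every input character has a char code in [0, 127].
function packPairs(str) {
  let out = '';
  for (let i = 0; i < str.length; i += 2) {
    const a = str.charCodeAt(i);
    if (a > 127) throw new RangeError('non-ASCII character at index ' + i);
    if (i + 1 < str.length) {
      const b = str.charCodeAt(i + 1);
      if (b > 127) throw new RangeError('non-ASCII character at index ' + (i + 1));
      // Bit 14 marks a packed pair; the result lies in 0x4000..0x7FFF,
      // which stays clear of the surrogate range 0xD800..0xDFFF.
      out += String.fromCharCode(0x4000 | (a << 7) | b);
    } else {
      // Odd length: the last character is stored as-is (bit 14 is clear).
      out += str.charAt(i);
    }
  }
  return out;
}

function unpackPairs(packed) {
  let out = '';
  for (let i = 0; i < packed.length; i++) {
    const code = packed.charCodeAt(i);
    if (code & 0x4000) {
      // High 7 bits are the first character, low 7 bits the second.
      out += String.fromCharCode((code >> 7) & 0x7f, code & 0x7f);
    } else {
      out += packed.charAt(i);
    }
  }
  return out;
}

const packed = packPairs('I like bananas!');
console.log(packed.length, unpackPairs(packed));
```

With the marker bit there is no need for a magic threshold in the decoder, and control characters no longer break the round trip.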

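So why not push the logic further and use UTF-32, which should be 4 bytes, meaning four 1-byte characters? One problem is that JavaScript has 2-byte strings. It is true that you can use pairs of 16-bit code units (surrogate pairs) to represent characters beyond the 16-bit range. Like this: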

document.getElementById('test').innerHTML = "\uD834\uDD1E";

<div id="test"></div>

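But if you test the length of the resulting string, you will see that it is 2, even though there is only one "character". So from JavaScript's perspective, you have not reduced the actual string length.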
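This is easy to check in a console. The short sketch below uses only standard string methods; the variable names are just for illustration.

```javascript
// U+1D11E (musical symbol G clef) written as a surrogate pair.
const clef = "\uD834\uDD1E";

const codeUnits = clef.length;               // counts 16-bit code units
const codePoints = Array.from(clef).length;  // iterates by code point
const hex = clef.codePointAt(0).toString(16);

console.log(codeUnits);  // 2
console.log(codePoints); // 1
console.log(hex);        // "1d11e"
```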

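On the other hand, UTF-32 actually only has about 2²¹ characters. See here: https://en.wikipedia.org/wiki/UTF-32

It is a protocol to encode Unicode code points that uses exactly 32 bits per Unicode code point (but a number of leading bits must be zero, as there are fewer than 2²¹ Unicode code points).

So you don't really have 4 bytes; in fact you don't even have 3, which is what it would take to encode 3 characters. So UTF-32 does not look like a way to compress even more. And since JavaScript has native 2-byte strings, packing two characters per code unit looks to me like the most efficient option, at least with this approach.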

Answer 2

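If your strings only contain ASCII characters [0, 127], you can "compress" the string using a custom 6- or 7-bit code page.

You can do this in several ways, but I think one of the simpler methods is to define an array holding all allowed characters (a LUT, or lookup table, if you will) and then use its index value as the encoded value. You will of course have to manually mask and shift the encoded values into a typed array.

If your LUT looked like this: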

var lut = " abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,:;!(){}";

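In this case you would be dealing with a LUT of length 72, which means we need a 7-bit range, or [0, 127] (if the length were 64 or less, we could reduce it to 6-bit [0, 63] values).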
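The bit width follows directly from the LUT size. As a small sketch (the helper name `bitsNeeded` is hypothetical, not part of the answer), it can be computed rather than counted by hand:

```javascript
const lut = " abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,:;!(){}";

// Number of bits needed to represent every index in [0, size - 1].
function bitsNeeded(size) {
  return Math.max(1, Math.ceil(Math.log2(size)));
}

console.log(bitsNeeded(lut.length)); // 7
console.log(bitsNeeded(64));         // 6
```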

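Then you would convert each character in the string to its index value (you would normally do all of the following steps in a single operation, but I have separated them for simplicity):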

var lut = " abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,:;!(){}";
var str = "I like bananas !";
var page = [];

Array.prototype.forEach.call(str, function(ch) {
  var i = lut.indexOf(ch);
  if (i < 0) throw "Invalid character - can't encode";
  page.push(i);
});

console.log("Intermediate page:", page);

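You could always adjust the LUT so that the most used characters are at the beginning, then support a variable encoding bit range: find the largest index value and use it to determine which range to encode in. You can add an initial bit as a flag for which range the encoding uses (for example, bit 0 set if 6 bits fit, otherwise use the 7-bit range).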
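A rough sketch of that range selection, using the LUT from above. The helpers `toPage` and `chooseBitWidth` are hypothetical names introduced for illustration, and only the 6-bit and 7-bit ranges discussed here are considered.

```javascript
const lut = " abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,:;!(){}";

// Convert a string to LUT indexes, as in the snippet above.
function toPage(str) {
  return Array.from(str, (ch) => {
    const i = lut.indexOf(ch);
    if (i < 0) throw new Error("Invalid character - can't encode");
    return i;
  });
}

// Pick the narrower of the two ranges that still fits every index.
function chooseBitWidth(page) {
  const max = Math.max(0, ...page);
  return max < 64 ? 6 : 7;
}

console.log(chooseBitWidth(toPage("i like bananas")));   // 6
console.log(chooseBitWidth(toPage("I like bananas !"))); // 7, because "!" sits at index 67
```

The chosen width would then be stored as the flag bit in front of the payload so the decoder knows which range to read.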

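Now that you know the indexes, we can start encoding the binary output itself using the 7-bit approach. Since the smallest unit we can address in a typed array is a byte, i.e. 8 bits wide, we have to do all the split, shift and merge operations manually.

This means we need to keep track of the remainder and the position at the bit level.

Say the first index value was the following 7-bit value (full 7-bit range for readability; all in pseudo format):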

&b01111111

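The first step is to shift it up so that it starts at the byte's first (leftmost) bit, and to keep track of the remainder: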

&b01111111 << 1

Resulting in:

&b11111110
         ^
new bit position: 7
new remainder   : 1

Then the next index value, for example:

&b01010101

would be encoded like this. First, convert it to a 7-bit value in its own byte representation:

&b01010101 << 1 => &b10101010

Then get the remainder part first. To obtain it, shift everything right by 8 bits minus the current remainder (within modulo 8):

remainderValue = &b10101010 >>> (8 - remainder)

Leaving us with the following representation:

&b00000001

(Note that we use the triple >>> to shift right, to avoid issues with the sign.)
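The difference only shows once the top bit of the 32-bit value is set, which is why it is a safe habit rather than a strict requirement for plain byte values. A tiny demonstration:

```javascript
// JavaScript bitwise operators work on 32-bit integers.
const value = 0x80000000 | 0; // top bit set, so this is -2147483648 as a signed int32

const signedShift = value >> 28;    // sign-propagating: copies the sign bit in from the left
const unsignedShift = value >>> 28; // zero-fill: shifts zeros in from the left

console.log(signedShift);   // -8
console.log(unsignedShift); // 8

// For values in the byte range [0, 255] both operators give the same result.
console.log((0xAA >> 7) === (0xAA >>> 7)); // true
```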

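The next step is to merge this value with the previous value that has already been encoded and stored in our destination byte array. For this we use an OR operation: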

Index 0      New value     Result in index 0 (index of dst. array)
&b11111110 | &b00000001 => &b11111111

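Then go to the next index in the destination array and store the rest of the current value, then update the remainder and position.

The "leftover" for the byte is calculated like this, using the original (already shifted) 7-bit byte value: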

leftover = &b10101010 << remainder => &b01010100

We are now at the next position:

Index 0    Index 1   (destination array index, not page index)
&b11111111 01010100
                 ^

new bit position: 14
new remainder   : 2

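And so on for the remaining index values. For actual code on how to do this in JavaScript, see this answer. The code in that answer does not deal with string encoding per se, but it shows how to shift a byte buffer bit by bit, which is essentially what you need for this task.

To calculate the remainder step, use 8 bits minus the custom bit range: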

step = 8 - newRange (here 7) => 1

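This is also the starting remainder. For each character, you add the step to the remainder after it has been processed, but remember to take it modulo 8 (the byte width) when you use it for shifting: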

remainder += step; numOfBitsToShift = remainder % 8;

The bit position of course uses the bit range, in this case 7:

bitPosition += 7;

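Then, to find which indexes you are dealing with, divide bitPosition by 8. If there is a fractional part, you have to deal with two indexes (old and new); if there is none, the current position represents a new index only (only a shift is needed for the current index value).

You can also use modulo: when remainder modulo 8 equals step, you know you are dealing with a single index in the destination.

To calculate the final length, multiply the bit width by the length of the string, divide by 8, then ceil the result so that all characters fit into an 8-bit byte array, which is the smallest unit a typed array offers in JavaScript: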

dstLength = Math.ceil(7 * str.length / 8);

To decode, simply reverse all the steps.
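The walkthrough above can be sketched as follows. This is an illustrative sketch rather than the code from the linked answer: `encode` and `decode` are hypothetical names, the bit width must be at most 8, and the decoder is given the original character count because the padding bits in the last byte would otherwise be ambiguous. The `offset` variable is the number of bits already used in the current byte, i.e. 8 minus the "remainder" used in the text.

```javascript
const lut = " abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,:;!(){}";

function encode(str, bits = 7) {
  const dst = new Uint8Array(Math.ceil(bits * str.length / 8));
  let bitPos = 0;
  for (const ch of str) {
    const value = lut.indexOf(ch);
    if (value < 0) throw new Error("Invalid character - can't encode: " + ch);
    const byteIndex = bitPos >> 3;        // destination byte
    const offset = bitPos & 7;            // bits already used in that byte
    const aligned = value << (8 - bits);  // left-align in a byte (<< 1 for 7 bits)
    dst[byteIndex] |= aligned >>> offset; // part that fits in the current byte
    if (offset + bits > 8) {
      // leftover bits spill into the next byte
      dst[byteIndex + 1] |= (aligned << (8 - offset)) & 0xff;
    }
    bitPos += bits;
  }
  return dst;
}

function decode(bytes, count, bits = 7) {
  const mask = (1 << bits) - 1;
  let out = '';
  let bitPos = 0;
  for (let n = 0; n < count; n++) {
    const byteIndex = bitPos >> 3;
    const offset = bitPos & 7;
    // 16-bit view over the current byte and the next one (0 if past the end)
    const pair = (bytes[byteIndex] << 8) | (bytes[byteIndex + 1] || 0);
    out += lut.charAt((pair >>> (16 - offset - bits)) & mask);
    bitPos += bits;
  }
  return out;
}

const str = "I like bananas !";
const bytes = encode(str);
console.log(str.length, "chars ->", bytes.length, "bytes");
console.log(decode(bytes, str.length));
```

In a real format the character count (and the bit width, if variable) would be stored in a small header in front of the payload.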

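An alternative, if you work with long strings or need to move fast, is to use an established compressor such as zlib, which has a very compact header as well as good performance in JavaScript in the case of the linked solution. This will also deal with "patterns" in the string to further optimize the resulting size.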
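As a rough illustration, here is what that looks like with the `zlib` module built into Node.js. This is a substitute for the linked browser solution, not the same library, and it assumes a Node.js environment. The sample text is repeated on purpose, because very short strings usually do not shrink once the header is counted.

```javascript
// Assumes Node.js; zlib is part of its standard library.
const zlib = require('node:zlib');

const text = 'I like bananas ! '.repeat(50);

// Compress the UTF-8 bytes, then restore them.
const compressed = zlib.deflateSync(Buffer.from(text, 'utf8'));
const restored = zlib.inflateSync(compressed).toString('utf8');

console.log(Buffer.byteLength(text, 'utf8'), 'bytes ->', compressed.length, 'bytes');
console.log(restored === text);
```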

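Disclaimer: since this is mostly a theoretical answer, there might be some errors. Feel free to comment if you find any. See the linked answer for an actual code example.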

Answer 3

For the full code, see here: https://repl.it/NyMl/1

You can work with bytes by using a Uint8Array.

let msg = "This is some message";

let data = []

for(let i = 0; i < msg.length; ++i){
  data[i] = msg.charCodeAt(i);
}

let i8 = new Uint8Array(data);
let i16 = new Uint16Array(i8.buffer);
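To turn that 16-bit view back into a shorter string and recover the original, something along these lines should work. It is only a sketch: `packTo16` and `unpackFrom16` are hypothetical names, the input is assumed to be ASCII without NUL characters (a NUL byte is used as padding for odd lengths), and the byte order inside each 16-bit unit depends on the platform, so the packed string should be decoded on a machine with the same endianness.

```javascript
function packTo16(msg) {
  // A Uint16Array view needs an even byte length, so pad odd lengths with one 0 byte.
  const i8 = new Uint8Array(msg.length + (msg.length % 2));
  for (let i = 0; i < msg.length; i++) {
    i8[i] = msg.charCodeAt(i); // assumes char codes <= 127
  }
  const i16 = new Uint16Array(i8.buffer);
  return Array.from(i16, (c) => String.fromCharCode(c)).join('');
}

function unpackFrom16(packed) {
  const i16 = new Uint16Array(packed.length);
  for (let i = 0; i < packed.length; i++) {
    i16[i] = packed.charCodeAt(i);
  }
  const i8 = new Uint8Array(i16.buffer);
  let out = Array.from(i8, (c) => String.fromCharCode(c)).join('');
  // Drop the padding byte added for odd-length input.
  if (out.endsWith('\0')) out = out.slice(0, -1);
  return out;
}

const packedMsg = packTo16("This is some message");
console.log(packedMsg.length);        // 10
console.log(unpackFrom16(packedMsg)); // "This is some message"
```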

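You could also consider compression like this: http://pieroxy.net/blog/pages/lz-string/demo.html

If you don't want to use a third-party library, LZ-based compression should be fairly simple to implement. See here (Wikipedia).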

Answer 4

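I use the same library mentioned above, lz-string (https://github.com/pieroxy/lz-string), and it creates file sizes smaller than most binary formats, such as Protocol Buffers.

I compress via Node.js like this: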

var compressedString = LZString.compressToUTF16(str);

And I decompress on the client side like this:

var decompressedString = LZString.decompressFromUTF16(str);

Shorten a UTF-8 string using UTF-32 encoding in JavaScript?
