Question

我将UTF-8字符串的二进制表示形式作为数值数组，每个数字值的范围为0..255。

如何使用jq将该数组转换为字符串？内置implode仅处理代码点数组。此外，jq中没有按位操作的功能。

此类数组在newman（Postman CLI）json输出中显示为值，属性response.stream.data。

例如，字符串“嗨，Мир！”进入[72,105,44,32,208,156,208,184,209,128,33]字节数组，而其代码点为[72,105,44,32,1052,1080,1088,33]。后者的implode给出原始字符串，而前者的implode给出“嗨，ÐÐ¸Ñ！”或类似的东西。

Answer 1

def btostring:
  if length == 0 then ""

  elif .[0] >= 240 then
     ([((((.[0] - 240) * 64) + (.[1] - 128)) * 64 + (.[2] - 128)) * 64 + (.[3] - 128)]
      | implode) + (.[4:] | btostring)

  elif .[0] >= 224 then
     ([  ((.[0] - 224) * 64 +  (.[1] - 128)) * 64 + (.[2] - 128)]
      | implode) + (.[3:] | btostring)

  elif .[0] >= 128 then
     ([  (.[0] - 192)  * 64 +  (.[1] - 128) ]
      | implode) + (.[2:] | btostring)

  else  (.[0:1] | implode ) + (.[1:] | btostring)
  end;

实施例

def hi: [72,105,44,32,208,156,208,184,209,128,33] ;

hi | btostring

输出（使用jq -r）：

Hi, Мир!

扩展示例：

def hi: [72,105,44,32,208,156,208,184,209,128,33];
def euro: [226,130,172];        # 11100010 10000010 10101100
def fire: [240,159,156,130];    # 11110000 10011111 10011100 10000010

(hi, euro, fire) | btostring

输出：

Hi, Мир!
€

（在某些设备上，上面的最后一行是一个方框而不是一个三角形。）

Answer 2

另一种使用foreach的方法。主要思想是保持当前char的剩余字节数（.[0]）和到目前为止读取的位（.[1]）。这是过滤器：

[foreach .[] as $item (
    [0, 0]
    ;
    if .[0] > 0 then [.[0] - 1, .[1] * 64 + ($item % 64)]
    elif $item >= 240 then [3, $item % 8]
    elif $item >= 224 then [2, $item % 16]
    elif $item >= 192 then [1, $item % 32]
    elif $item < 128 then [0, $item]
    else error("Malformed UTF-8 bytes")
    end
    ;
    if .[0] == 0 then .[1] else empty end
)] | implode

此外，对错误字节的错误检测尚未完成。

Answer 3

这是本页其他地方给出的递归btostring的非递归版本，主要是为了说明如何在jq中将递归实现转换为非递归实现。

def btostring:
  . as $in
  | [ foreach range(0;length) as $ix ({skip:0, .point:[]};
        if .skip > 0 then .skip += -1
        elif $in[$ix] >= 240 then
          .point = [(((($in[$ix]   - 240)  * 64)
                     + ($in[$ix+1] - 128)) * 64
                     + ($in[$ix+2] - 128)) * 64
                     + ($in[$ix+3] - 128)]
          | .skip = 3
        elif $in[$ix] >= 224 then
          .point = [  (($in[$ix]   - 224)  * 64
                     + ($in[$ix+1] - 128)) * 64
                     + ($in[$ix+2] - 128)]
          | .skip = 2
        elif $in[$ix] >= 128 then
          .point = [   ($in[$ix]   - 192)  * 64
                     + ($in[$ix+1] - 128)]
          | .skip = 1
        else .point = $in[$ix:$ix+1]
        end;
        if .skip == 0 then .point|implode else empty end) ]
  | add ;

Answer 4

根据我成功使用的answer of peak，这是一个不会因无效的UTF-8 encoding而中断或失败的扩展。

无效的UTF-8起始字节（128-193、245-255）或序列被解释为ISO 8859-1。

def btostring:
    if type != "array" then .
    elif length == 0 then ""
    elif .[0] >= 245 then
        (.[0:1] | implode ) + (.[1:] | btostring)
    elif .[0] >= 240 then
        if length >= 4 and .[1] >= 128 and .[2] >= 128 and .[3] >= 128 then
            ([((((.[0] - 240) * 64) + (.[1] - 128)) * 64 + (.[2] - 128)) * 64 + (.[3] - 128)] | implode) + (.[4:] | btostring)
        else
            (.[0:1] | implode ) + (.[1:] | btostring)
        end
    elif .[0] >= 224 then
        if length >= 3 and .[1] >= 128 and .[2] >= 128 then
            ([  ((.[0] - 224) * 64  + (.[1] - 128)) * 64 + (.[2] - 128)] | implode) + (.[3:] | btostring)
        else
            (.[0:1] | implode ) + (.[1:] | btostring)
        end
    elif .[0] >= 194 then
        if length >= 2 and .[1] >= 128 then
            ([   (.[0] - 192) * 64  + (.[1] - 128) ] | implode) + (.[2:] | btostring)
        else
            (.[0:1] | implode ) + (.[1:] | btostring)
        end
    else
        (.[0:1] | implode ) + (.[1:] | btostring)
    end;

使用jq将数字字节值数组转换为字符串

4 个答案:

实施例

扩展示例：