I have a JSON file, input.json, whose data format is as follows:
{"userid":"04f","clients":[1,2]}
{"userid":"07f","clients":[1,6,7]}
{"userid":"082","clients":[2,6,1]}
{"userid":"0c1","clients":[3,9,8]}
{"userid":"13f","clients":[4]}
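The clients array can contain the numbers 1-10; it may have several elements, but no duplicates. I want to perform a bitwise operation on this file.
I expect output like this (a bitwise OR over the elements of the clients array):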
{"userid":"04f","clients":3} #$((1|2))=3
{"userid":"07f","clients":7} #$((1|6|7))=7
{"userid":"082","clients":7} #$((1|6|2))=7
{"userid":"0c1","clients":11} #$((3|9|8))=11
{"userid":"13f","clients":4} #$((4))=4
My file has about 250 million lines. I am looking for a bash solution. What is the fastest and best way to achieve this?
Answer 0: (score: 1)
Unfortunately, jq does not yet support bitwise operations. I suggest writing a small Python program:
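from collections import OrderedDict
from functools import reduce
import json

# Read the file named in the question, one JSON object per line
with open('input.json', 'r') as fd:
    for line in fd:
        # OrderedDict keeps the keys in their original order
        data = json.loads(line, object_pairs_hook=OrderedDict)
        # Fold the clients array with bitwise OR
        data['clients'] = reduce(lambda x, y: x | y, data['clients'])
        print(json.dumps(data))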
Output:
{"userid": "04f", "clients": 3}
{"userid": "07f", "clients": 7}
{"userid": "082", "clients": 7}
{"userid": "0c1", "clients": 11}
{"userid": "13f", "clients": 4}
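With roughly 250 million lines it may help to keep the per-line work small and to stream rather than hold anything in memory. The following is only a sketch of that idea, not part of the answer above: the names or_clients and process are made up for illustration, and treating an empty clients array as 0 is an assumption the question does not cover.

```python
import json
from functools import reduce
from operator import or_


def or_clients(line):
    """Return the JSON line with `clients` replaced by the OR of its elements."""
    record = json.loads(line)  # plain dicts keep key order on Python 3.7+
    # Assumption: an empty clients array maps to 0 (the initial value of the fold).
    record["clients"] = reduce(or_, record["clients"], 0)
    return json.dumps(record, separators=(",", ":"))  # compact, like the input


def process(lines):
    """Lazily transform any iterable of JSON lines (an open file works too)."""
    for line in lines:
        if line.strip():  # skip blank lines
            yield or_clients(line)


sample = [
    '{"userid":"04f","clients":[1,2]}',
    '{"userid":"0c1","clients":[3,9,8]}',
]
for out in process(sample):
    print(out)
```

To run it over the real file, pass an open file object (or sys.stdin) to process instead of the sample list.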
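Answer 1: (score: 1)
The following is based on two general-purpose filters (convert/1 and to_i/1) available at https://rosettacode.org/wiki/Non-decimal_radices/Convert#jq.
Their definitions are included below for completeness and ease of reference.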
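# input: an array of decimal numbers
# (in the actual program, the definitions of convert/1 and to_i/1
# given below must appear before this one)
def bitwise_or:
  map(convert(2) | explode | reverse | map(.-48))
  | transpose | map(max)
  | reverse
  | join("")
  | to_i(2) ;

.clients |= bitwise_or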

# Convert the input integer to a string in the specified base (2 to 36 inclusive)
def convert(base):
  def stream:
    recurse(if . > 0 then ./base|floor else empty end) | . % base ;
  if . == 0 then "0"
  else [stream] | reverse | .[1:]
  | if base < 10 then map(tostring) | join("")
    elif base <= 36 then map(if . < 10 then 48 + . else . + 87 end) | implode
    else error("base too large")
    end
  end;

# input string is converted from "base" to an integer, within limits
# of the underlying arithmetic operations, and without error-checking:
def to_i(base):
  explode
  | reverse
  | map(if . > 96 then . - 87 else . - 48 end)  # "a" ~ 97 => 10 ~ 87
  | reduce .[] as $c
      # state: [power, ans]
      ([1,0]; (.[0] * base) as $b | [$b, .[1] + (.[0] * $c)])
  | .[1];
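To make the idea behind bitwise_or easier to follow, here is a rough Python sketch of the same pipeline: write each number in binary, reverse the digits so the columns line up by bit position, take the maximum of each column, and read the result back as a base-2 number. This is only an illustration of the technique, not a translation of the jq code, and it assumes non-negative integers and a non-empty list.

```python
from itertools import zip_longest


def bitwise_or(numbers):
    """OR non-negative integers via the column-wise max of their binary digits."""
    # Least-significant digit first, so columns correspond to bit positions.
    reversed_digits = [format(n, "b")[::-1] for n in numbers]
    # Transpose; shorter numbers are padded with "0". Then take the max per column.
    columns = zip_longest(*reversed_digits, fillvalue="0")
    merged = "".join(max(column) for column in columns)
    # Restore most-significant-first order and parse the digits as base 2.
    return int(merged[::-1], 2)


print(bitwise_or([3, 9, 8]))
```

Taking the maximum of a column of 0/1 digits is the same as OR-ing them, which is why no bitwise operator is needed.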
Answer 2: (score: 0)
One way (since you said bash) would be to use awk.
tr -d "[]}" <input.json | awk -F ":" '{split($3,a,",") ;o=0;for (i in a) {o = or(o,a[i])};print $1":"$2":"o"}" }'
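GNU awk (gawk) has a bitwise OR function, used as or(arg1, arg2, ..., argn).
tr -d "[]}" removes the extra characters before the operation is performed.
split() stores the comma-separated values into an array.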
This gives:
{"userid":"04f","clients":3}
{"userid":"07f","clients":7}
{"userid":"082","clients":7}
{"userid":"0c1","clients":11}
{"userid":"13f","clients":4}
Note: this may not work with some other JSON layouts.
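The steps of the pipeline above can be mirrored one-to-one in Python, which also makes the caveat in the note concrete. This is just a sketch of the same text-based approach (the function name or_line is made up here); it is not a JSON parser and relies on the exact layout shown in the question.

```python
def or_line(line):
    """Mirror the tr/awk pipeline: strip brackets, split on ':', OR the numbers."""
    # tr -d "[]}" : delete the brackets and the closing brace
    stripped = line.strip().translate(str.maketrans("", "", "[]}"))
    # awk -F ":" : the third field holds the comma-separated client numbers
    fields = stripped.split(":")
    result = 0
    for value in fields[2].split(","):  # split($3, a, ",")
        result |= int(value)            # o = or(o, a[i])
    # Re-append the closing brace that was deleted in the first step.
    return f"{fields[0]}:{fields[1]}:{result}}}"


print(or_line('{"userid":"04f","clients":[1,2]}'))
```

A userid that itself contains a colon would shift the fields and make the third one non-numeric, which is the kind of input the note warns about.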
Answer 3: (score: 0)
Here is a jq solution. The constant 128 in twopowers can be changed to whatever value makes sense for the data (or it could even be replaced with a simple function that returns a stream of constants).

def twopowers:                      # return sequence of powers of 2
    128                             # largest power (change as desired)
  | log2 as $maxp                   # e.g. 7
  | $maxp - range($maxp+1)          # 7, 6, 5, 4, 3, 2, 1, 0
  | pow(2; .)                       # 128, 64, 32, 16, 8, 4, 2, 1
;

def base2powers:                    # e.g 81 -> [0,64,0,16,0,0,0,1]
  [
    foreach twopowers as $p (
        { v: . }
      ; .diff = .v - $p
      | .v    = if .diff >= 0 then .diff else .v end
      | .bit  = if .diff >= 0 then 1 else 0 end
      ; .bit * $p
    )
  ]
;

def combine:                        # given an array of base2powers arrays
  reduce .[] as $a (                # compute the element-wise max array
      []                            # and return its sum
    ; [   . as $b
        | $a
        | range(length)
        | [ $a[.], $b[.] ]
        | max
      ]
  )
  | add
;

.clients = (.clients | map(base2powers) | combine)
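For readers less familiar with jq, the decompose-then-combine idea can be sketched in Python as follows. The function names echo the jq filters for readability, but this is an illustration under the same assumption as above (no value exceeds 255 when the largest power is 128), not a port of the jq code.

```python
def two_powers(largest=128):
    """Powers of 2 from `largest` down to 1, e.g. [128, 64, ..., 2, 1]."""
    powers = []
    p = largest
    while p >= 1:
        powers.append(p)
        p //= 2
    return powers


def base2powers(value, powers):
    """Greedy decomposition, e.g. 81 -> [0, 64, 0, 16, 0, 0, 0, 1]."""
    result = []
    for p in powers:
        if value >= p:          # this power is present in value
            result.append(p)
            value -= p
        else:
            result.append(0)
    return result


def combine(arrays):
    """Element-wise max of equal-length arrays, then the sum of that array."""
    return sum(max(column) for column in zip(*arrays))


powers = two_powers()
print(base2powers(81, powers))
print(combine([base2powers(c, powers) for c in [3, 9, 8]]))
```

Each position holds either 0 or one fixed power of 2, so the element-wise maximum keeps a power exactly when at least one input contains it, and summing the result gives the OR.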

On further thought, we can eliminate the constant in twopowers by using the maximum value in the .clients array to compute the powers used for each input. Here is a version that does this.

def twopowers_v2:                   # return sequence of powers of 2 not exceeding given value
    .                               # e.g. 129
  | log2                            # 7.011227255423254
  | floor as $maxp                  # 7
  | $maxp - range($maxp+1)          # 7, 6, 5, 4, 3, 2, 1, 0
  | pow(2; .)                       # 128, 64, 32, 16, 8, 4, 2, 1
;

def base2powers_v2($powers):        # e.g 81 -> [64,0,16,0,0,0,1]
  [
    foreach $powers[] as $p (
        { v: . }
      ; .diff = .v - $p
      | .v    = if .diff >= 0 then .diff else .v end
      | .pow  = if .diff >= 0 then $p else 0 end
      ; .pow
    )
  ]
;

.clients = (
    .clients
  | [max|twopowers_v2] as $powers
  | map(base2powers_v2($powers))
  | combine
)

Nishant Kumar observed that if .clients is [0], the final result is null. This is because 0 | twopowers_v2 returns no values. To compensate for this, we can add an explicit check:

def twopowers_v3:                   # return sequence of powers of 2 not exceeding given value
  if . > 0 then                     # e.g. 129
      log2                          # 7.011227255423254
    | floor as $maxp                # 7
    | $maxp - range($maxp+1)        # 7, 6, 5, 4, 3, 2, 1, 0
    | pow(2; .)                     # 128, 64, 32, 16, 8, 4, 2, 1
  else                              #
    0                               # but if input is 0, return 0
  end                               #
;

.clients = (
    .clients
  | [max|twopowers_v3] as $powers
  | map(base2powers_v2($powers))
  | combine
)

Looking at peak's second solution, I noticed two things:
combine is equivalent to elementwise(max) | add
elementwise(max) is equivalent to transpose | map(max)
So combine can be dropped by writing transpose | map(max) | add in its place.
Using a "little-endian" representation of the bit arrays is also simpler than this approach.
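As a cross-check of the zero case and of the transpose | map(max) | add form, here is a Python sketch of the whole pipeline. The names powers_up_to and or_clients are invented for this illustration, and the input is assumed to be a non-empty list of non-negative integers.

```python
def powers_up_to(value):
    """Powers of 2 from the largest one not exceeding value down to 1; [0] for 0."""
    if value > 0:
        top = value.bit_length() - 1  # floor(log2(value)) without floating point
        return [2 ** e for e in range(top, -1, -1)]
    return [0]  # explicit zero check, so that [0] does not end up with no powers


def or_clients(clients):
    """OR the values using only comparison, subtraction, max and sum."""
    powers = powers_up_to(max(clients))
    rows = []
    for value in clients:
        row = []
        for p in powers:  # greedy decomposition over the shared list of powers
            if value >= p:
                row.append(p)
                value -= p
            else:
                row.append(0)
        rows.append(row)
    # transpose, take the max of each column, then add
    return sum(max(column) for column in zip(*rows))


print(or_clients([3, 9, 8]))
print(or_clients([0]))
```

Because the list of powers is derived from the largest element, every row has the same length and a plain zip is enough for the transposition.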