Navigating a large JSON file

Time: 2017-01-05 18:03:47

Tags: json path key schema jq

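I have a huge JSON file with some very deep paths in it. I would like to use jq to show the first N levels of keys, hiding anything deeper. Then, once I have found a key I am interested in, I want to keep drilling down, showing only N levels from my new starting point, much like a text editor folding everything below N levels. Is this possible?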

3 answers:

Answer 0: (score: 1)

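If you are interested in looking at objects at a specific depth, you can use getpath and paths. paths returns the paths to all values in the graph. You can filter those paths down to paths of a certain length, then use getpath to get the corresponding values.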

For example, to see all the values at depth 3 of the current object:

getpath(paths | select(length == 3))

You can then filter further and narrow things down as needed.
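For readers who want to check the logic outside jq, here is a minimal Python sketch of the same idea. The function names mirror the jq builtins, but values_at_depth and the sample document are hypothetical, and this is only an approximation of how jq enumerates paths, not its actual implementation.

```python
def paths(value, prefix=()):
    """Yield the path (a tuple of keys/indices) to every nested value,
    roughly like jq's `paths` (the empty root path is not emitted)."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)
    else:
        return
    for key, child in items:
        path = prefix + (key,)
        yield path
        yield from paths(child, path)


def getpath(value, path):
    """Follow a path of keys/indices down from the root value."""
    for key in path:
        value = value[key]
    return value


def values_at_depth(value, depth):
    """Rough equivalent of: getpath(paths | select(length == depth))."""
    return [getpath(value, p) for p in paths(value) if len(p) == depth]


# Hypothetical sample document, for illustration only.
doc = {"a": {"b": {"c": 1, "d": [2, 3]}}, "e": 4}
print(values_at_depth(doc, 3))  # [1, [2, 3]]
```

Changing the comparison from == to <= lists everything down to a given level instead of a single level.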

Answer 1: (score: 0)

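Appended below is a jq program for schema inference. It can be used to understand the structure of a large JSON object or an array of JSON entities, at least when there is some rhyme or reason behind it.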

Usage: if the JSON entity of interest is in the file input.json, then, assuming the program below is in schema.jq, run:

jq -f schema.jq input.json

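For a very large file, schema inference may be slow, but this approach is often still faster than some kind of iterative approach. For example, see the remarks following the example given below.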

Example

Here is an example using JSON=JEOPARDY_QUESTIONS1.json, a 54MB file (55554625 bytes) available from https://raw.githubusercontent.com/alicemaz/super_jeopardy/master/JEOPARDY_QUESTIONS1.json

$ time jq -c -f schema.jq $JSON
[
  {
    "air_date": "string",
    "answer": "string",
    "category": "string",
    "question": "string",
    "round": "string",
    "show_number": "string",
    "value": "string"
  }
]

real    0m12.868s
user    0m11.713s
sys     0m0.342s

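The u+s time is noteworthy because producing a synopsis of paths with the streaming parser (see synopsis.jq on this page) takes about two-thirds as much u+s time on the same machine. This is perhaps counterintuitive, given that the JSON file is an array of length 216,930.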

schema.jq

# Schema inference
# Version 0.1
# Author: pkoppstein at gmail dot com
# Requires: jq 1.4 or higher

# This module defines three filters:
#   typeof/0 returns the extended-type of its input;
#   typeUnion(a;b) returns the union of the two specified extended-type values;
#   schema/0 returns the typeUnion of the extended-type values of the entities
#    in the input array, if the input is an array,
#     otherwise it simply returns the "typeof" value of its input.

# Each extended type can be thought of as a set of JSON entities,
# e.g. "number" for the set of JSON numbers, and ["number"] for the
# set of JSON number-valued arrays including [].

# The extended-type values are always JSON entities.
# The possible values are:
# "null", "boolean", "string", "number";
# "scalar" for any combination of non-null scalars;
# [T] where T is an extended type;
# an object all of whose values are extended types;
# "JSON" signifying that no other extended-type value is applicable.

# The extended-type values are defined recursively:
# The extended-type of a scalar value is its JSON type.
# The extended-type of a non-empty array of values all of which have the
#      same JSON type, t, is [t], and similarly for ["scalar"], and ["JSON"].
# The extended-type of [] is ["null"], since that is the extended type of all arrays
#     which have no elements other than null.
# The extended-type of an object is an object with the same keys, but the
#     values of which are the extended-types of the corresponding values.

# typeUnion(a;b) returns the least extended-type value that subsumes both a and b.
# For example:
#  typeUnion("number"; "string") yields "scalar";
#  typeUnion({"a": "number"}; {"b": "string"}) yields {"a": "number", "b": "string"};
#  typeUnion("null"; t) yields t for any valid extended type, t.

def typeUnion(a;b):
  def scalarp: . == "boolean" or . == "string" or . == "number" or . == "scalar";
  a as $a | b as $b
  | if $a == $b then $a
    elif ($a | scalarp) and ($b | scalarp) then "scalar"
    elif $a == "JSON" or $b == "JSON" then "JSON"
    elif ($a|type) == "array" and ($b|type) == "array" then [ typeUnion($a[0]; $b[0]) ]
    elif ($a|type) == "object" and ($b|type) == "object" then
      ((($a|keys) + ($b|keys)) | unique) as $keys
      | reduce $keys[] as $key ( {} ; .[$key] = typeUnion( $a[$key]; $b[$key]) )
    elif $a == "null" or $a == null then $b
    elif $b == "null" or $b == null then $a
    else "JSON"
    end ;

def typeof:
  def typeofArray:
    if length == 0 then ["null"]
    else [reduce .[] as $item (null; typeUnion(.; $item|typeof))]
    end ;
  def typeofObject:
    reduce keys[] as $key (. ; .[$key] |= typeof) ;

  . as $in
  | type
  | if . == "string" or . == "number" or . == "null" or . == "boolean" then .
    elif . == "object" then $in | typeofObject
    else $in | typeofArray
    end ;

# Omit the outermost [] for an array
def schema:
  if type == "array" then reduce .[] as $x ("null";  typeUnion(.; $x|typeof))
  else typeof
  end ;



# Example top-level:
schema
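
As a cross-check of the rules described in the comments above, here is a hedged Python sketch of the same three filters. It is a port written for illustration, not part of the original answer; the names type_union, typeof and schema simply mirror the jq definitions, and Python's None stands in for jq's null.

```python
SCALARS = {"boolean", "string", "number", "scalar"}


def type_union(a, b):
    """Least extended type that subsumes both a and b (mirrors typeUnion(a;b))."""
    if a == b:
        return a
    if isinstance(a, str) and isinstance(b, str) and a in SCALARS and b in SCALARS:
        return "scalar"
    if a == "JSON" or b == "JSON":
        return "JSON"
    if isinstance(a, list) and isinstance(b, list):
        return [type_union(a[0], b[0])]
    if isinstance(a, dict) and isinstance(b, dict):
        # A key missing on one side is treated like null, as in the jq version.
        return {k: type_union(a.get(k), b.get(k)) for k in sorted(set(a) | set(b))}
    if a == "null" or a is None:
        return b
    if b == "null" or b is None:
        return a
    return "JSON"


def typeof(value):
    """Extended type of a parsed JSON value."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # test bool before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return {k: typeof(v) for k, v in value.items()}
    if not value:  # the empty array
        return ["null"]
    result = None
    for item in value:
        result = type_union(result, typeof(item))
    return [result]


def schema(value):
    """For an array, the union of its elements' types; otherwise typeof."""
    if isinstance(value, list):
        result = "null"
        for item in value:
            result = type_union(result, typeof(item))
        return result
    return typeof(value)


print(schema([{"x": 1}, {"x": "s", "y": None}]))  # {'x': 'scalar', 'y': 'null'}
```

The order of the branches matters: the "JSON" check comes before the array and object cases so that "JSON" absorbs everything, and the null cases come last so that null acts as the identity for any other type.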

Answer 2: (score: 0)

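Here is a filter that emits a stream of synopses of all the paths of length <= depth in the input entity. If depth is 0, the depth limit is ignored; if depth is negative, the code below selects only the paths of length exactly -depth.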

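The synopsis of a path [p1, p2, ...] is constructed by replacing each integer component with "[]" and prefixing every component with ".". So, for example, if i and j are integers, then [i, "keyname", j] would be presented as .[].keyname.[]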

Here is a sample of the output produced using jq -r:

.[]
.[].data
.[].data.children
.[].data.modhash
.[].kind

paths_synopsis/1

# If depth<0 then select paths of length equal to -depth    
def paths_synopsis(depth):
  [ paths
  | if depth > 0 then select(length <= depth)
    elif (depth < 0) then select(length == -depth)
    else . end
  | [.[]|if type=="number" then "[]" else . end]]
  | unique
  | .[]
  | "." + join(".")
  ;
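
The same construction can be sketched in Python for comparison. This is an illustrative port under stated assumptions, not the answer's code: the helper names are hypothetical, the sample input is made up to resemble the output shown earlier, and the results are sorted as plain strings, which may order some entries differently from jq's unique.

```python
def paths(value, prefix=()):
    """Yield the path (a tuple of keys/indices) to every nested value."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)
    else:
        return
    for key, child in items:
        path = prefix + (key,)
        yield path
        yield from paths(child, path)


def synopsis(path):
    """Replace integer components with "[]" and prefix every component with "."."""
    return "." + ".".join("[]" if isinstance(k, int) else k for k in path)


def paths_synopsis(value, depth=0):
    """depth > 0: paths of length <= depth; depth < 0: length == -depth; 0: no limit."""
    selected = set()
    for p in paths(value):
        if depth > 0 and len(p) > depth:
            continue
        if depth < 0 and len(p) != -depth:
            continue
        selected.add(synopsis(p))
    return sorted(selected)


# Hypothetical input shaped like the sample output above.
doc = [{"kind": "Listing", "data": {"modhash": "", "children": []}}]
for line in paths_synopsis(doc, 2):
    print(line)
# .[]
# .[].data
# .[].kind
```

Collapsing every array index to "[]" is what keeps the output short: a list of thousands of similar objects contributes each distinct key path only once.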

Very large JSON entities

jq has a streaming parser for very large JSON entities.

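The following filter is intended for use with jq's streaming parser (jq --stream) in a pipeline whose second component removes duplicate synopses, as in this example: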

jq --arg depth 0 -c --stream -f synopsis.jq input.json | sort -u

In the following formulation, the desired DEPTH limit must be specified on the command line. Specifying 0 means no limit.

synopsis.jq
# Usage: jq --arg depth DEPTH -c --stream -f synopsis.jq input.json | sort -u
# or:    jq --arg depth DEPTH -c --stream -f synopsis.jq input.json | jq -s -c unique[]
def synopsis(depth):
  select(length == 2)
  | .[0]
  | if depth > 0 then select(length <= depth)
    elif (depth < 0) then select(length == -depth)
    else . end
  | map( if type=="number" then [] else . end) ;

synopsis( $depth | if . then tonumber else 0 end )
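
To make the event handling concrete, here is a rough Python sketch of what the filter does with the two-element [path, leaf] events. It works on an already parsed value, so it does not save memory the way jq --stream does; leaf_events only imitates the leaf events (the one-element closing events are left out because the filter discards them anyway), and all names and the sample document are hypothetical.

```python
import json


def leaf_events(value, prefix=()):
    """Yield (path, leaf) pairs, imitating the two-element events of jq --stream.
    Scalars and empty containers count as leaves."""
    if isinstance(value, dict) and value:
        for key, child in value.items():
            yield from leaf_events(child, prefix + (key,))
    elif isinstance(value, list) and value:
        for index, child in enumerate(value):
            yield from leaf_events(child, prefix + (index,))
    else:
        yield prefix, value


def synopsis(path, depth=0):
    """Apply the depth rule, then replace integer components with []."""
    if depth > 0 and len(path) > depth:
        return None
    if depth < 0 and len(path) != -depth:
        return None
    return [[] if isinstance(k, int) else k for k in path]


def unique_synopses(value, depth=0):
    """Stand-in for: jq --stream -c -f synopsis.jq | sort -u"""
    seen = set()
    for path, _leaf in leaf_events(value):
        s = synopsis(path, depth)
        if s is not None:
            seen.add(json.dumps(s, separators=(",", ":")))
    return sorted(seen)


# Hypothetical document, for illustration only.
doc = {"credit": "NWS", "data": {"hazard": ["a", "b"], "hazardUrl": ["u"]}}
for line in unique_synopses(doc):
    print(line)
# ["credit"]
# ["data","hazard",[]]
# ["data","hazardUrl",[]]
```

Note that only leaf paths are seen here, so a positive depth keeps the leaves that sit at or above that depth rather than truncating deeper paths.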

Example

curl -Ss 'http://forecast.weather.gov/MapClick.php?FcstType=json&lat=39.56&lon=-104.85' |
  jq --arg depth 0 -c --stream -f synopsis.jq |
  sort -u | head -n 50

["creationDate"]
["creationDateLocal"]
["credit"]
["currentobservation","Altimeter"]
["currentobservation","Date"]
["currentobservation","Dewp"]
["currentobservation","Gust"]
["currentobservation","Relh"]
["currentobservation","SLP"]
["currentobservation","Temp"]
["currentobservation","Visibility"]
["currentobservation","Weather"]
["currentobservation","Weatherimage"]
["currentobservation","WindChill"]
["currentobservation","Windd"]
["currentobservation","Winds"]
["currentobservation","elev"]
["currentobservation","id"]
["currentobservation","latitude"]
["currentobservation","longitude"]
["currentobservation","name"]
["currentobservation","state"]
["currentobservation","timezone"]
["data","hazard",[]]
["data","hazardUrl",[]]