Jq:为每个对象收集JSON数据

时间:2018-05-22 15:04:10

标签: json csv jq data-extraction

我发现我的标题很不清楚,但我不知道如何更好地重写它,所以随时编辑它!

数据
我有以下(简化)JSON:

[
  {
    "genes_id": "eco:b0002",
    "entry_id": "b0002",
    "division": "CDS",
    "organism": "Escherichia coli K-12 MG1655",
    "organism_code": "eco",
    "organism_id": "T00007",
    "name": "thrA",
    "names": [
      "thrA"
    ],
    "definition": "(RefSeq) Bifunctional aspartokinase/homoserine dehydrogenase 1",
    "eclinks": [

    ],
    "orthologs": {
      "K12524": "bifunctional aspartokinase / homoserine dehydrogenase 1 [EC:2.7.2.4 1.1.1.3]"
    },
    "pathways": {
      "eco00260": "Glycine, serine and threonine metabolism",
      "eco00261": "Monobactam biosynthesis",
      "eco00270": "Cysteine and methionine metabolism",
      "eco00300": "Lysine biosynthesis",
      "eco01100": "Metabolic pathways",
      "eco01110": "Biosynthesis of secondary metabolites",
      "eco01120": "Microbial metabolism in diverse environments",
      "eco01130": "Biosynthesis of antibiotics",
      "eco01230": "Biosynthesis of amino acids"
    },
    "modules": {
      "eco_M00016": "Lysine biosynthesis, succinyl-DAP pathway, aspartate => lysine",
      "eco_M00017": "Methionine biosynthesis, apartate => homoserine => methionine",
      "eco_M00018": "Threonine biosynthesis, aspartate => homoserine => threonine"
    },
    "classes": [

    ],
    "position": "337..2799",
    "chromosome": null,
    "gbposition": "337..2799",
    "motifs": {
      "Pfam": [
        "Homoserine_dh",
        "AA_kinase",
        "NAD_binding_3",
        "ACT_7",
        "ACT",
        "Sacchrp_dh_NADP"
      ]
    },
    "dblinks": {
      "NCBI-GeneID": [
        "945803"
      ],
      "NCBI-ProteinID": [
        "NP_414543"
      ],
      "Pasteur": [
        "thrA"
      ],
      "RegulonDB": [
        "ECK120000987"
      ],
      "ECOCYC": [
        "EG10998"
      ],
      "ASAP": [
        "ABE-0000008"
      ],
      "UniProt": [
        "P00561"
      ]
    }
  },
  {
    "genes_id": "eco:b0003",
    "entry_id": "b0003",
    "division": "CDS",
    "organism": "Escherichia coli K-12 MG1655",
    "organism_code": "eco",
    "organism_id": "T00007",
    "name": "thrB",
    "names": [
      "thrB"
    ],
    "definition": "(RefSeq) homoserine kinase",
    "eclinks": [

    ],
    "orthologs": {
      "K00872": "homoserine kinase [EC:2.7.1.39]"
    },
    "pathways": {
      "eco00260": "Glycine, serine and threonine metabolism",
      "eco01100": "Metabolic pathways",
      "eco01110": "Biosynthesis of secondary metabolites",
      "eco01120": "Microbial metabolism in diverse environments",
      "eco01230": "Biosynthesis of amino acids"
    },
    "modules": {
      "eco_M00018": "Threonine biosynthesis, aspartate => homoserine => threonine"
    },
    "classes": [

    ],
    "position": "2801..3733",
    "chromosome": null,
    "gbposition": "2801..3733",
    "motifs": {
      "Pfam": [
        "GHMP_kinases_N",
        "GHMP_kinases_C"
      ]
    },
    "dblinks": {
      "NCBI-GeneID": [
        "947498"
      ],
      "NCBI-ProteinID": [
        "NP_414544"
      ],
      "Pasteur": [
        "thrB"
      ],
      "RegulonDB": [
        "ECK120000988"
      ],
      "ECOCYC": [
        "EG10999"
      ],
      "ASAP": [
        "ABE-0000010"
      ],
      "UniProt": [
        "P00547"
      ]
    }
  }
]

期望的输出
这是两个对象的数组。我对两个对象的genes_idpathways感兴趣,并希望获得包含以下内容的制表符分隔文件:

eco:b0002   eco00260    Glycine, serine and threonine metabolism
eco:b0002   eco00261    Monobactam biosynthesis
eco:b0002   eco00270    Cysteine and methionine metabolism
eco:b0002   eco00300    Lysine biosynthesis
eco:b0002   eco01100    Metabolic pathways
eco:b0002   eco01110    Biosynthesis of secondary metabolites
eco:b0002   eco01120    Microbial metabolism in diverse environments
eco:b0002   eco01130    Biosynthesis of antibiotics
eco:b0002   eco01230    Biosynthesis of amino acids
eco:b0003   eco00260    Glycine, serine and threonine metabolism
eco:b0003   eco01100    Metabolic pathways
eco:b0003   eco01110    Biosynthesis of secondary metabolites
eco:b0003   eco01120    Microbial metabolism in diverse environments
eco:b0003   eco01230    Biosynthesis of amino acids

我发现了什么 我知道可以采用以下格式提取数据:

eco:b0002: list of pathways and ids
eco:b0003: list of pathways and ids

但是我想传播路径到各行,如上例所示。我找不到有关如何使用jq执行此操作的任何信息,因此怀疑这是否实际可行。因此,如果可能的话,如何使用Jq实现这一目标?

1 个答案:

答案 0 :(得分:1)

调用:jq -rf totsv.jq input.json

程序(totsv.jq):

.[]
| .genes_id as $id
| .pathways
| to_entries[]
| [$id, .key, .value]
| @tsv

TSV是一个不错的选择(和jq一样)!