Question

由于处理语料库的原因，我不得不创建多个JSON文件（使用GNRD http://gnrd.globalnames.org/进行科学名称提取）。我现在想要使用这些JSON文件来注释整个语料库。

我正在尝试在Python中合并多个JSON文件。每个JSON文件的内容都是仅科学名称（键）和找到的名称（值）的数组。下面是一个较短文件的示例：

{  
  "file":"biodiversity_trophic_9.txt",
  "names":[  
    {  
      "scientificName":"Bufo"
    },
    {  
      "scientificName":"Eleutherodactylus talamancae"
    },
    {  
      "scientificName":"E. punctariolus"
    },
    {  
      "scientificName":"Norops lionotus"
    },
    {  
      "scientificName":"Centrolenella prosoblepon"
    },
    {  
      "scientificName":"Sibon annulatus"
    },
    {  
      "scientificName":"Colostethus flotator"
    },
    {  
      "scientificName":"C. inguinalis"
    },
    {  
      "scientificName":"Eleutherodactylus"
    },
    {  
      "scientificName":"Hyla columba"
    },
    {  
      "scientificName":"Bufo haematiticus"
    },
    {  
      "scientificName":"S. annulatus"
    },
    {  
      "scientificName":"Leptodeira septentrionalis"
    },
    {  
      "scientificName":"Imantodes cenchoa"
    },
    {  
      "scientificName":"Oxybelis brevirostris"
    },
    {  
      "scientificName":"Cressa"
    },
    {  
      "scientificName":"Coloma"
    },
    {  
      "scientificName":"Perlidae"
    },
    {  
      "scientificName":"Hydropsychidae"
    },
    {  
      "scientificName":"Hyla"
    },
    {  
      "scientificName":"Norops"
    },
    {  
      "scientificName":"Hyla colymbiphyllum"
    },
    {  
      "scientificName":"Colostethus inguinalis"
    },
    {  
      "scientificName":"Oxybelis"
    },
    {  
      "scientificName":"Rana warszewitschii"
    },
    {  
      "scientificName":"R. warszewitschii"
    },
    {  
      "scientificName":"Rhyacophilidae"
    },
    {  
      "scientificName":"Daphnia magna"
    },
    {  
      "scientificName":"Hyla colymba"
    },
    {  
      "scientificName":"Centrolenella"
    },
    {  
      "scientificName":"Orconectes nais"
    },
    {  
      "scientificName":"Orconectes neglectus"
    },
    {  
      "scientificName":"Campostoma anomalum"
    },
    {  
      "scientificName":"Caridina"
    },
    {  
      "scientificName":"Decapoda"
    },
    {  
      "scientificName":"Atyidae"
    },
    {  
      "scientificName":"Cerastoderma edule"
    },
    {  
      "scientificName":"Rana aurora"
    },
    {  
      "scientificName":"Riffle"
    },
    {  
      "scientificName":"Calopterygidae"
    },
    {  
      "scientificName":"Elmidae"
    },
    {  
      "scientificName":"Gyrinidae"
    },
    {  
      "scientificName":"Gerridae"
    },
    {  
      "scientificName":"Naucoridae"
    },
    {  
      "scientificName":"Oligochaeta"
    },
    {  
      "scientificName":"Veliidae"
    },
    {  
      "scientificName":"Libellulidae"
    },
    {  
      "scientificName":"Philopotamidae"
    },
    {  
      "scientificName":"Ephemeroptera"
    },
    {  
      "scientificName":"Psephenidae"
    },
    {  
      "scientificName":"Baetidae"
    },
    {  
      "scientificName":"Corduliidae"
    },
    {  
      "scientificName":"Zygoptera"
    },
    {  
      "scientificName":"B. buto"
    },
    {  
      "scientificName":"C. euknemos"
    },
    {  
      "scientificName":"C. ilex"
    },
    {  
      "scientificName":"E. padi noblei"
    },
    {  
      "scientificName":"E. padi"
    },
    {  
      "scientificName":"E. bufo"
    },
    {  
      "scientificName":"E. butoni"
    },
    {  
      "scientificName":"E. crassi"
    },
    {  
      "scientificName":"E. cruentus"
    },
    {  
      "scientificName":"H. colymbiphyllum"
    },
    {  
      "scientificName":"N. aterina"
    },
    {  
      "scientificName":"S. ilex"
    },
    {  
      "scientificName":"Anisoptera"
    },
    {  
      "scientificName":"Riffle delta"
    }
  ],
  "total":67,
  "status":200,
  "unique":true,
  "engines":[  
    "TaxonFinder",
    "NetiNeti"
  ],
  "verbatim":false,
  "input_url":null,
  "token_url":"http://gnrd.globalnames.org/name_finder.html?token=2rtc4e70st",
  "parameters":{  
    "engine":0,
    "return_content":false,
    "best_match_only":false,
    "data_source_ids":[  

    ],
    "detect_language":true,
    "all_data_sources":false,
    "preferred_data_sources":[  

    ]
  },
  "execution_time":{  
    "total_duration":3.1727607250213623,
    "find_names_duration":1.9656541347503662,
    "text_preparation_duration":1.000107765197754
  },
  "english_detected":true
}

我遇到的问题是文件中可能存在重复项，我想要删除（否则我可以将文件连接起来）。我看到的查询是指合并额外的键和值来扩展数组本身。

有人可以就如何克服这个问题给我指导吗？

Answer 1

如果我理解正确，你想得到所有＆＃34; scientificNames＆＃34; ＆＃34;名称＆＃34;中的值一批文件的元素。如果我错了，你应该给出一个预期的输出，以便更容易理解。

我做了类似的事情：

file_path = 'file.ent.gz' # path to current file

file = gzip.open(file_path, 'rb') 
content = file.read() # its content

write_path = 'temp.ent' # let's write it back but as ordinary file
write_file = open(write_path, 'w')
write_file.write(content)
write_file.close()

third_file = open(write_path, 'r') #open this ordinary noncompressed file

structure = parser.get_structure('', third_file) #and finally put it into the parser

如果您想获得与初始文件类似的格式，请添加：

all_names = set() # use a set to avoid duplicates

# put all your files in there
for filename in ('file1.json', 'file2.json', ....):
    try:
        with open(filename, 'rt') as finput:
            data = json.load(finput)
        for name in data.get('names'):
            all_names.add(name.get('scientificName')
    except Exception as exc:
        print("Skipped file {} because exception {}".format(filename, str(exc))

print(all_names)

如何在Python中合并多个JSON文件

1 个答案: