由于处理语料库的原因,我不得不创建多个JSON文件(使用GNRD http://gnrd.globalnames.org/进行科学名称提取)。我现在想要使用这些JSON文件来注释整个语料库。
我正在尝试在Python中合并多个JSON文件。每个JSON文件的内容都是仅科学名称(键)和找到的名称(值)的数组。下面是一个较短文件的示例:
{
"file":"biodiversity_trophic_9.txt",
"names":[
{
"scientificName":"Bufo"
},
{
"scientificName":"Eleutherodactylus talamancae"
},
{
"scientificName":"E. punctariolus"
},
{
"scientificName":"Norops lionotus"
},
{
"scientificName":"Centrolenella prosoblepon"
},
{
"scientificName":"Sibon annulatus"
},
{
"scientificName":"Colostethus flotator"
},
{
"scientificName":"C. inguinalis"
},
{
"scientificName":"Eleutherodactylus"
},
{
"scientificName":"Hyla columba"
},
{
"scientificName":"Bufo haematiticus"
},
{
"scientificName":"S. annulatus"
},
{
"scientificName":"Leptodeira septentrionalis"
},
{
"scientificName":"Imantodes cenchoa"
},
{
"scientificName":"Oxybelis brevirostris"
},
{
"scientificName":"Cressa"
},
{
"scientificName":"Coloma"
},
{
"scientificName":"Perlidae"
},
{
"scientificName":"Hydropsychidae"
},
{
"scientificName":"Hyla"
},
{
"scientificName":"Norops"
},
{
"scientificName":"Hyla colymbiphyllum"
},
{
"scientificName":"Colostethus inguinalis"
},
{
"scientificName":"Oxybelis"
},
{
"scientificName":"Rana warszewitschii"
},
{
"scientificName":"R. warszewitschii"
},
{
"scientificName":"Rhyacophilidae"
},
{
"scientificName":"Daphnia magna"
},
{
"scientificName":"Hyla colymba"
},
{
"scientificName":"Centrolenella"
},
{
"scientificName":"Orconectes nais"
},
{
"scientificName":"Orconectes neglectus"
},
{
"scientificName":"Campostoma anomalum"
},
{
"scientificName":"Caridina"
},
{
"scientificName":"Decapoda"
},
{
"scientificName":"Atyidae"
},
{
"scientificName":"Cerastoderma edule"
},
{
"scientificName":"Rana aurora"
},
{
"scientificName":"Riffle"
},
{
"scientificName":"Calopterygidae"
},
{
"scientificName":"Elmidae"
},
{
"scientificName":"Gyrinidae"
},
{
"scientificName":"Gerridae"
},
{
"scientificName":"Naucoridae"
},
{
"scientificName":"Oligochaeta"
},
{
"scientificName":"Veliidae"
},
{
"scientificName":"Libellulidae"
},
{
"scientificName":"Philopotamidae"
},
{
"scientificName":"Ephemeroptera"
},
{
"scientificName":"Psephenidae"
},
{
"scientificName":"Baetidae"
},
{
"scientificName":"Corduliidae"
},
{
"scientificName":"Zygoptera"
},
{
"scientificName":"B. buto"
},
{
"scientificName":"C. euknemos"
},
{
"scientificName":"C. ilex"
},
{
"scientificName":"E. padi noblei"
},
{
"scientificName":"E. padi"
},
{
"scientificName":"E. bufo"
},
{
"scientificName":"E. butoni"
},
{
"scientificName":"E. crassi"
},
{
"scientificName":"E. cruentus"
},
{
"scientificName":"H. colymbiphyllum"
},
{
"scientificName":"N. aterina"
},
{
"scientificName":"S. ilex"
},
{
"scientificName":"Anisoptera"
},
{
"scientificName":"Riffle delta"
}
],
"total":67,
"status":200,
"unique":true,
"engines":[
"TaxonFinder",
"NetiNeti"
],
"verbatim":false,
"input_url":null,
"token_url":"http://gnrd.globalnames.org/name_finder.html?token=2rtc4e70st",
"parameters":{
"engine":0,
"return_content":false,
"best_match_only":false,
"data_source_ids":[
],
"detect_language":true,
"all_data_sources":false,
"preferred_data_sources":[
]
},
"execution_time":{
"total_duration":3.1727607250213623,
"find_names_duration":1.9656541347503662,
"text_preparation_duration":1.000107765197754
},
"english_detected":true
}
我遇到的问题是文件中可能存在重复项,我想要删除(否则我可以将文件连接起来)。我看到的查询是指合并额外的键和值来扩展数组本身。
有人可以就如何克服这个问题给我指导吗?
答案 0 :(得分:1)
如果我理解正确,你想得到所有" scientificNames" "名称"中的值一批文件的元素。如果我错了,你应该给出一个预期的输出,以便更容易理解。
我做了类似的事情:
file_path = 'file.ent.gz' # path to current file
file = gzip.open(file_path, 'rb')
content = file.read() # its content
write_path = 'temp.ent' # let's write it back but as ordinary file
write_file = open(write_path, 'w')
write_file.write(content)
write_file.close()
third_file = open(write_path, 'r') #open this ordinary noncompressed file
structure = parser.get_structure('', third_file) #and finally put it into the parser
如果您想获得与初始文件类似的格式,请添加:
all_names = set() # use a set to avoid duplicates
# put all your files in there
for filename in ('file1.json', 'file2.json', ....):
try:
with open(filename, 'rt') as finput:
data = json.load(finput)
for name in data.get('names'):
all_names.add(name.get('scientificName')
except Exception as exc:
print("Skipped file {} because exception {}".format(filename, str(exc))
print(all_names)