I saved the output of an Elasticsearch query to a file. The first few lines look like this:
{"took": 1,
"timed_out": false,
"_shards": {},
"hits": {
"total": 27,
"max_score": 6.5157733,
"hits": [
{
"_index": "dbgap_062617",
"_type": "dataset",
***"_id": "595189d15152c64c3b0adf16"***,
"_score": 6.5157733,
"_source": {
"dataAcquisition": {
"performedBy": "\n\t\tT\n\t\t"
},
"provenance": {
"ingestTime": "201",
},
"studyGroup": [
{
"Identifier": "1",
"name": "Diseas"
}
],
"license": {
"downloadURL": "http",
},
"study": {
"alternateIdentifiers": "yes",
},
"disease": {
"name": [
"Coronary Artery Disease"
]
},
"NLP_Fields": {
"CellLine": [],
"MeshID": [
"C0066533",
],
"DiseaseID": [
"C0010068"
],
"ChemicalID": [],
"Disease": [
"coronary artery disease"
],
"Chemical": [],
"Meshterm": [
"migen",
]
},
"datasetDistributions": [
{
"dateReleased": "20150312",
}
],
"dataset": {
"citations": [
"20032323"
],
**"description": "The Precoc.",**
**"title": "MIGen_ExS: PROCARDIS"**
},
.... and the list goes on with a bunch of other items ....
Of all these nodes, I am only interested in the unique _ids, the title, and the description. So I created a dictionary and used the json module to extract the parts I am interested in. Here is my code:
import json

s = {}
# NOTE: this assumes the input file holds one complete JSON document per line;
# a pretty-printed document spanning several lines would need json.load() instead
with open('localfile', 'r') as ready:
    for line in ready:
        test = json.loads(line)
        for hit in test['hits']['hits']:
            # setdefault only inserts an _id the first time it is seen,
            # so the dictionary keys are unique by construction
            s.setdefault(hit['_id'], [hit['_source']['dataset']['description'],
                                      hit['_source']['dataset']['title']])

with open('local file', 'w') as d:
    for k, v in s.items():
        d.write(k + '\t' + v[0] + '\t' + v[1] + '\n')
Now, when I run it, it gives me a file with duplicate _ids! Isn't a dictionary supposed to give me unique _ids? My original output file has a lot of duplicate IDs, and I want to get rid of them. Also, I ran set() on just the _ids to get their unique count, and it came to 138. But if I instead remove the duplicate IDs with the dictionary, it drops to 17! Can someone tell me why that happens?
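Just to check my understanding, here is a toy sketch of how I expected setdefault to treat a repeated key (made-up data, not my real file):

s = {}
s.setdefault('id1', ['first description', 'first title'])
s.setdefault('id1', ['second description', 'second title'])  # ignored: key already exists
print(len(s))  # 1 -- a dict can only hold each key once
print(s)       # {'id1': ['first description', 'first title']}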
Answer 0 (score: 0)
If you want a unique ID and you are working with a database, it will create one for you. If you are not, you need to generate a unique number or string yourself. Depending on how the dictionary is created, you can use a timestamp from when the dictionary was created, or you can use uuid.uuid4(). For more information on uuid, here are the docs.
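For illustration, a minimal sketch of both options (the variable names are my own):

import time
import uuid

# uuid4() is a random 128-bit UUID; collisions are practically impossible
unique_id = str(uuid.uuid4())
print(unique_id)    # e.g. '5f8b1a52-...' (different on every run)

# a timestamp-based ID, as mentioned above; only safe if you never
# generate two IDs within the same clock tick
timestamp_id = str(time.time())
print(timestamp_id)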