A dictionary in Python is not giving me unique IDs

Time: 2017-10-04 15:04:22

Tags: python

I saved the output of an Elasticsearch query to a file. The first few lines look like this:

{"took": 1,
   "timed_out": false,
   "_shards": {},
   "hits": {
      "total": 27,
      "max_score": 6.5157733,
      "hits": [
         {
            "_index": "dbgap_062617",
            "_type": "dataset",
            "_id": "595189d15152c64c3b0adf16",
            "_score": 6.5157733,
            "_source": {
               "dataAcquisition": {
                  "performedBy": "\n\t\tT\n\t\t"
               },
               "provenance": {
                  "ingestTime": "201",                     
               },
               "studyGroup": [
                  {
                     "Identifier": "1",
                     "name": "Diseas"
                  }
               ],
               "license": {
                  "downloadURL": "http",                      
               },
               "study": {
                  "alternateIdentifiers": "yes",
                },
               "disease": {
                  "name": [
                     "Coronary Artery Disease"
                  ]
               },
               "NLP_Fields": {
                  "CellLine": [],
                  "MeshID": [
                     "C0066533",                        
                  ],
                  "DiseaseID": [
                     "C0010068"
                  ],
                  "ChemicalID": [],
                  "Disease": [
                     "coronary artery disease"
                  ],
                  "Chemical": [],

                  "Meshterm": [
                     "migen",                        
                  ]
               },
               "datasetDistributions": [
                  {
                     "dateReleased": "20150312",                        
                  }
               ],
               "dataset": {
                  "citations": [
                     "20032323"
                  ],
                  "description": "The Precoc.",
                  "title": "MIGen_ExS: PROCARDIS"
               },
               .... and the list goes on with a bunch of other items ....

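Of all these nodes, I am interested in the unique _id values, the title, and the description. So I created a dictionary and used json to extract the parts I care about. Here is my code: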

import json

s = {}
d = open('local file', 'w')
with open('localfile', 'r') as ready:
    for line in ready:
        # each input line is parsed as one JSON response
        test = json.loads(line)
        for i in test['hits']['hits']:
            for x in i:
                # setdefault stores the value only the first time an _id is seen
                s.setdefault(i['_id'], [i['_source']['dataset']['description'],
                                        i['_source']['dataset']['title']])
        # write every entry collected so far (this runs once per input line)
        for k, v in s.items():
            d.write(k + '\t' + v[0] + '\t' + v[1] + '\n')
d.close()

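Now, when I run it, it gives me a file with duplicate _ids! Isn't a dictionary supposed to give me unique _ids? My original output file had a lot of duplicate IDs, and I want to get rid of them. Also, when I ran set() on just the _ids to count the unique ones, I got 138. But if I remove the duplicate IDs from the file generated with the dictionary, the count drops to 17! Can someone tell me why this is happening?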
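One possible explanation, offered as a sketch rather than a confirmed diagnosis: the dictionary keys themselves are unique, but if the loop that writes to the file runs inside the loop over input lines (as the indentation of the code above suggests), the whole dictionary is written out again after every parsed line, which would repeat _ids in the output. The version below separates collecting from writing. The function names `collect_unique` and `write_rows` and the two-line sample input are hypothetical, and it assumes each input line holds one complete JSON response.

```python
import io
import json


def collect_unique(lines):
    """Parse one JSON response per line and keep the first entry seen per _id."""
    seen = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        response = json.loads(line)
        for hit in response["hits"]["hits"]:
            dataset = hit["_source"]["dataset"]
            # setdefault keeps the first value stored for a key and ignores later ones
            seen.setdefault(hit["_id"], (dataset["description"], dataset["title"]))
    return seen


def write_rows(seen, out):
    """Write each _id exactly once, after all input has been read."""
    for _id, (description, title) in seen.items():
        out.write(f"{_id}\t{description}\t{title}\n")


# Hypothetical two-line input in which the _id "a" appears in both responses.
sample_lines = [
    json.dumps({"hits": {"hits": [
        {"_id": "a", "_source": {"dataset": {"description": "d1", "title": "t1"}}},
        {"_id": "b", "_source": {"dataset": {"description": "d2", "title": "t2"}}},
    ]}}),
    json.dumps({"hits": {"hits": [
        {"_id": "a", "_source": {"dataset": {"description": "d1", "title": "t1"}}},
    ]}}),
]

unique = collect_unique(sample_lines)
buffer = io.StringIO()
write_rows(unique, buffer)
output_rows = buffer.getvalue().splitlines()
```

Because the write happens once, after the dictionary is complete, `output_rows` contains one row per distinct _id even though "a" occurs twice in the input.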

1 Answer:

Answer 0: (score: 0)

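If you want a unique ID and you are using a database, the database will create one for you. If you are not, you need to generate a unique number or string yourself. Depending on how the dictionary is created, you could use the timestamp from when the dictionary was created, or you could use uuid.uuid4(). For more information about uuid, see the Python documentation for the uuid module.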
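As a minimal sketch of the uuid.uuid4() suggestion: the helper name `make_unique_id` and the sample titles are hypothetical and only illustrate how generated IDs could serve as dictionary keys.

```python
import uuid


def make_unique_id():
    """Return a random UUID (version 4) as a string."""
    return str(uuid.uuid4())


# Hypothetical records keyed by freshly generated IDs.
records = {make_unique_id(): title for title in ("first", "second", "third")}
```

Generated IDs identify new records; they do not by themselves detect that two records already share the same existing _id.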