I saved the output of an Elasticsearch query to a file. The first few lines look like this:
{"took": 1,
"timed_out": false,
"_shards": {},
"hits": {
"total": 27,
"max_score": 6.5157733,
"hits": [
{
"_index": "dbgap_062617",
"_type": "dataset",
***"_id": "595189d15152c64c3b0adf16"***,
"_score": 6.5157733,
"_source": {
"dataAcquisition": {
"performedBy": "\n\t\tT\n\t\t"
},
"provenance": {
"ingestTime": "201",
},
"studyGroup": [
{
"Identifier": "1",
"name": "Diseas"
}
],
"license": {
"downloadURL": "http",
},
"study": {
"alternateIdentifiers": "yes",
},
"disease": {
"name": [
"Coronary Artery Disease"
]
},
"NLP_Fields": {
"CellLine": [],
"MeshID": [
"C0066533",
],
"DiseaseID": [
"C0010068"
],
"ChemicalID": [],
"Disease": [
"coronary artery disease"
],
"Chemical": [],
"Meshterm": [
"migen",
]
},
"datasetDistributions": [
{
"dateReleased": "20150312",
}
],
"dataset": {
"citations": [
"20032323"
],
**"description": "The Precoc.",**
**"title": "MIGen_ExS: PROCARDIS"**
},
.... and the list goes on with a bunch of other items ....
Of all these nodes, I am only interested in the unique _ids, the title, and the description. So I created a dictionary and used the json module to extract the parts I am interested in. Here is my code:
import json

s = {}
# NOTE: this assumes the input file holds one complete JSON document per line;
# a pretty-printed document spanning several lines would need json.load() instead
with open('localfile', 'r') as ready:
    for line in ready:
        test = json.loads(line)
        for hit in test['hits']['hits']:
            # setdefault only inserts an _id the first time it is seen,
            # so the dictionary keys are unique by construction
            s.setdefault(hit['_id'], [hit['_source']['dataset']['description'],
                                      hit['_source']['dataset']['title']])

with open('local file', 'w') as d:
    for k, v in s.items():
        d.write(k + '\t' + v[0] + '\t' + v[1] + '\n')
Now, when I run it, it gives me a file with duplicate _ids! Isn't a dictionary supposed to give me unique _ids? My original output file has a lot of duplicate IDs, and I want to get rid of them. Also, I ran set() on just the _ids to get their unique count, and it came to 138. But if I instead remove the duplicate IDs with the dictionary, it drops to 17! Can someone tell me why that happens?
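Just to check my understanding, here is a toy sketch of how I expected setdefault to treat a repeated key (made-up data, not my real file):

s = {}
s.setdefault('id1', ['first description', 'first title'])
s.setdefault('id1', ['second description', 'second title'])  # ignored: key already exists
print(len(s))  # 1 -- a dict can only hold each key once
print(s)       # {'id1': ['first description', 'first title']}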
Answer 0 (score: 0)
If you want a unique ID and you are working with a database, it will create one for you. If you are not, you need to generate a unique number or string yourself. Depending on how the dictionary is created, you can use a timestamp from when the dictionary was created, or you can use uuid.uuid4(). For more information on uuid, here are the docs.
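For illustration, a minimal sketch of both options (the variable names are my own):

import time
import uuid

# uuid4() is a random 128-bit UUID; collisions are practically impossible
unique_id = str(uuid.uuid4())
print(unique_id)    # e.g. '5f8b1a52-...' (different on every run)

# a timestamp-based ID, as mentioned above; only safe if you never
# generate two IDs within the same clock tick
timestamp_id = str(time.time())
print(timestamp_id)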