Python: removing duplicate values from a set

Date: 2016-01-18 11:35:40

Tags: python list dictionary map-function

I have a set that looks like this:

my_set  = {
  [
      {
         "sample_id": "read1", 
         "seg_1": None, 
         "lukM-F": "D", 
         "23s_SA": None, 
         "see": None, 
         "sed": "ND"
      }, 
      {
         "sample_id": "read2", 
         "seg_1": None, 
         "lukM-F": "ND", 
         "23s_SA": None, 
         "see": "D", 
         "sed": "ND"
      }, 
      {
         "sample_id": "read3", 
         "seg_1": None, 
         "lukM-F": "D", 
         "23s_SA": None, 
         "see": "ND", 
         "sed": "None"
      }
  ]
}

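I want to remove the keys whose value is None, the whole entry. For example, if None is the value of the key "seg_1" in every sample_id (read1 AND read2 AND read3), then remove the key entirely. If only one sample has None for "seg_1", say read1, and the other two sample_ids are not None, then keep "seg_1" and its values. So I want to end up with the following: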

my_set  = {
  [
      {
         "sample_id": "read1",  
         "lukM-F": "D", 
         "see": None, 
         "sed": "ND"
      }, 
      {
         "sample_id": "read2", 
         "lukM-F": "ND", 
         "see": "D", 
         "sed": "ND"
      }, 
      {
         "sample_id": "read3", 
         "lukM-F": "D", 
         "see": "ND", 
         "sed": "None"
      }
  ]
}

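Note that seg_1 and 23s_SA have now been removed, because their value is None across all sample_ids.

I spent a long time trying to do this without success. In the end I converted the set to a dict and then to lists, and then iterated over all the lists to remove every item that is None in all of them.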

number_of_samples = len(my_set)
each_sample_list = [[] for i in range(0, number_of_samples)]

n = 0

for data_in_dict in my_set:
  for k,val in data_in_dict.items():
    each_sample_list[n].append([k,val])
  if n == number_of_samples:
    break
  else:
    print each_sample_list[n]
    n += 1

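I thought about using itertools izip to iterate over multiple lists, but I'm not sure whether that would work. Any help is much appreciated.

Thanks

3 answers:

Answer 0 (score: 3)

You can build a counter and then remove all the keys that need to go: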

import collections
import itertools

source = [ 
  {
     "sample_id": "read1", 
     "seg_1": None, 
     "lukM-F": "D", 
     "23s_SA": None, 
     "see": None, 
     "sed": "ND"
  }, 
  {
     "sample_id": "read2", 
     "seg_1": None, 
     "lukM-F": "ND", 
     "23s_SA": None, 
     "see": "D", 
     "sed": "ND"
  }, 
  {
     "sample_id": "read3", 
     "seg_1": None, 
     "lukM-F": "D", 
     "23s_SA": None, 
     "see": "ND", 
     "sed": "None"
  }
]

size = len(source)

# for python2 you should use iteritems() method
iterators_chain = itertools.chain(*[x.items() for x in source])
counter = collections.Counter(iterators_chain)

for (key, val), count in counter.items():
    if count == size and val is None:
        for x in source:
            x.pop(key)
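
A variant of the same counting idea, offered as a sketch rather than as this answer's own code: counting only the keys whose value is None avoids putting the values themselves into the Counter, which would fail if any value were unhashable (a list, for instance). The helper name `drop_all_none_keys` and the shortened sample data are hypothetical.

```python
import collections


def drop_all_none_keys(records):
    """Remove, in place, every key whose value is None in all records."""
    # For each key, count how many records hold None for it.
    none_counts = collections.Counter(
        key
        for record in records
        for key, value in record.items()
        if value is None
    )
    # A key is dropped only if every record had None for it.
    doomed = [key for key, count in none_counts.items() if count == len(records)]
    for record in records:
        for key in doomed:
            record.pop(key, None)
    return records


sample = [
    {"sample_id": "read1", "seg_1": None, "see": None},
    {"sample_id": "read2", "seg_1": None, "see": "D"},
]
drop_all_none_keys(sample)
print(sample)
```

Here "seg_1" is None in both records and is removed, while "see" is kept because read2 has a real value for it.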

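Answer 1 (score: 2)

Your my_set isn't a valid set, because set items must be hashable and lists aren't hashable. But anyway...

Here's a way to do it that doesn't require any imports. It uses sets to determine which keys to keep.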

my_stuff = [
    {
        "sample_id": "read1",
        "seg_1": None,
        "lukM-F": "D",
        "23s_SA": None,
        "see": None,
        "sed": "ND"
    },
    {
        "sample_id": "read2",
        "seg_1": None,
        "lukM-F": "ND",
        "23s_SA": None,
        "see": "D",
        "sed": "ND"
    },
    {
        "sample_id": "read3",
        "seg_1": None,
        "lukM-F": "D",
        "23s_SA": None,
        "see": "ND",
        "sed": None
    }
]

allkeys = set(k for d in my_stuff for k in d)
goodkeys = set(k for k in allkeys if any(d.get(k) for d in my_stuff))
badkeys = allkeys - goodkeys

for d in my_stuff:
    for k in badkeys:
        del d[k]

for d in my_stuff:
    print(d)

Output

{'lukM-F': 'D', 'see': None, 'sed': 'ND', 'sample_id': 'read1'}
{'lukM-F': 'ND', 'see': 'D', 'sed': 'ND', 'sample_id': 'read2'}
{'lukM-F': 'D', 'see': 'ND', 'sed': None, 'sample_id': 'read3'}

The set(...) construct can be replaced by set comprehensions in modern versions of Python, but I'm using Python 2.6.6 on this ancient machine.
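
For readers on Python 2.7 or 3, the same logic might look like this with set comprehensions. This sketch makes two changes that are mine, not the answer's: it tests `is not None` explicitly, since a plain truthiness test would also treat values such as 0 or an empty string as missing, and it builds new dicts instead of deleting keys in place. The function name `strip_all_none` and the sample rows are hypothetical.

```python
def strip_all_none(records):
    """Return new dicts without the keys that are None in every record."""
    all_keys = {k for d in records for k in d}
    # Keep a key if at least one record has a value other than None for it.
    good_keys = {
        k for k in all_keys
        if any(d.get(k) is not None for d in records)
    }
    return [{k: v for k, v in d.items() if k in good_keys} for d in records]


rows = [
    {"sample_id": "read1", "seg_1": None, "count": 0},
    {"sample_id": "read2", "seg_1": None, "count": None},
]
cleaned = strip_all_none(rows)
print(cleaned)
```

With the explicit test, "count" survives because read1 holds 0, which is falsy but not None; "seg_1" is dropped.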

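Another way to build the allkeys set is

allkeys = set()
for d in my_stuff:
    allkeys.update(d.keys())

Although it's more code, it runs faster, because update processes the whole key collection of each dict at C speed, whereas the other approach has to loop over the keys at Python speed. Of course, if you can guarantee that every dict in the list always has the same set of keys, this can be optimized further.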
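
If you want to check the speed claim on your own machine, a rough timing sketch could look like the following. The function names and the synthetic data (1000 dicts with 50 keys each) are assumptions made for illustration, and the timings will vary with the Python version and the data shape.

```python
import timeit


def build_with_genexp(records):
    # Loops over every key at Python speed.
    return set(k for d in records for k in d)


def build_with_update(records):
    # Hands each dict's keys to set.update in one call.
    keys = set()
    for d in records:
        keys.update(d.keys())
    return keys


# Hypothetical test data: 1000 dicts, each with the same 50 keys.
data = [dict(("key%d" % i, None) for i in range(50)) for _ in range(1000)]

for fn in (build_with_genexp, build_with_update):
    seconds = timeit.timeit(lambda: fn(data), number=20)
    print("%s: %.4f s" % (fn.__name__, seconds))
```

Both functions return the same set, so only the elapsed time differs.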

Answer 2 (score: 2)

This makes use of the fact that a key must be None in every dict in the list:
bkeys = [k for k, v in next(iter(my_stuff), {}).items() if v is None]

bkeys = [k for k in bkeys if all(d[k] is None for d in my_stuff)]

my_stuff = [{k: v for k, v in d.items() if k not in bkeys} for d in my_stuff]

Printed output of my_stuff:

{'see': None, 'sed': 'ND', 'lukM-F': 'D', 'sample_id': 'read1'}
{'see': 'D', 'sed': 'ND', 'lukM-F': 'ND', 'sample_id': 'read2'}
{'see': 'ND', 'sed': None, 'lukM-F': 'D', 'sample_id': 'read3'}

Without dict comprehensions, just change the last line to:

my_stuff = [dict(((k, v) for k, v in d.items() if k not in bkeys)) for d in my_stuff]

Edited to consider only the None keys of the first item (if any).