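Subset a list of dicts by keeping a maximum number of duplicates

Time: 2016-08-03 16:16:49

Tags: python-2.7

I need to subset a list of dicts with a condition on duplicate key values. For example, with max_duplicates = 2 on the key 'main' and the following list: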

[
 {'main': 1, 'more': 1},
 {'main': 1, 'more': 2},
 {'main': 1, 'more': 3},
 {'main': 2, 'more': 1},
 {'main': 2, 'more': 1},
 {'main': 2, 'more': 3},
 {'main': 3, 'more': 1}
]

I want to get:

[
 {'main': 1, 'more': 1},
 {'main': 1, 'more': 2},
 {'main': 2, 'more': 1},
 {'main': 2, 'more': 1},
 {'main': 3, 'more': 1}
]

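The elements selected for a given 'main' value can be random, and every dict will always have the same keys.

I'm looking for the most optimized solution. This is my code for now: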

from collections import Counter
import numpy


def remove_duplicates(initial_list, max_duplicates):
    main_counts = Counter([elem["main"] for elem in initial_list])
    main_values_for_selection = set([main_value for main_value, count in main_counts.iteritems()
                                     if count > max_duplicates])
    result = [elem for elem in initial_list
              if elem["main"] not in main_values_for_selection]

    for main_value in main_values_for_selection:
        all_indexes = [index for index, elem in enumerate(initial_list)
                       if elem["main"] == main_value]
        indexes = numpy.random.choice(a=all_indexes, size=max_duplicates, replace=False)
        result += [initial_list[i] for i in indexes]
    return result

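Thanks in advance for your help ;-)

1 answer:

Answer 0 (score: 0):

This approach always keeps the first 2 (or max_dups) entries it sees for a given key rather than a random selection, but I think it is quite efficient: it makes a single pass over the list and needs only a couple of temporary variables: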

from collections import defaultdict

def remove_duplicates(initials,max_dups):
    dup_tracker = defaultdict(int)
    rets = []
    for entry in initials:
        if dup_tracker[entry['main']] < max_dups:
            dup_tracker[entry['main']] += 1
            rets.append(entry)
    return rets

max_dups = 2
initials = [
 {'main': 1, 'more': 1},
 {'main': 1, 'more': 2},
 {'main': 1, 'more': 3},
 {'main': 2, 'more': 1},
 {'main': 2, 'more': 1},
 {'main': 2, 'more': 3},
 {'main': 3, 'more': 1}
]


rets = remove_duplicates(initials,max_dups)
print rets

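To explain the code: defaultdict(int) creates a dictionary in which every key, even one that has not been set yet, starts at the value 0. We then loop over the list and keep track in dup_tracker of how many entries we have kept for each key. dup_tracker is a dictionary keyed by the value of 'main', and each value is the count for that particular key. If the count for the current entry's key is still below max_dups, the count is incremented and the entry is appended to the rets output list, which is returned at the end.

Timing edit: It looks like the method I implemented is at least a few orders of magnitude faster than yours. I've included below all the code I used to time them.

TL;DR: Yours 35.721 seconds vs. mine 0.016 seconds when run on a list of 50,000 dicts with values of main ranging from 0 to 10,000.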
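As a small illustration of the defaultdict(int) behaviour described above, here is a minimal sketch. The key "never seen" and the list of sample values are made up for the example and are not part of the answer's code.

```python
from collections import defaultdict

dup_tracker = defaultdict(int)

# Reading a missing key does not raise KeyError: it calls int(),
# stores the resulting 0 under that key, and returns it.
first_lookup = dup_tracker["never seen"]

# Because of that, a counter can be incremented without first
# checking whether the key exists.
for main_value in [1, 1, 1, 2]:
    dup_tracker[main_value] += 1

print(first_lookup)                 # 0
print(dup_tracker[1])               # 3
print(dup_tracker[2])               # 1
print("never seen" in dup_tracker)  # True: the lookup above inserted the key
```

One side effect worth knowing: a plain lookup on a defaultdict inserts the key, so in remove_duplicates every distinct 'main' value ends up in dup_tracker even before it is incremented. That is harmless here because each looked-up key is counted right away.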

from collections import Counter
import random
import time

import numpy  # needed by remove_duplicates_1 (numpy.random.choice)


# Approach from the question: re-scans the whole list for every over-limit key.
def remove_duplicates_1(initial_list, max_duplicates):
    main_counts = Counter([elem["main"] for elem in initial_list])
    main_values_for_selection = set([main_value for main_value, count in main_counts.iteritems()
                                     if count > max_duplicates])
    result = [elem for elem in initial_list
              if elem["main"] not in main_values_for_selection]

    for main_value in main_values_for_selection:
        all_indexes = [index for index, elem in enumerate(initial_list)
                       if elem["main"] == main_value]
        indexes = numpy.random.choice(a=all_indexes, size=max_duplicates, replace=False)
        result += [initial_list[i] for i in indexes]
    return result


# Approach from this answer, written with a plain dict: a single pass over the list.
def remove_duplicates_2(initials,max_dups):
    dup_tracker = {}
    rets = []
    for entry in initials:
        if entry['main'] not in dup_tracker:
            dup_tracker[entry['main']] = 1
            rets.append(entry)
        elif dup_tracker[entry['main']] < max_dups:
            dup_tracker[entry['main']] += 1
            rets.append(entry)
    return rets

# Build num_total dicts whose 'main' values fall between 0 and max_main.
def generate_test_list(num_total,max_main):
    test_list = []
    for it in range(num_total):
        main_value = round(random.random()*max_main)
        test_list.append({'main':main_value, 'more':it})
    return test_list

max_duplicates = 2
test_list = generate_test_list(50000,10000)

start = time.time()
rets_1 = remove_duplicates_1(test_list,max_duplicates)
time_1 = time.time()-start

start = time.time()
rets_2 = remove_duplicates_2(test_list,max_duplicates)
time_2 = time.time()-start

print "Yours",time_1,"vs mine",time_2

#Results:
#Yours 35.7210621834 vs mine 0.0159771442413
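The gap is consistent with how the two functions are structured: the version from the question walks the entire list again for every 'main' value that exceeds the limit, while the single-pass version touches each dict once. If a random selection per key is still wanted, as the question allows, the following is a rough sketch of one way to get it without the repeated scans. The name remove_duplicates_random and the rng parameter are hypothetical and not part of either original function, and it uses the standard library's random.sample in place of numpy.random.choice. It is written to run on Python 3 and has not been benchmarked here.

```python
import random
from collections import defaultdict


def remove_duplicates_random(initial_list, max_duplicates, rng=random):
    """Keep at most max_duplicates randomly chosen dicts per 'main' value."""
    # Single pass: group the dicts by their 'main' value.
    groups = defaultdict(list)
    for elem in initial_list:
        groups[elem["main"]].append(elem)

    result = []
    for elems in groups.values():
        if len(elems) > max_duplicates:
            # Sample without replacement, similar in spirit to
            # numpy.random.choice(..., replace=False) in the question.
            elems = rng.sample(elems, max_duplicates)
        result.extend(elems)
    return result


if __name__ == "__main__":
    sample = [
        {'main': 1, 'more': 1},
        {'main': 1, 'more': 2},
        {'main': 1, 'more': 3},
        {'main': 2, 'more': 1},
        {'main': 2, 'more': 1},
        {'main': 2, 'more': 3},
        {'main': 3, 'more': 1},
    ]
    # Pass a seeded random.Random for reproducible selections.
    print(remove_duplicates_random(sample, 2, random.Random(0)))
```

Like the function in the question, this does not preserve the original order of the list: the result is grouped by 'main' value. If the order matters, the first-N approach above is the simpler choice.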