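I want to take a subset of a list of dicts, with a limit on how many entries may share the same value for a given key.
For example, with max_duplicates = 2 on the key 'main'
and the following list: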
[
{'main': 1, 'more': 1},
{'main': 1, 'more': 2},
{'main': 1, 'more': 3},
{'main': 2, 'more': 1},
{'main': 2, 'more': 1},
{'main': 2, 'more': 3},
{'main': 3, 'more': 1}
]
I would like to get:
[
{'main': 1, 'more': 1},
{'main': 1, 'more': 2},
{'main': 2, 'more': 1},
{'main': 2, 'more': 1},
{'main': 3, 'more': 1}
]
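The elements kept for a given key value can be chosen at random, and the dicts will always have the same keys.
I am looking for the most efficient solution. Here is my current code: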
from collections import Counter
import numpy

def remove_duplicates(initial_list, max_duplicates):
    # Count how many entries share each 'main' value.
    main_counts = Counter([elem["main"] for elem in initial_list])
    # 'main' values that occur more often than allowed (.items() runs on Python 2 and 3).
    main_values_for_selection = set([main_value for main_value, count in main_counts.items()
                                     if count > max_duplicates])
    # Entries whose 'main' value is within the limit are kept as they are.
    result = [elem for elem in initial_list
              if elem["main"] not in main_values_for_selection]
    # For each over-represented value, pick max_duplicates entries at random.
    for main_value in main_values_for_selection:
        all_indexes = [index for index, elem in enumerate(initial_list)
                       if elem["main"] == main_value]
        indexes = numpy.random.choice(a=all_indexes, size=max_duplicates, replace=False)
        result += [initial_list[i] for i in indexes]
    return result
Thanks in advance for your help ;-)
Answer 0 (score: 0)
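This approach always keeps the first 2 (or max_dups) entries it sees for a given key rather than choosing at random,
but I think it is very efficient: it goes through the list only once and needs just a couple of temporary variables: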
from collections import defaultdict

def remove_duplicates(initials, max_dups):
    # Number of entries kept so far for each 'main' value (missing keys read as 0).
    dup_tracker = defaultdict(int)
    rets = []
    for entry in initials:
        if dup_tracker[entry['main']] < max_dups:
            dup_tracker[entry['main']] += 1
            rets.append(entry)
    return rets
max_dups = 2
initials = [
{'main': 1, 'more': 1},
{'main': 1, 'more': 2},
{'main': 1, 'more': 3},
{'main': 2, 'more': 1},
{'main': 2, 'more': 1},
{'main': 2, 'more': 3},
{'main': 3, 'more': 1}
]
rets = remove_duplicates(initials,max_dups)
print(rets)
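To explain the code: defaultdict(int) creates a dictionary in which every key, even one that has not been set yet, starts at 0. We then iterate over the list and use dup_tracker to track how many entries we have kept for each key. dup_tracker is a dictionary keyed by the value of 'main', and each of its values counts how many times that particular key has been kept so far. If dup_tracker shows fewer than max_dups entries for the current key, the entry is appended to the rets output list, which is returned at the end.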
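If a random choice among the entries for each key is wanted, which the question allows, one possible variant is sketched below. It is not part of the code timed in this answer. The name remove_duplicates_random and its key and rng parameters are made up for illustration. It groups the list positions by key value, samples from any group that is too large with random.sample, and then sorts the kept positions so the result follows the original order.

```python
import random
from collections import defaultdict


def remove_duplicates_random(initials, max_dups, key='main', rng=random):
    # Hypothetical helper: group the positions of the entries by entry[key].
    positions = defaultdict(list)
    for index, entry in enumerate(initials):
        positions[entry[key]].append(index)

    kept = []
    for indexes in positions.values():
        if len(indexes) > max_dups:
            # Too many entries for this value: keep a random subset of them.
            indexes = rng.sample(indexes, max_dups)
        kept.extend(indexes)

    # Sorting the kept positions restores the original list order.
    return [initials[i] for i in sorted(kept)]
```

Passing rng=random.Random(0) makes the selection repeatable. This is still linear in the list length apart from the final sort, but it holds every position in memory, so it uses more space than the single-pass counter.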
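Timing edit: It looks like my approach is faster than yours by roughly three orders of magnitude on this test. I have included below all the code I used to time them.
TL;DR: yours 35.721 seconds
vs. mine 0.016 seconds
when run on a list of 50,000 dicts whose 'main'
values range from 0 to 10,000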
from collections import Counter
import random
import time
import numpy  # needed by remove_duplicates_1

def remove_duplicates_1(initial_list, max_duplicates):
    main_counts = Counter([elem["main"] for elem in initial_list])
    main_values_for_selection = set([main_value for main_value, count in main_counts.items()
                                     if count > max_duplicates])
    result = [elem for elem in initial_list
              if elem["main"] not in main_values_for_selection]
    # The whole list is rescanned once per over-represented value.
    for main_value in main_values_for_selection:
        all_indexes = [index for index, elem in enumerate(initial_list)
                       if elem["main"] == main_value]
        indexes = numpy.random.choice(a=all_indexes, size=max_duplicates, replace=False)
        result += [initial_list[i] for i in indexes]
    return result

def remove_duplicates_2(initials, max_dups):
    dup_tracker = {}
    rets = []
    for entry in initials:
        if entry['main'] not in dup_tracker:
            # First time this 'main' value is seen.
            dup_tracker[entry['main']] = 1
            rets.append(entry)
        elif dup_tracker[entry['main']] < max_dups:
            dup_tracker[entry['main']] += 1
            rets.append(entry)
    return rets

def generate_test_list(num_total, max_main):
    # Build num_total dicts with 'main' drawn at random from 0..max_main.
    test_list = []
    for it in range(num_total):
        main_value = round(random.random() * max_main)
        test_list.append({'main': main_value, 'more': it})
    return test_list

max_duplicates = 2
test_list = generate_test_list(50000,10000)
start = time.time()
rets_1 = remove_duplicates_1(test_list,max_duplicates)
time_1 = time.time()-start
start = time.time()
rets_2 = remove_duplicates_2(test_list,max_duplicates)
time_2 = time.time()-start
print("Yours %s vs mine %s" % (time_1, time_2))
#Results:
#Yours 35.7210621834 vs mine 0.0159771442413
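
A single time.time() measurement like the one above can be noisy for a call that takes only a few milliseconds. As a rough sketch of an alternative, the standard timeit module can repeat the call and report the best run. The names cap_duplicates and make_test_list are made up for this example, cap_duplicates follows the same single-pass idea as remove_duplicates_2, and the fixed seed is an assumption added so that runs are comparable. No timings are quoted here because they depend on the machine.

```python
import random
import timeit
from collections import defaultdict


def cap_duplicates(entries, max_dups):
    # Same single-pass idea as remove_duplicates_2: keep the first max_dups per 'main'.
    seen = defaultdict(int)
    kept = []
    for entry in entries:
        if seen[entry['main']] < max_dups:
            seen[entry['main']] += 1
            kept.append(entry)
    return kept


def make_test_list(num_total, max_main, seed=0):
    # Hypothetical generator; the fixed seed keeps the input identical between runs.
    rng = random.Random(seed)
    return [{'main': rng.randint(0, max_main), 'more': i} for i in range(num_total)]


test_list = make_test_list(50000, 10000)
# 5 repeats of 10 calls each; the minimum is the least disturbed measurement.
runs = timeit.repeat(lambda: cap_duplicates(test_list, 2), number=10, repeat=5)
best = min(runs) / 10
print("single pass: %.4f s per call" % best)
```

The same timeit.repeat call can be pointed at remove_duplicates_1 with a smaller number argument, since each call to it takes much longer.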