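I am parsing 100 files that follow a similar format. From each file I build a dictionary that may have two keys or more than two, where each value is a set. In every case there is one key whose set contains the value 'Y'. For that key, I need to remove any duplicate values that also exist under the other keys.
I had a similar question where I only had two keys, and it was solved: Python: How to compare values of different keys in dictionary and then delete duplicates?
The code below works fine when the dictionary has exactly two keys, but not when it has more than two.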
for d, p in zip(temp_list, temp_search_list):
    temp2[d].add(p)  # dictionary with delvt and pin names for cell
for test_d, test_p in temp2.items():
    if not re.search('Y', ' '.join(test_p)):
        tp = temp2[test_d]
    else:
        temp2[test_d] = [t for t in temp2[test_d] if t not in tp]
Here is a sample dictionary with three keys, but depending on the file being parsed I can have more keys.
temp2 = {'0.1995': set(['X7:GATE', 'X3:GATE', 'IN1']), '0.199533': set(['X4:GATE', 'X8:GATE', 'IN2']), '0.399': set(['X3:GATE', 'X5:GATE', 'X1:GATE', 'IN0', 'X4:GATE', 'Y', 'X8:GATE'])}
Expected output:
temp2
{'0.1995': set(['X7:GATE', 'X3:GATE','IN1']), '0.199533': set(['X4:GATE', 'X8:GATE', 'IN2']), '0.399': set(['X5:GATE', 'X1:GATE', 'IN0', 'Y'])}
Answer 0 (score: 1)
You can do the whole thing with a single loop, the one that has to traverse the entire dataset anyway.
from collections import defaultdict
target = None
result = defaultdict(set)
occurance_dict = defaultdict(int)
# Loop over the inputs, building the result, counting the
# number of occurrences for each value as you go and marking
# the key that contains 'Y'
for key, value in zip(temp_list, temp_search_list):
    # This is here so we don't count values twice if there
    # is more than one instance of the value for the given
    # key. If we don't do this, if a value only exists in
    # the 'Y' set, but it occurs multiple times in the input,
    # we would still filter it out later on.
    if value not in result[key]:
        occurance_dict[value] += 1
        result[key].add(value)
    if value == 'Y':
        if target is None:
            target = key
        elif target != key:
            # A repeated 'Y' under the same key is fine; a 'Y' under
            # a second key is an error.
            raise ValueError('Dataset contains more than 1 entry containing "Y"')
if target is None:
    raise ValueError('Dataset contains no entry containing "Y"')
# Filter the marked ('Y' containing) entry; if there is more than
# 1 occurrence of the given value, then it exists in another entry
# so we don't want it in the 'Y' entry
result[target] = {value for value in result[target] if occurance_dict[value] == 1}
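Yes, occurance_dict is very similar to collections.Counter, but I would rather not iterate over the dataset twice (even if it happens behind the scenes) when I don't need to, and this way we also don't count a second occurrence of a given value under the same key.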
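For readers on Python 3, the single-pass idea above can be wrapped in a function and run against the question's sample data. This is only a minimal sketch: the function name dedupe_y_entry, its marker parameter, and the pairs list are hypothetical, with pairs standing in for zip(temp_list, temp_search_list), which the question does not show.

```python
from collections import defaultdict


def dedupe_y_entry(pairs, marker='Y'):
    """Group (key, value) pairs into sets, then strip shared values
    from the set that contains the marker.

    Hypothetical helper; `pairs` plays the role of
    zip(temp_list, temp_search_list) from the question.
    """
    result = defaultdict(set)
    seen_in = defaultdict(int)  # number of distinct keys each value appears under
    target = None
    for key, value in pairs:
        if value not in result[key]:
            seen_in[value] += 1
            result[key].add(value)
        if value == marker:
            if target is None:
                target = key
            elif target != key:
                raise ValueError('more than one entry contains %r' % marker)
    if target is None:
        raise ValueError('no entry contains %r' % marker)
    # Keep only the values that appear under the marker's key and nowhere else.
    result[target] = {v for v in result[target] if seen_in[v] == 1}
    return dict(result)


# Made-up input pairs that reproduce the question's sample temp2.
pairs = [
    ('0.1995', 'X7:GATE'), ('0.1995', 'X3:GATE'), ('0.1995', 'IN1'),
    ('0.199533', 'X4:GATE'), ('0.199533', 'X8:GATE'), ('0.199533', 'IN2'),
    ('0.399', 'X3:GATE'), ('0.399', 'X5:GATE'), ('0.399', 'X1:GATE'),
    ('0.399', 'IN0'), ('0.399', 'X4:GATE'), ('0.399', 'Y'), ('0.399', 'X8:GATE'),
    ('0.399', 'IN0'),  # repeated pair: must not cause IN0 to be filtered out
]
cleaned = dedupe_y_entry(pairs)
print(sorted(cleaned['0.399']))  # ['IN0', 'X1:GATE', 'X5:GATE', 'Y']
```

The repeated ('0.399', 'IN0') pair exercises the guard described in the comments: because the count only goes up the first time a value is seen under a key, IN0 survives the final filter.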
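Answer 1 (score: 1)
You need to separate the search for the Y value from the pass over the rest of the data. Ideally you do this while you are already building temp2, to avoid an unnecessary extra loop: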
y_key = None
for d, p in zip(temp_list, temp_search_list):
    temp2[d].add(p)
    if p == 'Y':
        y_key = d
Next, removing the duplicate values is easiest with set.difference_update(), which changes the set in place:
y_values = temp2[y_key]
for test_d, test_p in temp2.iteritems():
    if test_d == y_key:
        continue
    y_values.difference_update(test_p)
Using your sample temp2, and assuming y_key was set while temp2 was being built, the result of the second loop is:
>>> temp2 = {'0.1995': set(['X7:GATE', 'X3:GATE', 'IN1']), '0.199533': set(['X4:GATE', 'X8:GATE', 'IN2']), '0.399': set(['X3:GATE', 'X5:GATE', 'X1:GATE', 'IN0', 'X4:GATE', 'Y', 'X8:GATE'])}
>>> y_key = '0.399'
>>> y_values = temp2[y_key]
>>> for test_d, test_p in temp2.iteritems():
...     if test_d == y_key:
...         continue
...     y_values.difference_update(test_p)
...
>>> temp2
{'0.1995': set(['X7:GATE', 'X3:GATE', 'IN1']), '0.199533': set(['X4:GATE', 'X8:GATE', 'IN2']), '0.399': set(['X5:GATE', 'X1:GATE', 'IN0', 'Y'])}
Note that the values X3:GATE, X4:GATE, and X8:GATE have been removed from the 0.399 set.
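The session above is Python 2 (iteritems() and the set([...]) repr). On Python 3, where iteritems() no longer exists, the same in-place removal could look roughly like the sketch below. The function name remove_shared is hypothetical; it relies on set.difference_update() accepting several iterables at once.

```python
def remove_shared(groups, y_key):
    """Remove from groups[y_key] every value found in any other set.

    Hypothetical helper; y_key is assumed to have been recorded
    while the dictionary was being built, as in the answer above.
    """
    others = (values for key, values in groups.items() if key != y_key)
    # difference_update() takes any number of iterables and edits the set in place.
    groups[y_key].difference_update(*others)


temp2 = {
    '0.1995': {'X7:GATE', 'X3:GATE', 'IN1'},
    '0.199533': {'X4:GATE', 'X8:GATE', 'IN2'},
    '0.399': {'X3:GATE', 'X5:GATE', 'X1:GATE', 'IN0', 'X4:GATE', 'Y', 'X8:GATE'},
}
remove_shared(temp2, '0.399')
print(sorted(temp2['0.399']))  # ['IN0', 'X1:GATE', 'X5:GATE', 'Y']
```

Because the set is modified in place, any other reference to temp2['0.399'] sees the change as well.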
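Answer 2 (score: 0)
I was hoping to come up with a cute approach using list comprehensions and/or the itertools module, but I couldn't. I would start with something like this: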
dict1 = {1: set([1,2,3,4,5]),
         2: set([3,4,5,6]),
         3: set([1,7,8,9])
        }
list1 = dict1.items()
newDict = {}
for i in range(len(list1)):
    (k1,set1) = list1[i]
    newDict[k1] = set1
    for j in range(i+1,len(list1)):
        (k2, set2) = list1[j]
        # Store the reduced set back into list1 so that later passes
        # of the outer loop pick it up instead of the original set.
        list1[j] = (k2, set2 - (set1 & set2))
print newDict
# {1: set([1, 2, 3, 4, 5]), 2: set([6]), 3: set([8, 9, 7])}
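If you have huge dictionaries, this may not be very efficient.
Another idea: are these sets so long that you can't simply build a collections.Counter? You would first go through the dict, take the members of each set and put them into a counter (this can probably be done in one line with a list comprehension). Then loop over originalDict.iteritems(). Into a new dictionary you insert each key (e.g. 0.1995) with the original set as its value, filtered (I think using & as above) so that it only contains entries whose count in the counter is > 0. Every element you insert into the new dictionary gets removed from the counter (even if its count is > 1). At the end of the day, you still need to loop twice.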
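One possible reading of the Counter idea, narrowed to what the question asks for (only the set containing 'Y' is filtered), is sketched below in Python 3. It is a simplified variation rather than the exact procedure described above: it counts how many sets each value appears in and keeps, in the 'Y' set, only the values that appear once. The variable names counts and y_key are assumptions for illustration.

```python
from collections import Counter
from itertools import chain

temp2 = {
    '0.1995': {'X7:GATE', 'X3:GATE', 'IN1'},
    '0.199533': {'X4:GATE', 'X8:GATE', 'IN2'},
    '0.399': {'X3:GATE', 'X5:GATE', 'X1:GATE', 'IN0', 'X4:GATE', 'Y', 'X8:GATE'},
}

# First pass: each set contributes a value at most once, so a count
# above 1 means the value lives in more than one set.
counts = Counter(chain.from_iterable(temp2.values()))

# Second pass: locate the key whose set holds 'Y'
# (next() raises StopIteration if no set contains it).
y_key = next(key for key, values in temp2.items() if 'Y' in values)

# Keep only the values that are unique to the 'Y' set.
temp2[y_key] = {value for value in temp2[y_key] if counts[value] == 1}
print(sorted(temp2[y_key]))  # ['IN0', 'X1:GATE', 'X5:GATE', 'Y']
```

As the answer concedes, this still walks the data twice: once to fill the counter and once to find and filter the 'Y' set.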
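Answer 3 (score: 0)
This seems pretty straightforward to me. First find the key that has 'Y' in its set of values, then go through all the other sets of values and remove their members from that set.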
temp2 = {'0.1995': set(['X7:GATE', 'X3:GATE', 'IN1']),
         '0.199533':set(['X4:GATE', 'X8:GATE', 'IN2']),
         '0.399': set(['X3:GATE', 'X5:GATE', 'X1:GATE', 'IN0', 'X4:GATE', 'Y', 'X8:GATE'])}
# Find the key whose value set contains 'Y'
y_key = None
for k,v in temp2.iteritems():
    if 'Y' in v:
        y_key = k
        break
if y_key is None:
    print "no 'Y' found in values"
    exit()
# Subtract every other set from the 'Y' set, in place
for k,v in temp2.iteritems():
    if k != y_key:
        temp2[y_key] -= v
print 'temp2 = {'
for k,v in temp2.iteritems():
    print ' {!r}: {!r},'.format(k,v)
print '}'
Output:
temp2 = {
'0.1995': set(['X7:GATE', 'X3:GATE', 'IN1']),
'0.199533': set(['X4:GATE', 'X8:GATE', 'IN2']),
'0.399': set(['X5:GATE', 'X1:GATE', 'IN0', 'Y']),
}