I have a program that reads Python objects one at a time (this is fixed) and needs to remove the duplicate objects. The program outputs a list of unique objects.
The pseudo-code is similar to this:
1. Create an empty list to store the unique objects; return it at the end
2. Read in a single object
3. If an identical object is not already in the list, add it to the list
4. Repeat steps 2 and 3 until there are no more objects to read, then terminate and return the list (and the number of duplicate objects that were removed)
The actual code uses set operations to check for duplicates:
#!/usr/bin/python
import MyObject
import pickle

def removeDuplicates(inputFile):
    numDupRemoved = 0
    uniqueObjects = set()
    with open(inputFile, 'rb') as fileIn:
        while True:
            try:
                thisObject = pickle.load(fileIn)
                if thisObject in uniqueObjects:
                    numDupRemoved += 1
                    continue
                else:
                    uniqueObjects.add(thisObject)
            except EOFError:
                break
    print("Number of duplicate objects removed: %d" % numDupRemoved)
    return list(uniqueObjects)
The (simplified) objects look like this (note that all values are integers, so we don't need to worry about floating-point precision errors):
#!/usr/bin/python
class MyObject:
    def __init__(self, attr1, attr2, attr3):
        self.attribute1 = attr1  # List of ints
        self.attribute2 = attr2  # List of lists (each list is a list of ints)
        self.attribute3 = attr3  # List of ints

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return (self.attribute1, self.attribute2, self.attribute3) == \
                   (other.attribute1, other.attribute2, other.attribute3)
        return NotImplemented

    def __hash__(self):
        return self.generateHash()

    def generateHash(self):
        # Convert lists to tuples (lists are unhashable)
        attribute1_tuple = tuple(self.attribute1)
        # Since attribute2 is a list of lists, convert it to a tuple of tuples
        attribute2_tuple = []
        for sublist in self.attribute2:
            attribute2_tuple.append(tuple(sublist))
        attribute2_tuple = tuple(attribute2_tuple)
        attribute3_tuple = tuple(self.attribute3)
        return hash((attribute1_tuple, attribute2_tuple, attribute3_tuple))
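As a quick sanity check (my own minimal sketch, not part of the original program; the class is condensed here so the snippet runs standalone), two instances with identical attribute lists should compare equal, hash identically, and collapse to one entry in a set:

```python
# Condensed restatement of MyObject for a self-contained demo.
class MyObject:
    def __init__(self, attr1, attr2, attr3):
        self.attribute1 = attr1
        self.attribute2 = attr2
        self.attribute3 = attr3

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return (self.attribute1, self.attribute2, self.attribute3) == \
                   (other.attribute1, other.attribute2, other.attribute3)
        return NotImplemented

    def __hash__(self):
        # Same tuple-conversion scheme as generateHash above
        return hash((tuple(self.attribute1),
                     tuple(tuple(s) for s in self.attribute2),
                     tuple(self.attribute3)))

a = MyObject([1, 2], [[3], [4]], [5])
b = MyObject([1, 2], [[3], [4]], [5])   # identical attributes
c = MyObject([1, 2], [[3], [4]], [6])   # differs only in attribute3

print(a == b, hash(a) == hash(b))  # True True
print(len({a, b, c}))              # 2 -- a and b collapse to one entry
```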
However, I now need to track duplicates by a single attribute of MyObject, or by a subset of its attributes. That is, whereas the previous code only removed duplicates in the darker blue region of the diagram below (where two objects are considered duplicates only if all 3 attributes are identical), we now want to: 1. remove duplicates by a subset of attributes (attributes 1 and 2) and/or by a single attribute (attribute 3), and 2. be able to track the 3 disjoint regions in the diagram.
I created two additional classes:
#!/usr/bin/python
class MyObject_sub1:
    def __init__(self, attr1, attr2):
        self.attribute1 = attr1  # List of ints
        self.attribute2 = attr2  # List of lists (each list is a list of ints)

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return (self.attribute1, self.attribute2) == \
                   (other.attribute1, other.attribute2)
        return NotImplemented

    def __hash__(self):
        return self.generateHash()

    def generateHash(self):
        # Convert lists to tuples (lists are unhashable)
        attribute1_tuple = tuple(self.attribute1)
        # Since attribute2 is a list of lists, convert it to a tuple of tuples
        attribute2_tuple = []
        for sublist in self.attribute2:
            attribute2_tuple.append(tuple(sublist))
        attribute2_tuple = tuple(attribute2_tuple)
        return hash((attribute1_tuple, attribute2_tuple))
and
#!/usr/bin/python
class MyObject_sub2:
    def __init__(self, attr3):
        self.attribute3 = attr3  # List of ints

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return self.attribute3 == other.attribute3
        return NotImplemented

    def __hash__(self):
        return hash(tuple(self.attribute3))
The duplicate-removal code was updated as follows:
#!/usr/bin/python
import MyObject
import MyObject_sub1
import MyObject_sub2
import pickle

def removeDuplicates(inputFile):
    # counters
    totalNumDupRemoved = 0
    numDupRemoved_att1Att2Only = 0
    numDupRemoved_allAtts = 0
    numDupRemoved_att3Only = 0
    # sets for duplicate-removal purposes
    uniqueObjects_att1Att2Only = set()
    uniqueObjects_allAtts = set()  # Intersection part in the diagram
    uniqueObjects_att3Only = set()
    with open(inputFile, 'rb') as fileIn:
        while True:
            try:
                thisObject = pickle.load(fileIn)
                # How thisObject_sub1 (a MyObject_sub1) and thisObject_sub2
                # (a MyObject_sub2) are created is omitted for brevity
                if thisObject_sub1 in uniqueObjects_att1Att2Only or thisObject_sub2 in uniqueObjects_att3Only:
                    totalNumDupRemoved += 1
                    if thisObject in uniqueObjects_allAtts:
                        numDupRemoved_allAtts += 1
                    elif thisObject_sub1 in uniqueObjects_att1Att2Only:
                        numDupRemoved_att1Att2Only += 1
                    else:
                        numDupRemoved_att3Only += 1
                    continue
                else:
                    uniqueObjects_att1Att2Only.add(thisObject_sub1)
                    uniqueObjects_allAtts.add(thisObject)  # Intersection part in the diagram
                    uniqueObjects_att3Only.add(thisObject_sub2)
            except EOFError:
                break
    print("Total number of duplicates removed: %d" % totalNumDupRemoved)
    print("Number of duplicates where all attributes are identical: %d" % numDupRemoved_allAtts)
    print("Number of duplicates where attributes 1 and 2 are identical: %d" % numDupRemoved_att1Att2Only)
    print("Number of duplicates where only attribute 3 is identical: %d" % numDupRemoved_att3Only)
    return list(uniqueObjects_allAtts)
What is driving me crazy is that numDupRemoved_allAtts in the second program does not match numDupRemoved in the first program.
For example, both programs read the same file containing roughly 80,000 objects in total, and the outputs are wildly different:

Number of duplicate objects removed: 47,742 (should be the intersection part of the diagram)

Total number of duplicates removed: 66,648
Number of duplicates where all attributes are identical: 18,137 (the intersection of the diagram)
Number of duplicates where attributes 1 and 2 are identical: 46,121 (the left disjoint region of the diagram)
Number of duplicates where only attribute 3 is identical: 2,390 (the right disjoint region of the diagram)

Note that before I tried using multiple Python objects (MyObject_sub1 and MyObject_sub2) and set operations, I tried duplicate checking with tuple equality (checking equality of a single attribute or a subset of attributes), and the numbers still did not match.
Am I missing some fundamental Python concept here? What could be causing this error? Any help would be greatly appreciated.
Answer (score: 1)
Example: suppose the first object processed has attributes (1, 2, 3) and the next has (1, 2, 4). In the first version, both are added as unique (and later exact copies of each are recognized as duplicates).
In the second variant, the first object is recorded in uniqueObjects_att1Att2Only (and the other sets). When the second object arrives,

if thisObject_sub1 in uniqueObjects_att1Att2Only or thisObject_sub2 in uniqueObjects_att3Only:

is true, so the else branch that does the recording is never executed. This means (1, 2, 4) is never added to uniqueObjects_allAtts, and numDupRemoved_allAtts is never incremented for its later exact copies, no matter how often they occur.

Solution: let the duplicate detection for each set happen independently, one after the other. To track totalNumDupRemoved, create a flag that is set to True when any one of the duplicate detections triggers, and increment totalNumDupRemoved if the flag is true.
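A runnable sketch of that fix (my own reconstruction, not the answerer's actual code; plain tuples stand in for MyObject, MyObject_sub1, and MyObject_sub2 so the example is self-contained, and a small in-memory list replaces the pickle stream):

```python
# Each "object" is an (attr1, attr2, attr3) tuple; the sub-views are slices.
objects = [(1, 2, 3), (1, 2, 4), (1, 2, 3), (9, 9, 4), (1, 2, 3)]

totalNumDupRemoved = 0
numDupRemoved_allAtts = 0
numDupRemoved_att1Att2Only = 0
numDupRemoved_att3Only = 0

uniqueObjects_allAtts = set()
uniqueObjects_att1Att2Only = set()
uniqueObjects_att3Only = set()

for obj in objects:
    sub1 = obj[:2]   # attributes 1 and 2
    sub2 = obj[2:]   # attribute 3

    # Check each set independently BEFORE updating any of them.
    dup_all = obj in uniqueObjects_allAtts
    dup_sub1 = sub1 in uniqueObjects_att1Att2Only
    dup_sub2 = sub2 in uniqueObjects_att3Only

    if dup_all:
        numDupRemoved_allAtts += 1
    elif dup_sub1:
        numDupRemoved_att1Att2Only += 1
    elif dup_sub2:
        numDupRemoved_att3Only += 1
    # The flag: any detection triggering counts as one removed duplicate.
    if dup_all or dup_sub1 or dup_sub2:
        totalNumDupRemoved += 1

    # Always record the object in every set, so each set's notion of
    # "seen before" stays complete and the counters stay consistent.
    uniqueObjects_allAtts.add(obj)
    uniqueObjects_att1Att2Only.add(sub1)
    uniqueObjects_att3Only.add(sub2)

print(numDupRemoved_allAtts)   # 2 -- now matches the first program's count
print(totalNumDupRemoved)      # 4
```

On this small stream the first program would also report 2 exact duplicates (three copies of (1, 2, 3)), so numDupRemoved_allAtts agrees with it, while the sub-attribute counters pick up (1, 2, 4) (attributes 1 and 2 seen before) and (9, 9, 4) (attribute 3 seen before).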