I have a program that reads Python objects one at a time (this is fixed) and needs to remove the duplicate objects. The program outputs a list of unique objects.
The pseudo-code is similar to this:
1. Create an empty list to store the unique objects; return it at the end
2. Read in a single object
3. If an identical object is not already in the list, add it to the list
4. Repeat steps 2 and 3 until there are no more objects to read, then terminate and return the list (and the number of duplicate objects that were removed)
The actual code uses set operations to check for duplicates:
#!/usr/bin/python
import MyObject
import pickle

def removeDuplicates(inputFile):
    numDupRemoved = 0
    uniqueObjects = set()
    with open(inputFile, 'rb') as fileIn:
        while True:
            try:
                thisObject = pickle.load(fileIn)
                if thisObject in uniqueObjects:
                    numDupRemoved += 1
                    continue
                else:
                    uniqueObjects.add(thisObject)
            except EOFError:
                break
    print("Number of duplicate objects removed: %d" % numDupRemoved)
    return list(uniqueObjects)
The (simplified) objects look like this (note that all values are integers, so we don't need to worry about floating-point precision errors):
#!/usr/bin/python
class MyObject:
    def __init__(self, attr1, attr2, attr3):
        self.attribute1 = attr1  # List of ints
        self.attribute2 = attr2  # List of lists (each list is a list of ints)
        self.attribute3 = attr3  # List of ints

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return (self.attribute1, self.attribute2, self.attribute3) == \
                   (other.attribute1, other.attribute2, other.attribute3)
        return NotImplemented

    def __hash__(self):
        return self.generateHash()

    def generateHash(self):
        # Convert lists to tuples (lists are unhashable)
        attribute1_tuple = tuple(self.attribute1)
        # Since attribute2 is a list of lists, convert it to a tuple of tuples
        attribute2_tuple = []
        for sublist in self.attribute2:
            attribute2_tuple.append(tuple(sublist))
        attribute2_tuple = tuple(attribute2_tuple)
        attribute3_tuple = tuple(self.attribute3)
        return hash((attribute1_tuple, attribute2_tuple, attribute3_tuple))
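As a quick sanity check (my own minimal sketch, not part of the original program; the class is condensed here so the snippet runs standalone), two instances with identical attribute lists should compare equal, hash identically, and collapse to one entry in a set:

```python
# Condensed restatement of MyObject for a self-contained demo.
class MyObject:
    def __init__(self, attr1, attr2, attr3):
        self.attribute1 = attr1
        self.attribute2 = attr2
        self.attribute3 = attr3

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return (self.attribute1, self.attribute2, self.attribute3) == \
                   (other.attribute1, other.attribute2, other.attribute3)
        return NotImplemented

    def __hash__(self):
        # Same tuple-conversion scheme as generateHash above
        return hash((tuple(self.attribute1),
                     tuple(tuple(s) for s in self.attribute2),
                     tuple(self.attribute3)))

a = MyObject([1, 2], [[3], [4]], [5])
b = MyObject([1, 2], [[3], [4]], [5])   # identical attributes
c = MyObject([1, 2], [[3], [4]], [6])   # differs only in attribute3

print(a == b, hash(a) == hash(b))  # True True
print(len({a, b, c}))              # 2 -- a and b collapse to one entry
```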
However, I now need to track duplicates by a single attribute of MyObject, or by a subset of its attributes. That is, whereas the previous code only removed duplicates in the darker blue region of the diagram below (where two objects are considered duplicates only if all 3 attributes are identical), we now want to: 1. remove duplicates by a subset of attributes (attributes 1 and 2) and/or by a single attribute (attribute 3), and 2. be able to track the 3 disjoint regions in the diagram.
I created two additional classes:
#!/usr/bin/python
class MyObject_sub1:
    def __init__(self, attr1, attr2):
        self.attribute1 = attr1  # List of ints
        self.attribute2 = attr2  # List of lists (each list is a list of ints)

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return (self.attribute1, self.attribute2) == \
                   (other.attribute1, other.attribute2)
        return NotImplemented

    def __hash__(self):
        return self.generateHash()

    def generateHash(self):
        # Convert lists to tuples (lists are unhashable)
        attribute1_tuple = tuple(self.attribute1)
        # Since attribute2 is a list of lists, convert it to a tuple of tuples
        attribute2_tuple = []
        for sublist in self.attribute2:
            attribute2_tuple.append(tuple(sublist))
        attribute2_tuple = tuple(attribute2_tuple)
        return hash((attribute1_tuple, attribute2_tuple))
and
#!/usr/bin/python
class MyObject_sub2:
    def __init__(self, attr3):
        self.attribute3 = attr3  # List of ints

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return self.attribute3 == other.attribute3
        return NotImplemented

    def __hash__(self):
        return hash(tuple(self.attribute3))
The duplicate-removal code was updated as follows:
#!/usr/bin/python
import MyObject
import MyObject_sub1
import MyObject_sub2
import pickle

def removeDuplicates(inputFile):
    # counters
    totalNumDupRemoved = 0
    numDupRemoved_att1Att2Only = 0
    numDupRemoved_allAtts = 0
    numDupRemoved_att3Only = 0
    # sets for duplicate-removal purposes
    uniqueObjects_att1Att2Only = set()
    uniqueObjects_allAtts = set()  # Intersection part in the diagram
    uniqueObjects_att3Only = set()
    with open(inputFile, 'rb') as fileIn:
        while True:
            try:
                thisObject = pickle.load(fileIn)
                # How thisObject_sub1 (a MyObject_sub1) and thisObject_sub2
                # (a MyObject_sub2) are created is omitted for brevity
                if thisObject_sub1 in uniqueObjects_att1Att2Only or thisObject_sub2 in uniqueObjects_att3Only:
                    totalNumDupRemoved += 1
                    if thisObject in uniqueObjects_allAtts:
                        numDupRemoved_allAtts += 1
                    elif thisObject_sub1 in uniqueObjects_att1Att2Only:
                        numDupRemoved_att1Att2Only += 1
                    else:
                        numDupRemoved_att3Only += 1
                    continue
                else:
                    uniqueObjects_att1Att2Only.add(thisObject_sub1)
                    uniqueObjects_allAtts.add(thisObject)  # Intersection part in the diagram
                    uniqueObjects_att3Only.add(thisObject_sub2)
            except EOFError:
                break
    print("Total number of duplicates removed: %d" % totalNumDupRemoved)
    print("Number of duplicates where all attributes are identical: %d" % numDupRemoved_allAtts)
    print("Number of duplicates where attributes 1 and 2 are identical: %d" % numDupRemoved_att1Att2Only)
    print("Number of duplicates where only attribute 3 is identical: %d" % numDupRemoved_att3Only)
    return list(uniqueObjects_allAtts)
What is driving me crazy is that numDupRemoved_allAtts in the second program does not match numDupRemoved in the first program.
For example, both programs read the same file containing roughly 80,000 objects in total, and the outputs are wildly different:

Number of duplicate objects removed: 47,742 (should be the intersection part of the diagram)

Total number of duplicates removed: 66,648
Number of duplicates where all attributes are identical: 18,137 (the intersection of the diagram)
Number of duplicates where attributes 1 and 2 are identical: 46,121 (the left disjoint region of the diagram)
Number of duplicates where only attribute 3 is identical: 2,390 (the right disjoint region of the diagram)

Note that before I tried using multiple Python objects (MyObject_sub1 and MyObject_sub2) and set operations, I tried duplicate checking with tuple equality (checking equality of a single attribute or a subset of attributes), and the numbers still did not match.
Am I missing some fundamental Python concept here? What could be causing this error? Any help would be greatly appreciated.
Answer (score: 1)
Example: suppose the first object processed has attributes (1, 2, 3) and the next has (1, 2, 4). In the first version, both are added as unique (and later exact copies of each are recognized as duplicates).
In the second variant, the first object is recorded in uniqueObjects_att1Att2Only (and the other sets). When the second object arrives,

if thisObject_sub1 in uniqueObjects_att1Att2Only or thisObject_sub2 in uniqueObjects_att3Only:

is true, so the else branch that does the recording is never executed. This means (1, 2, 4) is never added to uniqueObjects_allAtts, and numDupRemoved_allAtts is never incremented for its later exact copies, no matter how often they occur.

Solution: let the duplicate detection for each set happen independently, one after the other. To track totalNumDupRemoved, create a flag that is set to True when any one of the duplicate detections triggers, and increment totalNumDupRemoved if the flag is true.
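A runnable sketch of that fix (my own reconstruction, not the answerer's actual code; plain tuples stand in for MyObject, MyObject_sub1, and MyObject_sub2 so the example is self-contained, and a small in-memory list replaces the pickle stream):

```python
# Each "object" is an (attr1, attr2, attr3) tuple; the sub-views are slices.
objects = [(1, 2, 3), (1, 2, 4), (1, 2, 3), (9, 9, 4), (1, 2, 3)]

totalNumDupRemoved = 0
numDupRemoved_allAtts = 0
numDupRemoved_att1Att2Only = 0
numDupRemoved_att3Only = 0

uniqueObjects_allAtts = set()
uniqueObjects_att1Att2Only = set()
uniqueObjects_att3Only = set()

for obj in objects:
    sub1 = obj[:2]   # attributes 1 and 2
    sub2 = obj[2:]   # attribute 3

    # Check each set independently BEFORE updating any of them.
    dup_all = obj in uniqueObjects_allAtts
    dup_sub1 = sub1 in uniqueObjects_att1Att2Only
    dup_sub2 = sub2 in uniqueObjects_att3Only

    if dup_all:
        numDupRemoved_allAtts += 1
    elif dup_sub1:
        numDupRemoved_att1Att2Only += 1
    elif dup_sub2:
        numDupRemoved_att3Only += 1
    # The flag: any detection triggering counts as one removed duplicate.
    if dup_all or dup_sub1 or dup_sub2:
        totalNumDupRemoved += 1

    # Always record the object in every set, so each set's notion of
    # "seen before" stays complete and the counters stay consistent.
    uniqueObjects_allAtts.add(obj)
    uniqueObjects_att1Att2Only.add(sub1)
    uniqueObjects_att3Only.add(sub2)

print(numDupRemoved_allAtts)   # 2 -- now matches the first program's count
print(totalNumDupRemoved)      # 4
```

On this small stream the first program would also report 2 exact duplicates (three copies of (1, 2, 3)), so numDupRemoved_allAtts agrees with it, while the sub-attribute counters pick up (1, 2, 4) (attributes 1 and 2 seen before) and (9, 9, 4) (attribute 3 seen before).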