Python: removing duplicate objects by a subset of their attributes or by a single attribute

Time: 2018-01-03 22:54:20

Tags: python object duplicates

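I have a program that reads in Python objects one at a time (this part is fixed) and needs to remove the duplicates. The program outputs a list of unique objects.

The pseudo-code looks like this: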

1. Create an empty list to store unique objects and return at the end
2. Read in a single object
3. If the identical object is not in the list, add to the list
4. Repeat 2 and 3 until no more objects to read, then terminate and return the list (and the number of duplicate objects that were removed).

The actual code uses a set to check for duplicates:

#!/usr/bin/python
import MyObject
import pickle


def removeDuplicates(inputFile):
    # Wrapped in a function (the name is illustrative) so that `return` is valid
    numDupRemoved = 0
    uniqueObjects = set()

    with open(inputFile, 'rb') as fileIn:
        while True:
            try:
                thisObject = pickle.load(fileIn)
                if thisObject in uniqueObjects:
                    numDupRemoved += 1
                    continue
                else:
                    uniqueObjects.add(thisObject)
            except EOFError:
                break
        print("Number of duplicate objects removed: %d" % numDupRemoved)
    return list(uniqueObjects)

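The (simplified) object looks like this (note that all values are integers, so we do not need to worry about floating-point precision errors):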

#!/usr/bin/python
class MyObject:
    def __init__(self, attr1, attr2, attr3):
        self.attribute1 = attr1  # List of ints
        self.attribute2 = attr2  # List of lists (each list is a list of ints)
        self.attribute3 = attr3  # List of ints

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return (self.attribute1, self.attribute2, self.attribute3) == (other.attribute1, other.attribute2, other.attribute3)
        return NotImplemented  # let Python fall back for other types

    def __hash__(self):
        return self.generateHash()

    def generateHash(self):
        # Convert lists to tuples 
        attribute1_tuple = tuple(self.attribute1)

        # Since attribute2 is list of list, convert to tuple of tuple
        attribute2_tuple = []
        for sublist in self.attribute2:
            attribute2_tuple.append(tuple(sublist))
        attribute2_tuple = tuple(attribute2_tuple)

        attribute3_tuple = tuple(self.attribute3)

        return hash((attribute1_tuple, attribute2_tuple, attribute3_tuple))

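However, I now need to track duplicates by a single attribute or by a subset of the attributes of MyObject. That is, whereas the previous code only removed duplicates in the darker blue region of the diagram below (where two objects count as duplicates only if all 3 attributes are identical), we now want to:

1. Remove duplicates by a subset of attributes (attributes 1 and 2) and/or by a single attribute (attribute 3)
2. Be able to track the 3 disjoint regions in the diagram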

(image: diagram showing the three regions)

I created two more classes:

#!/usr/bin/python
class MyObject_sub1:
    def __init__(self, attr1, attr2):
        self.attribute1 = attr1  # List of ints
        self.attribute2 = attr2  # List of lists (each list is a list of ints)

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return (self.attribute1, self.attribute2) == (other.attribute1, other.attribute2)
        return NotImplemented  # let Python fall back for other types

    def __hash__(self):
        return self.generateHash()

    def generateHash(self):
        # Convert lists to tuples 
        attribute1_tuple = tuple(self.attribute1)

        # Since attribute2 is list of list, convert to tuple of tuple
        attribute2_tuple = []
        for sublist in self.attribute2:
            attribute2_tuple.append(tuple(sublist))
        attribute2_tuple = tuple(attribute2_tuple)

        return hash((attribute1_tuple, attribute2_tuple))

#!/usr/bin/python
class MyObject_sub2:
    def __init__(self, attr3):
        self.attribute3 = attr3  # List of ints

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return self.attribute3 == other.attribute3
        return NotImplemented  # let Python fall back for other types

    def __hash__(self):
        return hash(tuple(self.attribute3))

The duplicate-removal code was updated as follows:

#!/usr/bin/python
import MyObject
import MyObject_sub1
import MyObject_sub2
import pickle


def removeDuplicates(inputFile):
    # Wrapped in a function (the name is illustrative) so that `return` is valid
    # counters
    totalNumDupRemoved = 0
    numDupRemoved_att1Att2Only = 0
    numDupRemoved_allAtts = 0
    numDupRemoved_att3Only = 0

    # sets for duplicate removal purposes
    uniqueObjects_att1Att2Only = set()
    uniqueObjects_allAtts = set()  # Intersection part in the diagram
    uniqueObjects_att3Only = set()

    with open(inputFile, 'rb') as fileIn:
        while True:
            try:
                thisObject = pickle.load(fileIn)
                # I will omit how thisObject_sub1 (MyObject_sub1) and thisObject_sub2 (MyObject_sub2) are created for brevity

                if thisObject_sub1 in uniqueObjects_att1Att2Only or thisObject_sub2 in uniqueObjects_att3Only:
                    totalNumDupRemoved += 1
                    if thisObject in uniqueObjects_allAtts:
                        numDupRemoved_allAtts += 1
                    elif thisObject_sub1 in uniqueObjects_att1Att2Only:
                        numDupRemoved_att1Att2Only += 1
                    else:
                        numDupRemoved_att3Only += 1
                    continue
                else:
                    uniqueObjects_att1Att2Only.add(thisObject_sub1)
                    uniqueObjects_allAtts.add(thisObject)  # Intersection part in the diagram
                    uniqueObjects_att3Only.add(thisObject_sub2)
            except EOFError:
                break
        print("Total number of duplicates removed: %d" % totalNumDupRemoved)
        print("Number of duplicates where all attributes are identical: %d" % numDupRemoved_allAtts)
        print("Number of duplicates where attributes 1 and 2 are identical: %d" % numDupRemoved_att1Att2Only)
        print("Number of duplicates where only attribute 3 are identical: %d" % numDupRemoved_att3Only)
    return list(uniqueObjects_allAtts)

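What is driving me crazy is that "numDupRemoved_allAtts" in the second program does not match "numDupRemoved" in the first program.

For example, both programs read the same file, which contains roughly 80,000 objects in total, and the outputs differ widely: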

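First program output

Number of duplicate objects removed: 47,742 (should be the intersection part of the diagram)

Second program output

Total number of duplicates removed: 66,648

Number of duplicates where all attributes are identical: 18,137 (intersection part of the diagram)

Number of duplicates where attributes 1 and 2 are identical: 46,121 (left disjoint region of the diagram)

Number of duplicates where only attribute 3 are identical: 2,390 (right disjoint region of the diagram)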

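Note that before I tried using multiple Python objects (MyObject_sub1 and MyObject_sub2) and set operations, I tried doing the duplicate check with tuple equality (checking equality of a single attribute or of a subset of attributes), but the numbers still did not match.

Am I missing some basic Python concept here? What could be causing this discrepancy? Any help would be greatly appreciated.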

1 Answer:

Answer 0 (score: 1)

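Example: if the first processed object has the attributes (1, 2, 3) and the next one has (1, 2, 4), then in the first variant both are added as unique (and recognized later on).

In the second variant, the first object is recorded in uniqueObjects_att1Att2Only (and in the other sets). When the second object now arrives,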

if thisObject_sub1 in uniqueObjects_att1Att2Only or thisObject_sub2 in uniqueObjects_att3Only:

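is true, and the else part that records the object into uniqueObjects_allAtts is not executed. This means that (1, 2, 4) is never added to uniqueObjects_allAtts and never increments numDupRemoved_allAtts, no matter how often it appears.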
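The effect can be reproduced with a small sketch. It assumes plain tuples (attr1, attr2, attr3) stand in for MyObject instances, and the function names are made up for illustration; only the counting logic mirrors the two programs from the question.

```python
def count_all_attr_duplicates_v1(objects):
    """Mirrors the first program: one set keyed on all three attributes."""
    seen = set()
    dups = 0
    for obj in objects:
        if obj in seen:
            dups += 1
        else:
            seen.add(obj)
    return dups


def count_all_attr_duplicates_v2(objects):
    """Mirrors the second program: the full-object set is only
    updated when neither sub-key has been seen before."""
    seen_12, seen_3, seen_all = set(), set(), set()
    dups_all = 0
    for a1, a2, a3 in objects:
        if (a1, a2) in seen_12 or a3 in seen_3:
            if (a1, a2, a3) in seen_all:
                dups_all += 1
            continue  # (a1, a2, a3) is never recorded on this path
        seen_12.add((a1, a2))
        seen_3.add(a3)
        seen_all.add((a1, a2, a3))
    return dups_all


stream = [(1, 2, 3), (1, 2, 4), (1, 2, 4), (1, 2, 4)]
print(count_all_attr_duplicates_v1(stream))  # 2
print(count_all_attr_duplicates_v2(stream))  # 0
```

The repeated (1, 2, 4) objects are exact duplicates of each other, yet the second variant never counts them, because the first (1, 2, 4) was rejected on its (attr1, attr2) key before it could be stored.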

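Solution: let the duplicate detection for each set happen independently, one after the other.

To record totalNumDupRemoved, create a flag that is set to True when one of the duplicate detections triggers, and increment totalNumDupRemoved if the flag is true.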
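One possible shape of that fix, as a rough sketch rather than a drop-in replacement: it assumes each object is a hashable tuple (attr1, attr2, attr3) instead of a MyObject instance, and the function name `dedupe` and the counter keys are invented for this example.

```python
def dedupe(objects):
    """Check every key independently; `objects` yields (attr1, attr2, attr3)."""
    seen_all, seen_12, seen_3 = set(), set(), set()
    counts = {"total": 0, "all": 0, "att1_att2": 0, "att3": 0}

    for a1, a2, a3 in objects:
        is_duplicate = False  # set by any of the three detections below

        # Detection 1: all attributes identical
        if (a1, a2, a3) in seen_all:
            counts["all"] += 1
            is_duplicate = True
        else:
            seen_all.add((a1, a2, a3))

        # Detection 2: attributes 1 and 2 identical
        if (a1, a2) in seen_12:
            counts["att1_att2"] += 1
            is_duplicate = True
        else:
            seen_12.add((a1, a2))

        # Detection 3: attribute 3 identical
        if a3 in seen_3:
            counts["att3"] += 1
            is_duplicate = True
        else:
            seen_3.add(a3)

        if is_duplicate:
            counts["total"] += 1

    return sorted(seen_all), counts


unique, counts = dedupe([(1, 2, 3), (1, 2, 4), (1, 2, 4), (1, 2, 4)])
print(unique)  # [(1, 2, 3), (1, 2, 4)]
print(counts)  # {'total': 3, 'all': 2, 'att1_att2': 3, 'att3': 2}
```

With this layout the "all" counter agrees with the first program's count for the same input. Note that the three per-key counters overlap here (an exact duplicate is also a duplicate on both sub-keys), so the disjoint regions of the diagram would have to be derived in a further step.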