Question

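I've created a subclass of numpy ndarray following the numpy documentation. In particular, I have added a custom attribute by modifying the code provided.

I'm manipulating instances of this class within a parallel loop, using Python multiprocessing. As I understand it, the way the scope is essentially "copied" to multiple threads is by using pickle.

The problem I am now running into relates to the way numpy arrays are pickled. I can't find any comprehensive documentation about this, but some discussions between the dill developers suggest that I should be focusing on the __reduce__ method, which is called upon pickling.

Can anyone shed any more light on this? The minimal working example is really just the numpy example code I linked to above, copied here for completeness: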

import numpy as np

class RealisticInfoArray(np.ndarray):

    def __new__(cls, input_array, info=None):
        # Input array is an already formed ndarray instance
        # We first cast to be our class type
        obj = np.asarray(input_array).view(cls)
        # add the new attribute to the created instance
        obj.info = info
        # Finally, we must return the newly created object:
        return obj

    def __array_finalize__(self, obj):
        # see InfoArray.__array_finalize__ for comments
        if obj is None: return
        self.info = getattr(obj, 'info', None)

Now here is the problem:

import pickle

obj = RealisticInfoArray([1, 2, 3], info='foo')
print(obj.info)  # 'foo'

pickle_str = pickle.dumps(obj)
new_obj = pickle.loads(pickle_str)
print(new_obj.info)  # raises AttributeError

Thanks.

Answer 1

np.ndarray uses __reduce__ to pickle itself. We can take a look at what it actually returns when you call that function to get an idea of what's going on:

>>> obj = RealisticInfoArray([1, 2, 3], info='foo')
>>> obj.__reduce__()
(<built-in function _reconstruct>, (<class 'pick.RealisticInfoArray'>, (0,), 'b'), (1, (3,), dtype('int64'), False, '\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00'))

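So, we get a 3-tuple back. The docs for __reduce__ describe what each element is doing:

When a tuple is returned, it must be between two and five elements long. Optional elements can either be omitted, or None can be provided as their value. The contents of this tuple are pickled as normal and used to reconstruct the object at unpickling time. The semantics of each element are:

1. A callable object that will be called to create the initial version of the object. The next element of the tuple will provide arguments for this callable, and later elements provide additional state information that will subsequently be used to fully reconstruct the pickled data.

   In the unpickling environment this object must be either a class, a callable registered as a "safe constructor" (see below), or it must have an attribute __safe_for_unpickling__ with a true value. Otherwise, an UnpicklingError will be raised in the unpickling environment. Note that as usual, the callable itself is pickled by name.

2. A tuple of arguments for the callable object.

3. Optionally, the object's state, which will be passed to the object's __setstate__() method as described in section Pickling and unpickling normal class instances. If the object has no __setstate__() method, then, as above, the value must be a dictionary and it will be added to the object's __dict__.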
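To make the three elements concrete, here is a minimal sketch using a plain Python class rather than a numpy array. The names Point and _make_point are hypothetical and exist only for this illustration; the sketch assumes the code runs as a normal script or module so that pickle can find both names by reference.

```python
import pickle


def _make_point(x, y):
    # Element 1 of the tuple: a module-level callable that builds the
    # initial object from the argument tuple (element 2).
    return Point(x, y)


class Point(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y
        self.label = None

    def __reduce__(self):
        # (callable, arguments for the callable, state for __setstate__)
        return (_make_point, (self.x, self.y), {"label": self.label})

    def __setstate__(self, state):
        # Called with element 3 after _make_point(x, y) has rebuilt the object.
        self.label = state["label"]


p = Point(1, 2)
p.label = "corner"
q = pickle.loads(pickle.dumps(p))
print(q.x, q.y, q.label)
```

The callable and its arguments rebuild a bare object, and everything else travels in the third element, which is exactly the slot np.ndarray uses for its shape, dtype and raw data.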

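So, _reconstruct is the function called to rebuild the object, (<class 'pick.RealisticInfoArray'>, (0,), 'b') are the arguments passed to that function, and (1, (3,), dtype('int64'), False, '\x01\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00') gets passed to the class's __setstate__. This gives us an opportunity: we can override __reduce__ and provide our own tuple to __setstate__, and then additionally override __setstate__ to set our custom attribute when we unpickle. We just need to make sure we preserve all the data the parent class needs, and call the parent's __setstate__ too: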

class RealisticInfoArray(np.ndarray):
    def __new__(cls, input_array, info=None):
        obj = np.asarray(input_array).view(cls)
        obj.info = info
        return obj

    def __array_finalize__(self, obj):
        if obj is None: return
        self.info = getattr(obj, 'info', None)

    def __reduce__(self):
        # Get the parent's __reduce__ tuple
        pickled_state = super(RealisticInfoArray, self).__reduce__()
        # Create our own tuple to pass to __setstate__
        new_state = pickled_state[2] + (self.info,)
        # Return a tuple that replaces the parent's __setstate__ tuple with our own
        return (pickled_state[0], pickled_state[1], new_state)

    def __setstate__(self, state):
        self.info = state[-1]  # Set the info attribute
        # Call the parent's __setstate__ with the other tuple elements.
        super(RealisticInfoArray, self).__setstate__(state[0:-1])

Usage:

>>> obj = pick.RealisticInfoArray([1, 2, 3], info='foo')
>>> pickle_str = pickle.dumps(obj)
>>> pickle_str
"cnumpy.core.multiarray\n_reconstruct\np0\n(cpick\nRealisticInfoArray\np1\n(I0\ntp2\nS'b'\np3\ntp4\nRp5\n(I1\n(I3\ntp6\ncnumpy\ndtype\np7\n(S'i8'\np8\nI0\nI1\ntp9\nRp10\n(I3\nS'<'\np11\nNNNI-1\nI-1\nI0\ntp12\nbI00\nS'\\x01\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x02\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x03\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\np13\nS'foo'\np14\ntp15\nb."
>>> new_obj = pickle.loads(pickle_str)
>>> new_obj.info
'foo'
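The session above is from Python 2. As a sketch only, the same idea can be written for Python 3 with the zero-argument form of super(), and the round trip checked under more than one pickle protocol. The helper name roundtrip is hypothetical, and the choice of protocols to try is an assumption made for illustration.

```python
import pickle

import numpy as np


class RealisticInfoArray(np.ndarray):
    def __new__(cls, input_array, info=None):
        obj = np.asarray(input_array).view(cls)
        obj.info = info
        return obj

    def __array_finalize__(self, obj):
        if obj is None:
            return
        self.info = getattr(obj, "info", None)

    def __reduce__(self):
        # Parent tuple: (reconstruct function, its args, state for __setstate__)
        reconstruct, args, state = super().__reduce__()
        # Append our attribute to the state tuple.
        return (reconstruct, args, state + (self.info,))

    def __setstate__(self, state):
        # Last element is ours; the rest belongs to np.ndarray.
        self.info = state[-1]
        super().__setstate__(state[:-1])


def roundtrip(obj, protocol=None):
    # Hypothetical helper: pickle and immediately unpickle an object.
    return pickle.loads(pickle.dumps(obj, protocol=protocol))


obj = RealisticInfoArray([1, 2, 3], info="foo")
copies = [roundtrip(obj, p) for p in (2, pickle.HIGHEST_PROTOCOL)]
for copy in copies:
    print(type(copy).__name__, copy.tolist(), copy.info)

# A slice is a new view that went through __array_finalize__, so it
# carries info as well and pickles the same way.
sliced = roundtrip(obj[1:])
print(sliced.tolist(), sliced.info)
```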

Answer 2

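I'm the dill (and pathos) author. dill was pickling a numpy.array before numpy could do it itself. @dano's explanation is pretty accurate. Personally, I'd just use dill and let it do the job for you. With dill, you don't need __reduce__, as dill has several ways of grabbing subclassed attributes... one of which is storing the __dict__ for any class object. pickle doesn't do this, because it usually works with classes by name reference rather than storing the class object itself... so you have to work with __reduce__ to make pickle work for you. No need, in most cases, with dill.

>>> import numpy as np
>>>
>>> class RealisticInfoArray(np.ndarray):
...     def __new__(cls, input_array, info=None):
...         # Input array is an already formed ndarray instance
...         # We first cast to be our class type
...         obj = np.asarray(input_array).view(cls)
...         # add the new attribute to the created instance
...         obj.info = info
...         # Finally, we must return the newly created object:
...         return obj
...     def __array_finalize__(self, obj):
...         # see InfoArray.__array_finalize__ for comments
...         if obj is None: return
...         self.info = getattr(obj, 'info', None)
...
>>> import dill as pickle
>>> obj = RealisticInfoArray([1, 2, 3], info='foo')
>>> print(obj.info)  # 'foo'
foo
>>>
>>> pickle_str = pickle.dumps(obj)
>>> new_obj = pickle.loads(pickle_str)
>>> print(new_obj.info)
foo

dill can extend itself into pickle (essentially by copy_reg-ing everything it knows), so you can then use all dill types in anything that uses pickle. Now, if you are going to use multiprocessing, you are a bit stuck, since it uses cPickle. There is, however, the pathos fork of multiprocessing (called pathos.multiprocessing), where basically the only change is that it uses dill instead of cPickle... and thus can serialize a lot more in a Pool.map. I think (currently) that if you want to work with your subclass of a numpy.array in multiprocessing (or pathos.multiprocessing), you might have to do something like @dano suggests, but I'm not sure, as I didn't think of a good case off the top of my head to test your subclass.

If you are interested, get pathos here: https://github.com/uqfoundation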
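On the open question of whether @dano's approach carries over to multiprocessing, one way to probe it without starting worker processes is sketched below. It assumes a CPython 3 interpreter, where multiprocessing serializes objects through multiprocessing.reduction.ForkingPickler (a pickle.Pickler subclass); it does not exercise Pool.map itself, and it says nothing about the Python 2 cPickle situation described above.

```python
import pickle
from multiprocessing.reduction import ForkingPickler

import numpy as np


class RealisticInfoArray(np.ndarray):
    def __new__(cls, input_array, info=None):
        obj = np.asarray(input_array).view(cls)
        obj.info = info
        return obj

    def __array_finalize__(self, obj):
        if obj is None:
            return
        self.info = getattr(obj, "info", None)

    def __reduce__(self):
        pickled_state = super().__reduce__()
        new_state = pickled_state[2] + (self.info,)
        return (pickled_state[0], pickled_state[1], new_state)

    def __setstate__(self, state):
        self.info = state[-1]
        super().__setstate__(state[:-1])


obj = RealisticInfoArray([1, 2, 3], info="foo")

# Serialize the way multiprocessing would before sending to a worker...
payload = ForkingPickler.dumps(obj)
# ...and deserialize the way the receiving side would.
received = pickle.loads(payload)
print(type(received).__name__, received.info)
```

If the attribute survives here, the custom __reduce__ and __setstate__ pair is being honored by the pickler that multiprocessing uses; a real Pool.map run is still the definitive check.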

Answer 3

Here is a slight improvement on @dano's answer and @Gabriel's comment. Using the __dict__ attribute for serialization worked for me, even with subclasses.

def __reduce__(self):
    # Get the parent's __reduce__ tuple
    pickled_state = super(RealisticInfoArray, self).__reduce__()
    # Create our own tuple to pass to __setstate__, but append the __dict__ rather than individual members.
    new_state = pickled_state[2] + (self.__dict__,)
    # Return a tuple that replaces the parent's __setstate__ tuple with our own
    return (pickled_state[0], pickled_state[1], new_state)

def __setstate__(self, state):
    self.__dict__.update(state[-1])  # Update the internal dict from state
    # Call the parent's __setstate__ with the other tuple elements.
    super(RealisticInfoArray, self).__setstate__(state[0:-1])

Here is a complete example: https://onlinegdb.com/SJ88d5DLB
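For readers who cannot open the link, the two methods can be dropped into the class as in the following sketch. This is not the linked code; it is a Python 3 reconstruction, and the extra attribute units is a hypothetical name used to show that attributes added after construction are carried along too.

```python
import pickle

import numpy as np


class RealisticInfoArray(np.ndarray):
    def __new__(cls, input_array, info=None):
        obj = np.asarray(input_array).view(cls)
        obj.info = info
        return obj

    def __array_finalize__(self, obj):
        if obj is None:
            return
        self.info = getattr(obj, "info", None)

    def __reduce__(self):
        pickled_state = super().__reduce__()
        # Append the whole instance __dict__ instead of individual members.
        new_state = pickled_state[2] + (self.__dict__,)
        return (pickled_state[0], pickled_state[1], new_state)

    def __setstate__(self, state):
        # Restore our attributes, then hand the rest to np.ndarray.
        self.__dict__.update(state[-1])
        super().__setstate__(state[:-1])


obj = RealisticInfoArray([1.5, 2.5], info="foo")
obj.units = "m"  # hypothetical attribute, set after construction

new_obj = pickle.loads(pickle.dumps(obj))
print(new_obj.info, new_obj.units)
```

Because the whole __dict__ is stored, new attributes need no changes to __reduce__ or __setstate__, although every value in the dict must itself be picklable.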

Preserve custom attributes when pickling a subclass of a numpy array
