Subclassed ndarray drops information when broadcast in PySpark

Date: 2017-04-03 21:30:51

Tags: python numpy pyspark

I'm hoping someone can help me debug an issue we're seeing with subclassed ndarrays in Spark. In particular, when a subclassed array is broadcast, it seems to lose its extra information. A simple example follows:

>>> import numpy as np
>>> 
>>> class Test(np.ndarray):
...     def __new__(cls, input_array, info=None):
...         obj = np.asarray(input_array).view(cls)
...         obj.info = info
...         return obj
...     
...     def __array_finalize__(self, obj):
...         if not hasattr(self, "info"):
...             self.info = getattr(obj, 'info', None)
...         else:
...             print("has info attribute: %s" % getattr(self, 'info'))
... 
>>> test = Test(np.array([[1,2,3],[4,5,6]]), info="info")
>>> print(test.info)
info
>>> print(sc.broadcast(test).value)
[[1 2 3]
 [4 5 6]]
>>> print(sc.broadcast(test).value.info)
None
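PySpark serializes broadcast values with pickle, and a plain ndarray subclass loses custom attributes under pickling, so the loss can likely be reproduced without Spark at all. A minimal sketch (using the same Test class as above, no SparkContext needed):

```python
import pickle

import numpy as np


class Test(np.ndarray):
    def __new__(cls, input_array, info=None):
        obj = np.asarray(input_array).view(cls)
        obj.info = info
        return obj

    def __array_finalize__(self, obj):
        # Called on view/slice creation and on unpickling; at unpickle
        # time there is no source object carrying `info`, so it becomes None.
        if not hasattr(self, "info"):
            self.info = getattr(obj, 'info', None)


test = Test(np.array([[1, 2, 3], [4, 5, 6]]), info="info")
roundtripped = pickle.loads(pickle.dumps(test))
print(roundtripped.info)  # None -- the attribute did not survive pickling
```

If the plain pickle round trip already drops `info`, the broadcast behavior is just a symptom of ndarray's default pickling not knowing about the extra attribute.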

1 Answer:

Answer 0: (score: 0)

At the very least, you have a small typo: you are checking hasattr(obj, "info") when you should be checking hasattr(self, "info"). With the condition flipped, info is never carried over.

test = Test(np.array([[1,2,3],[4,5,6]]), info="info")
print(test.info)   # info
test2 = test[1:]
print(test2.info)  # info
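The slicing above works because __array_finalize__ sees the parent array, but pickling (and therefore broadcasting) goes through a different path. The NumPy subclassing docs show how to make an extra attribute survive pickling by overriding __reduce__ and __setstate__; a sketch applying that pattern to the Test class from the question:

```python
import pickle

import numpy as np


class Test(np.ndarray):
    def __new__(cls, input_array, info=None):
        obj = np.asarray(input_array).view(cls)
        obj.info = info
        return obj

    def __array_finalize__(self, obj):
        if not hasattr(self, "info"):
            self.info = getattr(obj, 'info', None)

    def __reduce__(self):
        # Append our extra attribute to ndarray's pickled state tuple.
        pickled_state = super().__reduce__()
        new_state = pickled_state[2] + (self.info,)
        return (pickled_state[0], pickled_state[1], new_state)

    def __setstate__(self, state):
        self.info = state[-1]             # restore our attribute
        super().__setstate__(state[:-1])  # let ndarray restore the rest


test = Test(np.array([[1, 2, 3], [4, 5, 6]]), info="info")
restored = pickle.loads(pickle.dumps(test))
print(restored.info)  # info
```

Since sc.broadcast pickles the value, a subclass that round-trips through pickle correctly should also keep its info when broadcast.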