Question

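I am trying to pickle an object that inherits from pandas.DataFrame. The attribute I add to the dataframe disappears during the pickling/unpickling round trip. There are some obvious workarounds, but... am I doing something wrong, or is this a bug?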

import pandas as pd
import pickle

class Foo(pd.DataFrame):
    def __init__(self,tag,df):
        super().__init__(df)
        self._tag = tag

foo = Foo('mytag', pd.DataFrame({'a':[1,2,3],'b':[4,5,6]}))
print(foo)
print(foo._tag)

print("-------------------------------------")

with open("foo.pkl", "wb") as pkl:
    pickle.dump(foo, pkl)

with open("foo.pkl", "rb") as pkl:
    foo1 = pickle.load(pkl)

print(type(foo1))
print(foo1)
print(foo1._tag)

Here is my output:

   a  b
0  1  4
1  2  5
2  3  6
mytag
-------------------------------------
<class '__main__.Foo'>
   a  b
0  1  4
1  2  5
2  3  6
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-20-1e7e89e199c8> in <module>
     21 print(type(foo1))
     22 print(foo1)
---> 23 print(foo1._tag)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   5065             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5066                 return self[name]
-> 5067             return object.__getattribute__(self, name)
   5068 
   5069     def __setattr__(self, name, value):

AttributeError: 'Foo' object has no attribute '_tag'

(Python 3.7, pandas 0.24.2, pickle.format_version 4.0)

Answer 1

I think this is a problem with how pandas handles attributes. Even a simplified attempt at inheritance fails:

class Foo(pd.DataFrame):
    def __init__(self, tag, df):
        self._tag = tag

Traceback (most recent call last):
  File "c:\Users\Michael\.vscode\extensions\ms-python.python-2019.6.24221\pythonFiles\ptvsd_launcher.py", line 43, in <module>
    main(ptvsdArgs)
  File "c:\Users\Michael\.vscode\extensions\ms-python.python-2019.6.24221\pythonFiles\lib\python\ptvsd\__main__.py", line 434, in main
    run()
  File "c:\Users\Michael\.vscode\extensions\ms-python.python-2019.6.24221\pythonFiles\lib\python\ptvsd\__main__.py", line 312, in run_file
    runpy.run_path(target, run_name='__main__')
  File "C:\Users\Michael\AppData\Local\Programs\Python\Python37-32\lib\runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "C:\Users\Michael\AppData\Local\Programs\Python\Python37-32\lib\runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "C:\Users\Michael\AppData\Local\Programs\Python\Python37-32\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "c:\Users\Michael\Desktop\sandbox\sandbox.py", line 8, in <module>
    foo = Foo('mytag', pd.DataFrame({'a':[1,2,3],'b':[4,5,6]}))
  File "c:\Users\Michael\Desktop\sandbox\sandbox.py", line 6, in __init__
    self._tag = tag
  File "c:\Users\Michael\Desktop\sandbox\venv\lib\site-packages\pandas\core\generic.py", line 5205, in __setattr__
    existing = getattr(self, name)
  File "c:\Users\Michael\Desktop\sandbox\venv\lib\site-packages\pandas\core\generic.py", line 5178, in __getattr__
    if self._info_axis._can_hold_identifiers_and_holds_name(name):
  File "c:\Users\Michael\Desktop\sandbox\venv\lib\site-packages\pandas\core\generic.py", line 5178, in __getattr__
    if self._info_axis._can_hold_identifiers_and_holds_name(name):
  File "c:\Users\Michael\Desktop\sandbox\venv\lib\site-packages\pandas\core\generic.py", line 5178, in __getattr__
    if self._info_axis._can_hold_identifiers_and_holds_name(name):
  [Previous line repeated 487 more times]
  File "c:\Users\Michael\Desktop\sandbox\venv\lib\site-packages\pandas\core\generic.py", line 489, in _info_axis
    return getattr(self, self._info_axis_name)
  File "c:\Users\Michael\Desktop\sandbox\venv\lib\site-packages\pandas\core\generic.py", line 5163, in __getattr__
    def __getattr__(self, name):
  File "c:\Users\Michael\.vscode\extensions\ms-python.python-2019.6.24221\pythonFiles\lib\python\ptvsd\_vendored\pydevd\_pydevd_bundle\pydevd_trace_dispatch_regular.py", line 362, in __call__
    is_stepping = pydev_step_cmd != -1
RecursionError: maximum recursion depth exceeded in comparison

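I think this is a consequence of their use of __getattribute__(), which raises an error when it comes across an unknown attribute. They are overriding the default __getattr__() behavior, which I am guessing interferes with inheritance.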
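
The simplified class above never calls super().__init__, so whatever internal state the base class normally sets up is missing when self._tag = tag runs. The sketch below is not pandas code. It is a toy model, with hypothetical names (Base, Broken, Working, _columns), of the pattern visible in the traceback: __setattr__ calls getattr, and __getattr__ reads internal state that only the base __init__ creates.

```python
class Base:
    """Toy stand-in for a base class with custom attribute hooks."""

    def __init__(self, columns):
        # Bootstrap internal state without going through our own __setattr__.
        object.__setattr__(self, "_columns", list(columns))

    def __getattr__(self, name):
        # Only called when normal lookup fails. If _columns was never set,
        # reading self._columns re-enters __getattr__ and never terminates.
        if name in self._columns:
            return f"column {name}"
        raise AttributeError(name)

    def __setattr__(self, name, value):
        # Probe for an existing attribute before setting, loosely like the
        # getattr call shown in the traceback.
        try:
            getattr(self, name)
        except AttributeError:
            pass
        object.__setattr__(self, name, value)


class Broken(Base):
    def __init__(self, tag):
        self._tag = tag  # Base.__init__ never ran, so _columns is missing


class Working(Base):
    def __init__(self, tag, columns):
        super().__init__(columns)  # internal state exists before we set _tag
        self._tag = tag


try:
    Broken("mytag")
except RecursionError as exc:
    print("Broken:", type(exc).__name__)

ok = Working("mytag", ["a", "b"])
print("Working:", ok._tag, ok.a)
```

If this model is right, the RecursionError comes from the skipped base initializer rather than from inheritance as such, which would also explain why the class in the question, which does call super().__init__(df), can be constructed without trouble.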

Answer 2

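Michael's answer matches what I found when I looked into it. DataFrame inherits from NDFrame, which also overrides __setattr__, so that may contribute to the problem as well.

The most straightforward solution here is to create a class that holds the dataframe as an attribute, so that you can set your own attributes on it.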

class Foo:
    def __init__(self, tag, df):
        self.df = df
        self._tag = tag

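*Aside: if the built-in pickle cannot pickle a complex object like this, I would consider trying dill. After $ pip install dill, you only need to do import dill as pickle, because its method names are the same as pickle's.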
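
As a rough check that the wrapper approach survives a pickle round trip, here is a minimal sketch. The class name TaggedFrame is made up for illustration, and a plain dict of columns stands in for the DataFrame so the snippet has no third-party dependency. A DataFrame would be stored in self.df the same way, since a plain class pickles its instance attributes by default.

```python
import pickle


class TaggedFrame:
    """Hypothetical wrapper: keeps the data and the tag side by side."""

    def __init__(self, tag, df):
        self.df = df
        self._tag = tag


# Stand-in for a DataFrame (an assumption made to keep the sketch dependency-free).
wrapped = TaggedFrame("mytag", {"a": [1, 2, 3], "b": [4, 5, 6]})

# Round trip through bytes instead of a file; pickle.dump/load behave the same.
restored = pickle.loads(pickle.dumps(wrapped))

print(type(restored).__name__)
print(restored._tag)
print(restored.df)
```

The trade-off is that DataFrame methods are no longer available directly on the object; callers have to go through the df attribute.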

Answer 3

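I posted a similar question at almost the same time, strangely enough. In the follow-up comments I found something more fundamental: metadata you define on a DataFrame subclass does not even survive a slicing operation.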

After creating the foo instance and printing foo and foo._tag, try the following:

bar = foo[1:]
print(bar)
print(bar._tag)

This also raises an AttributeError, just like your pickling operation does.

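There may be good reasons to change or even drop metadata when slicing. But you may well want to keep it. I don't know whether there is a single point in the pandas code that affects both slicing and pickling, but I suspect there is.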
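
On the question of a single point that affects both slicing and pickling: pandas documents two subclassing hooks that look relevant, a _metadata class attribute listing extra attribute names, and a _constructor property that tells pandas which class to build for results. The sketch below assumes those hooks behave as documented in the installed pandas version. It also changes the constructor signature from the question (tag becomes a keyword-only argument with a default of None), because pandas needs to be able to call the constructor with data alone. That change is a choice made here, not something taken from the thread.

```python
import pickle

import pandas as pd


class Foo(pd.DataFrame):
    # Names listed here are treated by pandas as subclass metadata: they go
    # into the pickled state and are copied onto results by __finalize__.
    _metadata = ["_tag"]

    def __init__(self, *args, tag=None, **kwargs):
        # tag is keyword-only so pandas can rebuild the object internally
        # by passing only the data.
        super().__init__(*args, **kwargs)
        self._tag = tag

    @property
    def _constructor(self):
        # Make operations such as slicing return Foo instead of DataFrame.
        return Foo


foo = Foo({"a": [1, 2, 3], "b": [4, 5, 6]}, tag="mytag")

# Pickle round trip, in memory.
restored = pickle.loads(pickle.dumps(foo))
print(type(restored).__name__, restored._tag)

# Slicing, as in the snippet above.
bar = foo[1:]
print(type(bar).__name__, bar._tag)
```

Not every operation is guaranteed to carry the metadata along (selecting a single column yields a Series, for instance), so it is worth testing the specific operations you rely on.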

Cannot unpickle a class inherited from pandas DataFrame
