Question

What I'm trying to do:

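I'm currently building a subclass of the pandas DataFrame in Python 3. One feature of the class is that the user passes in their data along with the names of the columns that should be used to create a MultiIndex for the class at construction time. However, I'm struggling to find a clean way to do this.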

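Failed strategy #1

My first attempt looked something like the following, where I tried to construct the data and the index values before calling the DataFrame constructor: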

import numpy as np
import pandas as pd

class DFSubClass(pd.DataFrame):
    @property
    def _constructor(self):
        return DFSubClass

    # **kwargs stands in for the other DataFrame parameters.
    def __init__(self, data=None, col_for_multi_index=None, **kwargs):
        # Build the index and strip its source columns *before* calling
        # the DataFrame constructor.
        multi_index = ComputeMultiIndexFromColumns(data, col_for_multi_index)
        data_subset = RemoveIndexColumnsFromData(data, col_for_multi_index)

        super().__init__(data=data_subset, index=multi_index, **kwargs)

I was able to put together something that I think works for ComputeMultiIndexFromColumns():

def ComputeMultiIndexFromColumns(data = None, cols = None):
    index_values = [np.array(data[i]) for i in cols]
    index = pd.MultiIndex.from_arrays(index_values, names=cols)
    return index

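However, I couldn't come up with anything for RemoveIndexColumnsFromData() that cleanly handles all the different data types the pandas constructor accepts (i.e. numpy arrays, dicts, other DataFrames). In addition, when the input is a DataFrame, I ran into this problem, where the constructor returns all NaNs because the previous index doesn't match the new index values.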
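The all-NaN behaviour can be reproduced in a few lines. This is a minimal sketch, not the asker's code: the column names "key" and "val" are made up for illustration, and a flat index is used for brevity. It contrasts a DataFrame input, whose existing labels are aligned to the new index, with an array input, which has no labels to align.

```python
import pandas as pd

# Illustrative data; the column names "key" and "val" are made up.
df = pd.DataFrame({"key": ["x", "y"], "val": [10, 20]})
new_index = pd.Index(df["key"], name="key")

# DataFrame input: pandas aligns the existing labels (0, 1) to the new
# labels ("x", "y"); nothing matches, so every cell becomes NaN.
aligned = pd.DataFrame(df, index=new_index)

# Array input carries no labels, so the new index is simply attached.
relabeled = pd.DataFrame(df.to_numpy(), index=new_index, columns=df.columns)

print(aligned)
print(relabeled)
```

This is why a helper like RemoveIndexColumnsFromData() would have to treat DataFrame inputs differently from arrays and dicts.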

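Failed strategy #2

At this point I decided not to reinvent the wheel and to let pandas handle these problems, by calling the DataFrame constructor first and then re-indexing the data with set_index():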

class DFSubClass(pd.DataFrame):
    @property
    def _constructor(self):
        return DFSubClass

    # **kwargs stands in for the other DataFrame parameters.
    def __init__(self, data=None, col_for_multi_index=None, **kwargs):
        # Let pandas normalize the input first.
        super().__init__(data=data, **kwargs)

        multi_index = ComputeMultiIndexFromColumns(data, col_for_multi_index)

        # set_index() returns a new object, which goes back through the
        # constructor (and therefore through this __init__).
        self = self.set_index(multi_index)

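Holy infinite recursion, Batman! It turns out that set_index() calls the constructor to re-index the DataFrame, which means this function just keeps calling itself forever.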

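Where I am now

I feel a bit stuck. Going back to the first strategy seems like what I need to do, but I'm hesitant to handle all the data types myself, especially when pandas has already solved that problem. If anyone knows how I could 1) cleanly do this using functionality that already exists in pandas, or 2) use an alternative strategy to solve this problem, I'd be very grateful.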

Answer 1

The key, in the end, was to use inplace=True, so my final class definition looks like this.

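class DFSubClass(pd.DataFrame):

    @property
    def _constructor(self):
        return DFSubClass

    # **kwargs stands in for the other DataFrame parameters.
    def __init__(self, data=None, col_for_multi_index=None, **kwargs):
        super().__init__(data=data, **kwargs)

        # pandas also calls this constructor internally without
        # col_for_multi_index, so only set the index when it is given.
        if col_for_multi_index is not None:
            # With inplace=True, set_index() modifies self and returns None,
            # so the result must not be assigned back to self.
            self.set_index(col_for_multi_index, inplace=True)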

inplace=True prevents the constructor from being called and so avoids the infinite recursion problem.
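One way to check this claim is to count how often the subclass constructor runs for each kind of call. The sketch below is only an experiment under assumptions: CountingFrame, its init_calls counter, and the column names are hypothetical and not part of the class above.

```python
import pandas as pd

class CountingFrame(pd.DataFrame):
    """Hypothetical subclass that counts how often __init__ runs."""
    init_calls = 0

    @property
    def _constructor(self):
        return CountingFrame

    def __init__(self, *args, **kwargs):
        CountingFrame.init_calls += 1
        super().__init__(*args, **kwargs)

frame = CountingFrame({"a": [1, 1], "b": [1, 2], "v": [5, 6]})

# Non-inplace: set_index() builds and returns a new object.
before = CountingFrame.init_calls
copy_result = frame.set_index(["a", "b"])
calls_for_copy = CountingFrame.init_calls - before

# Inplace: set_index() mutates frame and returns None.
before = CountingFrame.init_calls
frame.set_index(["a", "b"], inplace=True)
calls_for_inplace = CountingFrame.init_calls - before

print(calls_for_copy, calls_for_inplace)
```

If the non-inplace count is at least one and the inplace count is zero, that matches the explanation above: only the copying path goes back through __init__.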

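Note that if the data object already has an index set, those columns will be dropped from the data. If you want those columns restored in the DFSubClass, you need to call reset_index(inplace=True) first. This has a drawback, however: if the index is just the default index, reset_index() will give you a new column in the DFSubClass that simply holds the values from 0 to DFSubClass.shape[0] - 1. The following code prevents that from happening: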

if not isinstance(self.index, pd.Int64Index): 
    self.reset_index(inplace=True)

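However, this also prevents the call to reset_index() if the index is any class that inherits from Int64Index, such as DatetimeIndex. I haven't found a clean way around this, so for now I just have a function that checks that self.index is a pd.Int64Index but not any of the other classes I know of that inherit from pd.Int64Index, apart from pd.RangeIndex.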
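An alternative check, sketched here as an assumption rather than as the approach used in this answer, is to test for the default index directly. The default index that the DataFrame constructor creates is a RangeIndex, and pd.Int64Index was deprecated in pandas 1.4 and removed in pandas 2.0, so the isinstance check above no longer runs on current versions. The helper name has_default_index is hypothetical.

```python
import pandas as pd

def has_default_index(frame):
    """Hypothetical helper: True if the index looks like the default 0..n-1 RangeIndex."""
    idx = frame.index
    return (
        isinstance(idx, pd.RangeIndex)
        and idx.start == 0
        and idx.step == 1
        and idx.name is None
    )

plain = pd.DataFrame({"v": [5, 6]})
dated = pd.DataFrame({"v": [5, 6]}, index=pd.date_range("2020-01-01", periods=2))

# Only move the index back into the columns when it carries real information.
if not has_default_index(dated):
    dated = dated.reset_index()

print(has_default_index(plain))
print(dated.columns.tolist())
```

Because the test targets RangeIndex specifically, a DatetimeIndex or any other meaningful index still goes through reset_index().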
