Setting a row's value based on its (great-)*grandparent row

Time: 2015-02-17 20:58:22

Tags: python pandas

I have a Pandas DataFrame that looks like this:

   SelfID  ParentID 
0       A       nan
1       B         A
2       X       nan
3       C         B
4       D         C
5       Y         X

You can see that there are chains linking back to an ultimate parent, e.g. D->C->B->A.

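I would like a separate column that holds the ultimate ancestor of each group, so that I can use groupby on each group as a whole. Rows A, B, C and D would therefore all have A in that final column.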

The only way I can think of doing this is to loop over the rows, using a dictionary to store each row's parent.
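For reference, a minimal sketch of that loop-and-dictionary idea. The helper name `ultimate_ancestor` and the column name `AncestorID` are illustrative only, and the sketch assumes that every `SelfID` is unique and that the parent links contain no cycles:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "SelfID": ["A", "B", "X", "C", "D", "Y"],
    "ParentID": [np.nan, "A", np.nan, "B", "C", "X"],
})

# Dictionary mapping each ID to its parent (NaN for a row with no parent).
parent_of = dict(zip(df["SelfID"], df["ParentID"]))

def ultimate_ancestor(node):
    # Hypothetical helper: follow parent links until a row with no parent.
    while pd.notna(parent_of[node]):
        node = parent_of[node]
    return node

# One walk up the chain per row.
df["AncestorID"] = [ultimate_ancestor(s) for s in df["SelfID"]]
print(df)
```

Each row walks its whole chain, so the cost grows with both the number of rows and the chain depth.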

Is there a better way, perhaps one that does not involve looping?

2 answers:

Answer 0: (score: 0)

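Here is an approach that still iterates, but the number of iterations is set by the maximum number of ancestor levels rather than by the number of rows in the sample. Depending on your data, this may be better. The idea is to keep joining each row to the next generation up until you reach a generation with no one in it.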

import numpy as np
import pandas as pd

ancestry = pd.DataFrame(dict(SelfID=["A","B","X","C","D","Y"], ParentID=[np.nan, "A",np.nan,"B","C","X"]))
ancestry = ancestry.set_index("SelfID")
earliest_ancestor = ancestry.rename(columns={"ParentID":"EarliestAncestorID"})

while True:
    # Join the current to the previous generation
    earliest_ancestor = pd.merge(earliest_ancestor, earliest_ancestor, left_on="EarliestAncestorID", right_index=True, how="left", suffixes=["_child", ""])
    earliest_generation = earliest_ancestor.EarliestAncestorID
    # Fillna to keep the earliest known ancestor
    earliest_ancestor = earliest_ancestor.ffill(axis=1).drop("EarliestAncestorID_child", axis=1)
    # If no one in this generation, we can stop going back
    if earliest_generation.isnull().all():
        break

ancestry = pd.concat((ancestry, earliest_ancestor), axis=1).reset_index()
# If no ancestors, self is the earliest ancestor
ancestry["EarliestAncestorID"] = ancestry["EarliestAncestorID"].fillna(ancestry["SelfID"])
print(ancestry)

This loops 3 times and gives:

  SelfID ParentID EarliestAncestorID
0      A      NaN                  A
1      B        A                  A
2      X      NaN                  X
3      C        B                  A
4      D        C                  A
5      Y        X                  X
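The same generation-by-generation idea can also be sketched with `Series.map` instead of a self-merge. This is a swapped-in variant, not the code above; it assumes that every `SelfID` is unique and that the parent links contain no cycles (a cycle would keep the loop running):

```python
import numpy as np
import pandas as pd

ancestry = pd.DataFrame({
    "SelfID": ["A", "B", "X", "C", "D", "Y"],
    "ParentID": [np.nan, "A", np.nan, "B", "C", "X"],
})

# Lookup table: SelfID -> ParentID (NaN marks a row with no parent).
parent = ancestry.set_index("SelfID")["ParentID"]

# Start from each row's parent, falling back to the row itself for roots.
current = ancestry["ParentID"].fillna(ancestry["SelfID"])

while True:
    # Step one generation up; roots map to NaN, so keep the current value there.
    step = current.map(parent).fillna(current)
    if (step == current).all():
        # Nobody moved, so every row already points at its earliest ancestor.
        break
    current = step

ancestry["EarliestAncestorID"] = current
print(ancestry)
```

As with the merge version, the number of passes depends on the depth of the longest chain rather than on the number of rows.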

Answer 1: (score: 0)

import pandas as pd
from numpy import nan

def derive_ancestor(df):
    # Map each ID to its parent (NaN for rows with no parent)
    parent_of = dict(zip(df['SelfID'], df['ParentID']))

    def ancestor(k):
        # Recurse up the chain until a row with no parent is reached
        return k if pd.isna(parent_of[k]) else ancestor(parent_of[k])

    return [ancestor(s) for s in df['SelfID']]

dframe = pd.DataFrame(data={'SelfID': ('A','B','X','C','D','Y'),
                            'ParentID': (nan,'A',nan,'B','C','X')})
dframe.insert(2, 'AncestorID', derive_ancestor(dframe))
print(dframe)
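Once an ancestor column exists, the groupby the question asks for might look like the sketch below. The `AncestorID` values are copied from the example result rather than recomputed, and the variable name `members` is illustrative:

```python
import pandas as pd

dframe = pd.DataFrame({
    "SelfID": ["A", "B", "X", "C", "D", "Y"],
    # Ancestor values copied from the example result, not recomputed here.
    "AncestorID": ["A", "A", "X", "A", "A", "X"],
})

# Collect the members of each family under its ultimate ancestor.
members = dframe.groupby("AncestorID")["SelfID"].apply(list)
print(members)
```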