I have a Pandas DataFrame that looks like this:
SelfID ParentID
0 A nan
1 B A
2 X nan
3 C B
4 D C
5 Y X
You can see that there are chains linking back to an ultimate parent, for example D->C->B->A.
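I would like a separate column that contains the ultimate ancestor of each group, so that I can use groupby on each group as a whole. So the A, B, C and D rows would all have A in that last column.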
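The only way I can think of to do this is to loop over the rows, using a dictionary to store each row's parent.
Is there a better way, perhaps one that does not involve a loop?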
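For reference, the loop-and-dictionary approach described above might look roughly like the sketch below. It assumes every ParentID also appears as a SelfID and that the chains contain no cycles; the column name `UltimateAncestor` is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "SelfID": ["A", "B", "X", "C", "D", "Y"],
    "ParentID": [np.nan, "A", np.nan, "B", "C", "X"],
})

# Dictionary from each SelfID to its ParentID (NaN marks a root row)
parent_of = dict(zip(df["SelfID"], df["ParentID"]))

ultimate = []
for self_id in df["SelfID"]:
    current = self_id
    # Follow parent links until reaching a row without a parent
    while pd.notna(parent_of[current]):
        current = parent_of[current]
    ultimate.append(current)

# "UltimateAncestor" is a hypothetical column name
df["UltimateAncestor"] = ultimate
```

This visits every row and walks its whole chain, which is the per-row looping I would like to avoid if possible.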
Answer 0: (score: 0)
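Here is an approach that still iterates, but the number of iterations is determined by the maximum number of ancestor levels rather than by the number of items in the sample. Depending on your data, this may be better. The idea is to keep joining each row to the next generation back until you reach a generation with nobody in it.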
import numpy as np
import pandas as pd

ancestry = pd.DataFrame(dict(SelfID=["A", "B", "X", "C", "D", "Y"],
                             ParentID=[np.nan, "A", np.nan, "B", "C", "X"]))
ancestry = ancestry.set_index("SelfID")
earliest_ancestor = ancestry.rename(columns={"ParentID": "EarliestAncestorID"})
while True:
    # Join the current to the previous generation
    earliest_ancestor = pd.merge(earliest_ancestor, earliest_ancestor,
                                 left_on="EarliestAncestorID", right_index=True,
                                 how="left", suffixes=("_child", ""))
    earliest_generation = earliest_ancestor.EarliestAncestorID
    # Forward-fill across columns to keep the earliest known ancestor
    earliest_ancestor = earliest_ancestor.ffill(axis=1).drop("EarliestAncestorID_child", axis=1)
    # If no one in this generation, we can stop going back
    if earliest_generation.isnull().all():
        break
ancestry = pd.concat((ancestry, earliest_ancestor), axis=1).reset_index()
# If no ancestors, self is the earliest ancestor
ancestry.loc[:, "EarliestAncestorID"] = ancestry.EarliestAncestorID.where(ancestry.EarliestAncestorID.notnull(), ancestry.SelfID)
print(ancestry)
This loops 3 times and gives:
SelfID ParentID EarliestAncestorID
0 A NaN A
1 B A A
2 X NaN X
3 C B A
4 D C A
5 Y X X
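The same generation-by-generation idea can also be sketched with `Series.map` instead of a self-merge. This is a variant, not the code above: roots are made to point at themselves so the lookup reaches a fixed point, and the loop is capped at the number of rows as a guard against cyclic data. It assumes SelfID values are unique and that every ParentID appears as a SelfID.

```python
import numpy as np
import pandas as pd

ancestry = pd.DataFrame({
    "SelfID": ["A", "B", "X", "C", "D", "Y"],
    "ParentID": [np.nan, "A", np.nan, "B", "C", "X"],
})

# Lookup from each SelfID to its parent; roots point to themselves
parent = ancestry.set_index("SelfID")["ParentID"]
parent = parent.fillna(parent.index.to_series())

current = ancestry["SelfID"]
# A chain can never be longer than the number of rows
for _ in range(len(ancestry)):
    # Move every row one generation up in a single vectorized step
    step = current.map(parent)
    if step.equals(current):
        break  # nothing moved, so every row is at its root
    current = step

ancestry["EarliestAncestorID"] = current
```

As with the merge version, the number of passes depends on the depth of the deepest chain rather than on the number of rows.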
Answer 1: (score: 0)
import pandas as pd
from numpy import nan

def derive_ancestor(df):
    # Map each SelfID to its ParentID (NaN for a root)
    parent_of = dict(zip(df['SelfID'], df['ParentID']))

    def ancestor(k):
        # Walk up recursively until a row without a parent is reached
        return k if pd.isna(parent_of[k]) else ancestor(parent_of[k])

    return [ancestor(s) for s in df['SelfID']]

dframe = pd.DataFrame(data={'SelfID': ('A', 'B', 'X', 'C', 'D', 'Y'),
                            'ParentID': (nan, 'A', nan, 'B', 'C', 'X')})
dframe.insert(2, 'AncestorID', derive_ancestor(dframe))
print(dframe)
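Once the ancestor column exists, the grouping the question asks for is an ordinary groupby. A minimal sketch follows; the `Value` column and its numbers are made up purely for illustration and are not part of the original data.

```python
import pandas as pd

# Same rows as above with the ancestor column already filled in, plus a
# made-up numeric column ("Value") added only for this illustration
dframe = pd.DataFrame({
    "SelfID": ["A", "B", "X", "C", "D", "Y"],
    "AncestorID": ["A", "A", "X", "A", "A", "X"],
    "Value": [1, 2, 3, 4, 5, 6],
})

# One group per ultimate ancestor
totals = dframe.groupby("AncestorID")["Value"].sum()
members = dframe.groupby("AncestorID")["SelfID"].apply(list)
```

Here `totals` holds one sum per family and `members` lists the SelfID values that belong to each ancestor.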