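I have the following table:
My question is: how can I programmatically identify the ultimate parent?
Here are the rules, explained through examples:
The parent of id 5.0 is 51.0. ID 51.0 has no parent. Therefore, the ultimate parent of id 5.0 is 51.0.
The parent of id 6.0 is 1.0. The parent of id 1.0 is 10.0. ID 10.0 has no parent. Therefore, the ultimate parent of id 6.0 is 10.0.
ID 2.0 has no parent. Therefore, the ultimate_parent_id of 2.0 is 2.0.
There are no duplicates in the id field, and I don't know in advance how many levels of nesting there might be in the id structure.
Here is the code for this example: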
import pandas as pd
import numpy as np
original_df = pd.DataFrame({'id': pd.Series([5., 6, 2, 51, 1, 70, 10])
,'parent_id': pd.Series([51, 1, np.nan, np.nan, 10, np.nan, np.nan])})
original_df['ultimate_parent_id'] = ''
original_df
Here is what the final table should look like:
Here is the code that generates it.
final_df = pd.DataFrame({'id': pd.Series([5., 6, 2, 51, 1, 70, 10])
,'parent_id': pd.Series([51, 1, np.nan, np.nan, 10, np.nan, np.nan])})
final_df['ultimate_parent_id'] = pd.Series([51., 10, 2, 51, 10, 70, 10])
final_df
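If possible, I'd be very interested in a solution that uses a while loop as well as one that uses vectorized operations.
Answer 0 (score: 2)
Like @Vaishali's answer, this version uses a Python loop for the main operation, but np/pd operations within the data frame: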
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        'id': pd.Series([5., 6, 2, 51, 1, 70, 10]),
        'parent_id': pd.Series([51, 1, np.nan, np.nan, 10, 51, np.nan])
    }
)

def find_ultimate_parents(df):
    # Make a copy of df, using 'id' as the index so we can lookup parent ids
    df2 = df.set_index(df['id'])
    df2['nextpar'] = df2['parent_id']

    # Next-parent-2 not null - fake it for now
    np2nn = df2['nextpar'].notnull()

    while np2nn.any():
        # Lookup df2[parent-id], since the index is now by id. Get the
        # parent-id (of the parent-id), put that value in nextpar2.
        # So basically, if row B.nextpar has A, nextpar2 has (parent-of-A), or Nan.
        # Set na_action='ignore' so any Nan doesn't bother looking up, just copies
        # the Nan to the next generation.
        df2['nextpar2'] = df2['nextpar'].map(df2['parent_id'], na_action='ignore')

        # Re-evaluate who is a Nan in the nextpar2 column.
        np2nn = df2['nextpar2'].notnull()

        # Only update nextpar from nextpar2 if nextpar2 is not a Nan. Thus, stop
        # at the root.
        df2.loc[np2nn, 'nextpar'] = df2[np2nn]['nextpar2']

    # At this point, we've run out of parents to look up. df2['nextpar'] has
    # the "ultimate" parents.
    return df2['nextpar']

df['ultimate_parent_id'] = find_ultimate_parents(df)
print(df)
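The loop guard checks np2nn.any(), which is a vector op on a boolean Series. Each pass through the loop looks up the "next parent", so the number of passes is the maximum depth of any child-parent chain. The worst case is O(N), for a list like 1->2->3->4->...->n. The best case is 0, for a list with no parents.
The loop does a .map with na_action='ignore' to simply propagate the Nan values. This is O(fast-N) times the cost of an index lookup, which should be O(1).
With the nextpar2 field computed, the loop recomputes np2nn using a simple .notnull(), which is again O(fast-N).
Finally, the nextpar field is updated from nextpar2, which again should be O(fast-N).
Thus, the worst-case performance is O(slow-N * fast-N), which is N², but it is a Pandas-N², not a Python-N². The average case should be O(slow-m * fast-N), where m is the maximum tree depth in the average case, and the best case is O(fast-N), for 1 fast pass through the rows.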
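To make the pass-count claim concrete, here is a minimal sketch of a similar loop that also reports how many passes it made. It is a variant rather than the code above: it starts from id so that rows without a parent resolve to themselves, and it keeps df's own row order. The name ultimate_parents_with_passes is hypothetical, and the sketch assumes ids are unique and the parent links contain no cycles.

```python
import numpy as np
import pandas as pd


def ultimate_parents_with_passes(df):
    """Return (ultimate parent per row, number of climbing passes)."""
    # Lookup table: id -> parent_id (NaN for rows without a parent).
    parent_of = df.set_index('id')['parent_id']
    # Start every row at its own id, so roots resolve to themselves.
    current = df['id'].copy()
    passes = 0
    while True:
        # One vectorized step up the tree for every row at once.
        nxt = current.map(parent_of)
        has_parent = nxt.notna()
        if not has_parent.any():
            break
        # Move up only where a parent exists; other rows stay put.
        current = current.where(~has_parent, nxt)
        passes += 1
        # Guard for the no-cycles assumption: a valid chain is shorter than N.
        if passes > len(df):
            raise ValueError('cycle detected in parent links')
    return current, passes


df = pd.DataFrame({
    'id': [5., 6, 2, 51, 1, 70, 10],
    'parent_id': [51, 1, np.nan, np.nan, 10, 51, np.nan],
})
result, passes = ultimate_parents_with_passes(df)
print(result.tolist(), passes)

# A straight chain 1 -> 2 -> 3 -> 4 -> 5 needs one pass per link.
chain = pd.DataFrame({
    'id': [1., 2, 3, 4, 5],
    'parent_id': [2, 3, 4, 5, np.nan],
})
print(ultimate_parents_with_passes(chain)[1])
```

On the answer's data this makes 2 passes, since the longest chain is 6 -> 1 -> 10, and the five-element chain makes 4.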
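Answer 1 (score: 1)
Here is a solution using map and combine_first. First, create a dictionary from the df values to use for mapping. Then map parent_id through it to get each parent's own parent, and map id through it to get the direct parent. The first combine_first makes sure the values mapped from parent_id take priority; the final combine_first fills the remaining NaN values with id. Note that the mapping is applied twice, so this resolves chains up to two levels deep, which is enough for the example.
d = final_df.dropna().set_index('id').to_dict()
final_df['ultimate_parent_id'] = \
    final_df['parent_id'].map(d['parent_id'])\
    .combine_first(final_df['id'].map(d['parent_id']))\
    .combine_first(final_df['id'])
You get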
id parent_id ultimate_parent_id
0 5.0 51.0 51.0
1 6.0 1.0 10.0
2 2.0 NaN 2.0
3 51.0 NaN 51.0
4 1.0 10.0 10.0
5 70.0 NaN 70.0
6 10.0 NaN 10.0
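Answer 2 (score: 1)
Let's first clean up the DataFrame and get rid of the nan values. A negative number is a good substitute. Only the id and parent_id columns are kept here, because the empty-string ultimate_parent_id column from the question cannot be converted to int:
original_df = original_df[['id', 'parent_id']].fillna(-1).astype(int)
Convert the DataFrame to a dictionary:
d = original_df.set_index('id').to_dict()['parent_id']
#{1: 10, 2: -1, 51: -1, 5: 51, 6: 1, 10: -1, 70: -1}
Now you need a recursive function to translate an ID into its ultimate parent ID:
def translate(x):
    return x if d[x] == -1 else translate(d[x])
Apply the recursive function to each dictionary key, collecting the results in another DataFrame:
ultimate = pd.DataFrame(pd.Series({x: translate(x) for x in d.keys()}),
                        columns=('ultimate_parent_id', ))
Finally, combine the results with the original DataFrame (for instance, by joining them on id).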
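Two things to watch with this approach: translate recurses once per level, so a very deep chain can hit Python's recursion limit, and d[x] raises KeyError if a parent_id never appears in id. A minimal sketch of a loop-based variant follows. The name translate_iter and the final Series.map step for attaching the result to the rows are my assumptions, not part of the answer; the -1 sentinel follows the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [5., 6, 2, 51, 1, 70, 10],
    'parent_id': [51, 1, np.nan, np.nan, 10, np.nan, np.nan],
})

# -1 stands in for "no parent", as in the answer.
df = df.fillna(-1).astype(int)
d = df.set_index('id')['parent_id'].to_dict()


def translate_iter(x):
    # Walk up the chain with a loop instead of recursion.
    # Assumes every parent_id also appears in id and there are no cycles.
    while d[x] != -1:
        x = d[x]
    return x


# Attach the result to the original rows (hypothetical combining step).
df['ultimate_parent_id'] = df['id'].map(translate_iter)
print(df)
```

Because the result is computed per row from id, no separate join is needed in this variant.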
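Answer 3 (score: 0)
As an addition to @adhast's answer, the last line of the function (find_ultimate_parents(df)) should be
return df2['nextpar'].values
df2 uses df['id'] as its index, so it does not correspond to the index of df.
Below is the complete script.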
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        'id': pd.Series([5., 6, 2, 51, 1, 70, 10]),
        'parent_id': pd.Series([51, 1, np.nan, np.nan, 10, 51, np.nan])
    }
)

def find_ultimate_parents(df):
    # Make a copy of df, using 'id' as the index so we can lookup parent ids
    df2 = df.set_index(df['id'])
    df2['nextpar'] = df2['parent_id']

    # Next-parent-2 not null - fake it for now
    np2nn = df2['nextpar'].notnull()

    while np2nn.any():
        # Lookup df2[parent-id], since the index is now by id. Get the
        # parent-id (of the parent-id), put that value in nextpar2.
        # So basically, if row B.nextpar has A, nextpar2 has (parent-of-A), or Nan.
        # Set na_action='ignore' so any Nan doesn't bother looking up, just copies
        # the Nan to the next generation.
        df2['nextpar2'] = df2['nextpar'].map(df2['parent_id'], na_action='ignore')

        # Re-evaluate who is a Nan in the nextpar2 column.
        np2nn = df2['nextpar2'].notnull()

        # Only update nextpar from nextpar2 if nextpar2 is not a Nan. Thus, stop
        # at the root.
        df2.loc[np2nn, 'nextpar'] = df2[np2nn]['nextpar2']

    # At this point, we've run out of parents to look up. df2['nextpar'] has
    # the "ultimate" parents.
    return df2['nextpar'].values

df['ultimate_parent_id'] = find_ultimate_parents(df)
print(df)
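The index mismatch is easy to reproduce on its own. Here is a small sketch with made-up values (the column names by_label and by_position are only for illustration): assigning a Series to a DataFrame column aligns on index labels, while assigning its .values goes by position.

```python
import pandas as pd

df = pd.DataFrame({'id': [5., 6., 7.]})          # default index 0, 1, 2
s = pd.Series([50., 60., 70.], index=df['id'])   # indexed by id: 5.0, 6.0, 7.0

# Series assignment aligns on labels; 0, 1, 2 do not occur in s, so all NaN.
df['by_label'] = s
# A plain array carries no labels, so it is assigned row by row.
df['by_position'] = s.values
print(df)
```

This is why returning df2['nextpar'].values gives each row the value computed for it, regardless of how df2 was indexed.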