Question

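I have the following table:

My question is: how can I programmatically identify the ultimate parent?

Here are the rules, explained by example:

id 5.0 has parent 51.0. id 51.0 has no parent. Therefore, the ultimate parent of id 5.0 is 51.0.
id 6.0 has parent 1.0. id 1.0 has parent 10.0. id 10.0 has no parent. Therefore, the ultimate parent of id 6.0 is 10.0.
id 2.0 has no parent. Therefore, the ultimate parent_id of 2.0 is 2.0.

There are no duplicates in the id field, and I do not know in advance how many levels of nesting the id structure may contain.

Here is the code for this example: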

import pandas as pd
import numpy as np

original_df = pd.DataFrame({'id': pd.Series([5., 6, 2, 51, 1, 70, 10])
              ,'parent_id': pd.Series([51, 1, np.nan, np.nan, 10, np.nan, np.nan])})
original_df['ultimate_parent_id'] = ''
original_df

Here is what the final table should look like:

Here is the code that generates it.

final_df = pd.DataFrame({'id': pd.Series([5., 6, 2, 51, 1, 70, 10])
              ,'parent_id': pd.Series([51, 1, np.nan, np.nan, 10, np.nan, np.nan])})
final_df['ultimate_parent_id'] = pd.Series([51., 10, 2, 51, 10, 70, 10])
final_df

If possible, I would be very interested in both a solution that uses a while loop and a solution that uses vectorized operations.

Answer 1

In the same spirit as @Vaishali's answer, here is a version that uses a Python loop for the main operation but uses np/pd operations within the data frame:

import pandas as pd
import numpy as np

df = pd.DataFrame(
        { 'id': pd.Series([5., 6, 2, 51, 1, 70, 10]),
        'parent_id': pd.Series([51, 1, np.nan, np.nan, 10, 51, np.nan])
        }
    )

def find_ultimate_parents(df):
    # Make a copy of df, using 'id' as the index so we can lookup parent ids
    df2 = df.set_index(df['id'])
    df2['nextpar'] = df2['parent_id']

    # Next-parent-2 not null - fake it for now
    np2nn = df2['nextpar'].notnull()

    while np2nn.any():
        # Lookup df2[parent-id], since the index is now by id. Get the
        # parent-id (of the parent-id), put that value in nextpar2.
        # So basically, if row B.nextpar has A, nextpar2 has (parent-of-A), or Nan.

        # Set na_action='ignore' so any Nan doesn't bother looking up, just copies
        # the Nan to the next generation.
        df2['nextpar2'] = df2['nextpar'].map(df2['parent_id'], na_action='ignore')

        # Re-evaluate who is a Nan in the nextpar2 column.
        np2nn = df2['nextpar2'].notnull()

        # Only update nextpar from nextpar2 if nextpar2 is not a Nan. Thus, stop
        # at the root.
        df2.loc[np2nn, 'nextpar'] = df2[np2nn]['nextpar2']

    # At this point, we've run out of parents to look up. df2['nextpar'] has
    # the "ultimate" parents.

    return df2['nextpar']


df['ultimate_parent_id'] = find_ultimate_parents(df)
print(df)

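The loop guard checks np2nn.any(), which is a vector op on a boolean Series. Each pass through the loop looks up the "next parent", so the number of passes is the maximum depth of any child-parent chain. The worst case is O(N), for a list like 1->2->3->4->...->n. The best case is 0, for a list with no parents.

The loop does a .map with na_action='ignore' to simply propagate NaN values. This is O(fast-N) times the cost of an index lookup, which should be O(1).

With the nextpar2 field computed, the loop recomputes np2nn using a simple .notnull(), which is again O(fast-N).

Finally, the nextpar field is updated from nextpar2, which should again be O(fast-N).

Thus, the worst-case performance is O(slow-N * fast-N), which is N², but it is a Pandas-N², not a Python-N². The average case should be O(slow-m * fast-N), where m is the maximum tree depth for the average case, and the best case is O(fast-N) for 1 fast pass through the rows.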
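
To make the "number of passes equals chain depth" claim concrete, here is a minimal sketch that counts loop passes. The helper names count_passes and chain are made up for this illustration and are not part of the answer above; the loop body follows the same map / notnull / update steps.

```python
import numpy as np
import pandas as pd

def count_passes(parent: pd.Series) -> int:
    """parent is indexed by id and holds each id's parent (NaN for roots).
    Hypothetical helper: returns how many times the while loop body runs."""
    nextpar = parent.copy()
    pending = nextpar.notnull()
    passes = 0
    while pending.any():
        passes += 1
        # Look up the parent of the current "next parent"; NaN stays NaN.
        nextpar2 = nextpar.map(parent, na_action='ignore')
        pending = nextpar2.notnull()
        # Only move up where a further parent exists, so roots are kept.
        nextpar[pending] = nextpar2[pending]
    return passes

def chain(n: int) -> pd.Series:
    """Hypothetical helper: builds the chain 1 -> 2 -> ... -> n (n is the root)."""
    ids = np.arange(1, n + 1, dtype=float)
    parents = np.append(ids[1:], np.nan)
    return pd.Series(parents, index=ids)

for n in (2, 4, 8):
    # For this chain the pass count grows linearly with n.
    print(n, count_passes(chain(n)))
```

Each pass is a handful of vectorized operations over all rows, so the slow Python-level loop only runs once per level of nesting.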

Answer 2

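Here is a solution using map and combine_first. First create a dictionary from the df values for mapping. Then use map on parent_id to map those values, and use map again to map the values onto id. combine_first makes sure that the values mapped from parent_id take precedence. Finally, combine_first with id fills in the remaining NaN values.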

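# Rows without a parent are dropped, so d['parent_id'] maps id -> parent_id
d = final_df.dropna().set_index('id').to_dict()
# Parentheses let the chained calls span several lines
final_df['ultimate_parent_id'] = (
    final_df['parent_id'].map(d['parent_id'])
    .combine_first(final_df['id'].map(d['parent_id']))
    .combine_first(final_df['id'])
)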

You get

    id      parent_id   ultimate_parent_id
0   5.0     51.0        51.0
1   6.0     1.0         10.0
2   2.0     NaN         2.0
3   51.0    NaN         51.0
4   1.0     10.0        10.0
5   70.0    NaN         70.0
6   10.0    NaN         10.0
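
The chain above applies map a fixed number of times, so it appears to resolve at most two levels of nesting, which is enough for the sample data. As a rough sketch of how the same map / combine_first idea could be repeated until nothing changes, assuming the data contains no cycles (the extra row with id 7 and the names parent_of, current, step and nxt are made up for this illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id':        [5., 6, 2, 51, 1, 70, 10, 7],
    # Extra illustrative row: 7 -> 6 -> 1 -> 10 is three levels deep
    'parent_id': [51, 1, np.nan, np.nan, 10, np.nan, np.nan, 6],
})

# id -> parent_id lookup for rows that actually have a parent
parent_of = df.dropna(subset=['parent_id']).set_index('id')['parent_id']

# Start from each row's own id and keep stepping up one level at a time
current = df['id'].copy()
while True:
    step = current.map(parent_of)        # NaN where current is already a root
    nxt = step.combine_first(current)    # keep the root where there is no parent
    if nxt.equals(current):              # nothing moved: every row is at its root
        break
    current = nxt

df['ultimate_parent_id'] = current
print(df)
```

Starting from id rather than parent_id means rows without a parent end up pointing at themselves, as the question asks.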

Answer 3

Let's first clean up the DataFrame and get rid of the nan values. A negative number is a good substitute:

original_df = original_df.fillna(-1).astype(int)

Convert the DataFrame into a dictionary:

d = original_df.set_index('id').to_dict()['parent_id']
#{1: 10, 2: -1, 51: -1, 5: 51, 6: 1, 10: -1, 70: -1}

Now you need a recursive function to translate an ID into the ultimate parent ID:

def translate(x):
    return x if d[x] == -1 else translate(d[x])

Apply the recursive function to every dictionary key and collect the results in another DataFrame:

ultimate = pd.DataFrame(pd.Series({x: translate(x) for x in d.keys()}), 
                 columns=('ultimate_parent_id', ))

Combine the results with the original DataFrame:
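
The code for the combining step is not shown here. One plausible way to do it, offered as an assumption rather than the answer's own code, is a merge on id. The sketch below is self-contained: it rebuilds original_df without the empty ultimate_parent_id column, since astype(int) would not accept empty strings, and it uses a named Series instead of a one-column DataFrame.

```python
import numpy as np
import pandas as pd

original_df = pd.DataFrame({
    'id':        [5., 6, 2, 51, 1, 70, 10],
    'parent_id': [51, 1, np.nan, np.nan, 10, np.nan, np.nan],
})

# Replace NaN with -1 as the "no parent" marker, then build the id -> parent dict
original_df = original_df.fillna(-1).astype(int)
d = original_df.set_index('id').to_dict()['parent_id']

def translate(x):
    # Follow parent links until a row with no parent (-1) is reached
    return x if d[x] == -1 else translate(d[x])

# Series indexed by id; the name becomes the column name after the merge
ultimate = pd.Series({x: translate(x) for x in d}, name='ultimate_parent_id')

# Assumed final step: match each row's id against the index of `ultimate`
result = original_df.merge(ultimate, left_on='id', right_index=True)
print(result)
```

Because translate is recursive, very deep parent chains could run into Python's default recursion limit; an explicit loop would avoid that.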

Answer 4

As an addition to @adhast's answer, the last line of the function (find_ultimate_parents(df)) should be

return df2['nextpar'].values

df2 uses df['id'] as its index, so its index does not correspond to the index of df.
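
A small sketch of why this matters: when a Series is assigned to a DataFrame column, pandas matches values by index label rather than by row position. The Series ult below is a stand-in for the function's id-indexed return value, and the names by_label and by_position are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'id': [5., 6, 2, 51, 1, 70, 10]})  # default index 0..6

# Stand-in for a result that is indexed by id instead of by row number
ult = pd.Series([51., 10, 2, 51, 10, 70, 10], index=df['id'].to_numpy())

by_label = df.copy()
by_label['ultimate_parent_id'] = ult          # aligned on index labels

by_position = df.copy()
by_position['ultimate_parent_id'] = ult.values  # plain array: row-by-row

print(by_label)
print(by_position)
```

Only the .values version keeps each result on the row it was computed for.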

Below is the full script.

import pandas as pd
import numpy as np

df = pd.DataFrame(
    { 'id': pd.Series([5., 6, 2, 51, 1, 70, 10]),
    'parent_id': pd.Series([51, 1, np.nan, np.nan, 10, 51, np.nan])
    }
)

def find_ultimate_parents(df):
    # Make a copy of df, using 'id' as the index so we can lookup parent ids
    df2 = df.set_index(df['id'])
    df2['nextpar'] = df2['parent_id']

    # Next-parent-2 not null - fake it for now
    np2nn = df2['nextpar'].notnull()

    while np2nn.any():
        # Lookup df2[parent-id], since the index is now by id. Get the
        # parent-id (of the parent-id), put that value in nextpar2.
        # So basically, if row B.nextpar has A, nextpar2 has (parent-of-A), or Nan.

        # Set na_action='ignore' so any Nan doesn't bother looking up, just copies
        # the Nan to the next generation.
        df2['nextpar2'] = df2['nextpar'].map(df2['parent_id'], na_action='ignore')

        # Re-evaluate who is a Nan in the nextpar2 column.
        np2nn = df2['nextpar2'].notnull()

        # Only update nextpar from nextpar2 if nextpar2 is not a Nan. Thus, stop
        # at the root.
        df2.loc[np2nn, 'nextpar'] = df2[np2nn]['nextpar2']

    # At this point, we've run out of parents to look up. df2['nextpar'] has
    # the "ultimate" parents.

    return df2['nextpar'].values


df['ultimate_parent_id'] = find_ultimate_parents(df)
print(df)
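
One thing to note: the function only overwrites nextpar where a further parent exists, so rows that have no parent keep NaN, whereas the question expects such rows to list their own id. A minimal sketch of one way to close that gap, assuming a fillna with the id column is acceptable (this step is not part of the answers above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id':        [5., 6, 2, 51, 1, 70, 10],
    'parent_id': [51, 1, np.nan, np.nan, 10, 51, np.nan],
})

def find_ultimate_parents(df):
    # Same steps as the script above, condensed
    df2 = df.set_index(df['id'])
    df2['nextpar'] = df2['parent_id']
    np2nn = df2['nextpar'].notnull()
    while np2nn.any():
        df2['nextpar2'] = df2['nextpar'].map(df2['parent_id'], na_action='ignore')
        np2nn = df2['nextpar2'].notnull()
        df2.loc[np2nn, 'nextpar'] = df2.loc[np2nn, 'nextpar2']
    return df2['nextpar'].values

df['ultimate_parent_id'] = find_ultimate_parents(df)
# Assumed extra step: rows with no parent are their own ultimate parent
df['ultimate_parent_id'] = df['ultimate_parent_id'].fillna(df['id'])
print(df)
```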

How can I identify the ultimate parent in a nested table structure using Python?
