How to identify the ultimate parent from a nested table structure using Python?

Date: 2017-08-19 00:00:32

Tags: python pandas

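I have the following table (reproduced by the code further down):

My question is: how can I programmatically identify the ultimate parent?

Here are the rules, explained by example:

  • The parent of id 5.0 is 51.0. id 51.0 has no parent. Therefore, the ultimate parent of id 5.0 is 51.0.
  • The parent of id 6.0 is 1.0. The parent of id 1.0 is 10.0. id 10.0 has no parent. Therefore, the ultimate parent of id 6.0 is 10.0.
  • id 2.0 has no parent. Therefore, the ultimate parent_id of 2.0 is 2.0.

There are no duplicates in the id field, and I don't know in advance how many levels of nesting the id structure may contain.

Here is the code for this example: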

import pandas as pd
import numpy as np

original_df = pd.DataFrame({'id': pd.Series([5., 6, 2, 51, 1, 70, 10])
              ,'parent_id': pd.Series([51, 1, np.nan, np.nan, 10, np.nan, np.nan])})
original_df['ultimate_parent_id'] = ''
original_df

Here is what the final table should look like, followed by the code that generates it:

final_df = pd.DataFrame({'id': pd.Series([5., 6, 2, 51, 1, 70, 10])
              ,'parent_id': pd.Series([51, 1, np.nan, np.nan, 10, np.nan, np.nan])})
final_df['ultimate_parent_id'] = pd.Series([51., 10, 2, 51, 10, 70, 10])
final_df

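If possible, I would be very interested in both a solution that uses a while loop and a solution that uses vectorized operations.

4 Answers:

Answer 0 (score: 2)

Like @Vaishali's answer, here is a version that uses a Python loop for the main operation, but np/pd operations within the dataframe: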

import pandas as pd
import numpy as np

df = pd.DataFrame(
        { 'id': pd.Series([5., 6, 2, 51, 1, 70, 10]),
        'parent_id': pd.Series([51, 1, np.nan, np.nan, 10, 51, np.nan])
        }
    )

def find_ultimate_parents(df):
    # Make a copy of df, using 'id' as the index so we can lookup parent ids
    df2 = df.set_index(df['id'])
    df2['nextpar'] = df2['parent_id']

    # Next-parent-2 not null - fake it for now
    np2nn = df2['nextpar'].notnull()

    while np2nn.any():
        # Lookup df2[parent-id], since the index is now by id. Get the
        # parent-id (of the parent-id), put that value in nextpar2.
        # So basically, if row B.nextpar has A, nextpar2 has (parent-of-A), or Nan.

        # Set na_action='ignore' so any Nan doesn't bother looking up, just copies
        # the Nan to the next generation.
        df2['nextpar2'] = df2['nextpar'].map(df2['parent_id'], na_action='ignore')

        # Re-evaluate who is a Nan in the nextpar2 column.
        np2nn = df2['nextpar2'].notnull()

        # Only update nextpar from nextpar2 if nextpar2 is not a Nan. Thus, stop
        # at the root.
        df2.loc[np2nn, 'nextpar'] = df2[np2nn]['nextpar2']

    # At this point, we've run out of parents to look up. df2['nextpar'] has
    # the "ultimate" parents.

    return df2['nextpar']


df['ultimate_parent_id'] = find_ultimate_parents(df)
print(df)

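The loop guard checks np2nn.any(), which is a vector op on a boolean Series. Each pass through the loop looks up the "next parent", so the number of passes equals the maximum depth of any child-parent chain. The worst case is O(N) passes, for a list like 1->2->3->4->...->n. The best case is 0 passes, for a list with no parents.

Inside the loop, .map with na_action='ignore' simply propagates the NaN values. This is O(fast-N) times the cost of an index lookup, which should be O(1).

With the nextpar2 field computed, the loop recomputes np2nn using a simple .notnull(), which is again O(fast-N).

Finally, the nextpar field is updated from nextpar2, which again should be O(fast-N).

Thus, worst-case performance is O(slow-N * fast-N), which is N², but it is a Pandas-N², not a Python-N². The average case should be O(slow-m * fast-N), where m is the maximum tree depth for the average case, and the best case is O(fast-N), for one fast pass through the rows.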
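To make the "number of passes equals the maximum chain depth" claim concrete, here is a minimal sketch that counts how often the loop body runs. The helper names count_passes and chain are made up for this illustration and are not part of the answer above; the loop itself mirrors the same map / notnull / update steps.

```python
import numpy as np
import pandas as pd

def count_passes(df):
    # Hypothetical helper: same loop as above, but it only counts the passes.
    df2 = df.set_index('id')
    nextpar = df2['parent_id']
    has_next = nextpar.notnull()
    passes = 0
    while has_next.any():
        passes += 1
        # Look up the parent of the current "next parent"; NaN stays NaN.
        nextpar2 = nextpar.map(df2['parent_id'], na_action='ignore')
        has_next = nextpar2.notnull()
        # Move up one level only where a further parent exists.
        nextpar = nextpar2.where(has_next, nextpar)
    return passes

def chain(n):
    # Hypothetical worst case: 1 -> 2 -> ... -> n, where n is the only root.
    ids = np.arange(1, n + 1, dtype=float)
    parents = np.append(ids[1:], np.nan)
    return pd.DataFrame({'id': ids, 'parent_id': parents})

print(count_passes(chain(4)))
print(count_passes(chain(1)))
```

On a chain of 4 ids the loop body runs 3 times, and on a table with no parents it does not run at all, which matches the worst-case and best-case descriptions above.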

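Answer 1 (score: 1)

Here is a solution using map and combine_first. First, create a dictionary from the df values for mapping. Then use map on parent_id to map those values first, and use map again to map the values on id. combine_first ensures that the values mapped from parent_id take priority. A final combine_first fills the remaining NaN values with id.

d = final_df.dropna().set_index('id').to_dict()
final_df['ultimate_parent_id'] = (
    final_df['parent_id'].map(d['parent_id'])
    .combine_first(final_df['id'].map(d['parent_id']))
    .combine_first(final_df['id'])
)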

This gives:

    id      parent_id   ultimate_parent_id
0   5.0     51.0        51.0
1   6.0     1.0         10.0
2   2.0     NaN         2.0
3   51.0    NaN         51.0
4   1.0     10.0        10.0
5   70.0    NaN         70.0
6   10.0    NaN         10.0
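Note that two map steps resolve chains of at most two levels, which is enough for the sample data. Since the question says the nesting depth is not known in advance, the following is a rough sketch of how the same map / combine_first idea could be repeated until nothing changes. The function name ultimate_parent_by_repeated_map is made up for this illustration, and the sketch assumes the data contains no cycles.

```python
import numpy as np
import pandas as pd

def ultimate_parent_by_repeated_map(df):
    # child id -> parent id, built only from rows that actually have a parent
    parent_of = df.dropna(subset=['parent_id']).set_index('id')['parent_id']
    # Start from the direct parent, or from the id itself for root rows.
    current = df['parent_id'].combine_first(df['id'])
    while True:
        # Step one level up; values with no parent map to NaN and keep
        # their current value through combine_first.
        stepped = current.map(parent_of).combine_first(current)
        if stepped.equals(current):
            return current.rename('ultimate_parent_id')
        current = stepped

# Hypothetical three-level chain: 1 -> 2 -> 3 -> 4
chain_df = pd.DataFrame({'id': [1., 2, 3, 4],
                         'parent_id': [2., 3, 4, np.nan]})
print(ultimate_parent_by_repeated_map(chain_df))
```

Every pass is a vectorized lookup over the whole column, so the Python-level loop runs only about as many times as the deepest chain is long.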

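Answer 2 (score: 1)

Let's first clean up the DataFrame and get rid of the nan values. A negative number is a good substitute:

original_df = original_df.fillna(-1).astype(int)

Convert the DataFrame into a dictionary:

d = original_df.set_index('id').to_dict()['parent_id']
#{1: 10, 2: -1, 51: -1, 5: 51, 6: 1, 10: -1, 70: -1}

Now you need a recursive function to translate an ID into the ultimate parent ID:

def translate(x):
    return x if d[x] == -1 else translate(d[x])

Apply the recursive function to each dictionary key and collect the results into another DataFrame:

ultimate = pd.DataFrame(pd.Series({x: translate(x) for x in d.keys()}), 
                 columns=('ultimate_parent_id', ))

Finally, combine the result with the original DataFrame.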
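The steps of this answer can be put together roughly as follows. This is only a sketch: it assumes the DataFrame holds just the numeric id and parent_id columns (an empty-string column would break astype(int)), and the final merge call is one possible way to do the combining step, not necessarily the one the answer had in mind. The names clean_df and merged are made up here.

```python
import numpy as np
import pandas as pd

# Assumption: only the numeric id and parent_id columns are present.
original_df = pd.DataFrame({
    'id': [5., 6, 2, 51, 1, 70, 10],
    'parent_id': [51, 1, np.nan, np.nan, 10, np.nan, np.nan],
})

# Replace NaN with the -1 sentinel so every value can be stored as an int.
clean_df = original_df.fillna(-1).astype(int)

# child id -> parent id (-1 for rows without a parent)
d = clean_df.set_index('id')['parent_id'].to_dict()

def translate(x):
    # Follow the parent links until a row without a parent is reached.
    return x if d[x] == -1 else translate(d[x])

ultimate = pd.Series({x: translate(x) for x in d}, name='ultimate_parent_id')

# One possible combining step (an assumption): match each row's id against
# the index of `ultimate`.
merged = clean_df.merge(ultimate.to_frame(), left_on='id', right_index=True)
print(merged)
```

For very deep chains the recursive translate could run into Python's recursion limit; an iterative loop over the dictionary would avoid that.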

Answer 3 (score: 0)

In addition to @adhast's answer: the last line of the function find_ultimate_parents(df) should be

return df2['nextpar'].values

df2 uses df['id'] as its index, so it does not line up with the index of df.
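The reason this matters is that assigning a Series to a DataFrame column aligns on index labels rather than on position. A small sketch with made-up values (the column names by_label and by_position are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'id': [50., 60., 20.]})   # default row labels 0, 1, 2

# A result indexed by id, like the Series returned from df2['nextpar'].
result = pd.Series([500., 600., 200.], index=pd.Index(df['id']))

df['by_label'] = result            # aligned on labels: 50/60/20 vs 0/1/2
df['by_position'] = result.values  # plain array: assigned row by row

print(df)
```

Here none of the id labels match a row label, so by_label is all NaN, while by_position carries the values over in row order. Where an id happens to coincide with a row label, a value can end up on the wrong row without any error being raised.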

Here is the full script.

import pandas as pd
import numpy as np

df = pd.DataFrame(
    { 'id': pd.Series([5., 6, 2, 51, 1, 70, 10]),
    'parent_id': pd.Series([51, 1, np.nan, np.nan, 10, 51, np.nan])
    }
)

def find_ultimate_parents(df):
    # Make a copy of df, using 'id' as the index so we can lookup parent ids
    df2 = df.set_index(df['id'])
    df2['nextpar'] = df2['parent_id']

    # Next-parent-2 not null - fake it for now
    np2nn = df2['nextpar'].notnull()

    while np2nn.any():
        # Lookup df2[parent-id], since the index is now by id. Get the
        # parent-id (of the parent-id), put that value in nextpar2.
        # So basically, if row B.nextpar has A, nextpar2 has (parent-of-A), or Nan.

        # Set na_action='ignore' so any Nan doesn't bother looking up, just copies
        # the Nan to the next generation.
        df2['nextpar2'] = df2['nextpar'].map(df2['parent_id'], na_action='ignore')

        # Re-evaluate who is a Nan in the nextpar2 column.
        np2nn = df2['nextpar2'].notnull()

        # Only update nextpar from nextpar2 if nextpar2 is not a Nan. Thus, stop
        # at the root.
        df2.loc[np2nn, 'nextpar'] = df2[np2nn]['nextpar2']

    # At this point, we've run out of parents to look up. df2['nextpar'] has
    # the "ultimate" parents.

    return df2['nextpar'].values


df['ultimate_parent_id'] = find_ultimate_parents(df)
print(df)
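One remaining difference from the table in the question: the function leaves NaN for rows that have no parent, whereas the question asks for such rows to get their own id. A possible follow-up step, sketched here on a stand-in DataFrame holding the values the script above produces for its sample data, is to fill those gaps from the id column:

```python
import numpy as np
import pandas as pd

# Stand-in for the script's result: rows without a parent hold NaN.
df = pd.DataFrame({
    'id': [5., 6, 2, 51, 1, 70, 10],
    'ultimate_parent_id': [51., 10, np.nan, np.nan, 10, 51, np.nan],
})

# Rows with no parent are their own ultimate parent.
df['ultimate_parent_id'] = df['ultimate_parent_id'].fillna(df['id'])
print(df)
```

fillna with a Series aligns on the row labels, which works here because both columns belong to the same DataFrame.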