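How to optimize recursive function calls and inner loops in pandas?

Time: 2019-06-16 08:02:12

Tags: python pandas tree

I want to find all the parents of a particular child in a DataFrame. My current code takes more than 20 seconds to run on a dataset of 3000 data points. I think this is because I use recursive function calls and loops. Can you help me optimize the program?

I look up the parent of a child node, record it, and then treat that parent as the child. I then recursively find its parents, and so on, until all parents are exhausted.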

import pandas as pd

df = pd.DataFrame(
    {
        "parent_name": ["Car", "Tyre", "Tyre", "Rubber", "Nylon", "Nylon", "Trees", "Trees"],
        "child_name": ["Tyre", "Rubber", "Nylon", "Trees", "Chemicals", "Man-made", "Leaves", "Stems"],
    }
)

Using this, I define a function to find all the parent nodes:

def get_parent_list(node_id):
    list_of_parents = []

    # Recursively collect the parent_name of every row whose child_name matches
    def find_parent(node_id):
        parent_names = df.loc[df["child_name"].isin([node_id]), "parent_name"]
        for parent_name in parent_names:
            list_of_parents.append(parent_name)
            find_parent(parent_name)

    find_parent(node_id)
    return list_of_parents


df["list_of_parents"] = df["child_name"].apply(get_parent_list)

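I store the resulting output as a separate column in the DataFrame.

After that, I search the DataFrame for the user's input and display the corresponding list-of-parents column as output.

Expected output:

If the user enters "Trees" as input

Output: Trees: Rubber, Tyre, Car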

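1 answer:

Answer 0: (score: 1)

The most natural choice here is a tree data structure, which gives linear query time (proportional to the depth of the node being looked up). That said, I am surprised your approach is this slow, since 3000 data points is not a large dataset.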

import pandas as pd
from treelib import Tree

df = pd.DataFrame(
    {
        "parent_name":
            ["Car", "Tyre", "Tyre", "Rubber", "Nylon", "Nylon", "Trees", "Trees"],
        "child_name": ["Tyre", "Rubber", "Nylon", "Trees", "Chemicals", "Man-made", "Leaves", "Stems"]
    }
)

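tree = Tree()
# Create the root first; every later row's parent must already be in the tree
tree.create_node(df["parent_name"][0], df["parent_name"][0])  # root
for i, row in df.iterrows():
    # create_node(tag, identifier, parent=...): the name serves as both tag and id
    tree.create_node(row["child_name"], row["child_name"], parent=row["parent_name"])
tree.show()

def find_parents(child_name):
    child = tree[child_name]
    parent_names = []
    # Follow the back-pointer upwards until the root (which has no parent) is reached
    while child.bpointer is not None:
        parent = tree[child.bpointer]
        parent_names.append(parent.identifier)
        child = parent

    return parent_names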


print(find_parents("Trees"))
df["list_of_parents"] = df["child_name"].apply(find_parents)

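Note: if you modify the DataFrame, you must rebuild the tree before calling the find_parents function again. If you modify the DataFrame regularly, you could instead rebuild the tree inside the find_parents function.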
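To illustrate the parent-walking idea without an extra dependency, here is a minimal sketch that uses a plain dict in place of the tree. It assumes every child has exactly one parent and that the data contains no cycles. The helper names `build_parent_map` and `find_parents_from_map` are made up for this example, and the two lists stand in for the `child_name` and `parent_name` columns of the DataFrame.

```python
def build_parent_map(child_names, parent_names):
    """Map each child to its single parent (assumption: one parent per child)."""
    return dict(zip(child_names, parent_names))


def find_parents_from_map(parent_of, child_name):
    """Walk up the parent chain; the cost is proportional to the node's depth."""
    parents = []
    current = child_name
    # Stop at the root, which never appears as a child (assumption: no cycles)
    while current in parent_of:
        current = parent_of[current]
        parents.append(current)
    return parents


# Same data as the DataFrame columns above, written as plain lists
parent_names = ["Car", "Tyre", "Tyre", "Rubber", "Nylon", "Nylon", "Trees", "Trees"]
child_names = ["Tyre", "Rubber", "Nylon", "Trees", "Chemicals", "Man-made", "Leaves", "Stems"]

parent_of = build_parent_map(child_names, parent_names)
print(find_parents_from_map(parent_of, "Trees"))  # ['Rubber', 'Tyre', 'Car']
```

With a DataFrame, the map could be rebuilt after each modification by passing `df["child_name"]` and `df["parent_name"]` to `build_parent_map`, which is a single pass over the rows.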

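Edit: Hi @AkshayKannan, sorry about that! Since some nodes may have multiple parents, the appropriate structure here is not a tree but a directed acyclic graph (DAG). The following should work (I added a row ("Nylon", "Leaves") to test the multi-parent case):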

import pandas as pd
import networkx as nx

df = pd.DataFrame(
    {
        "parent_name":
            ["Car", "Tyre", "Tyre", "Rubber", "Nylon", "Nylon", "Trees", "Trees", "Nylon"],
        "child_name": ["Tyre", "Rubber", "Nylon", "Trees", "Chemicals", "Man-made", "Leaves", "Stems", "Leaves"]
    }
)

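G = nx.DiGraph()
for i, row in df.iterrows():
    # Edges point from child to parent, so walking "down" the graph climbs the hierarchy
    G.add_edge(row["child_name"], row["parent_name"])

# Optional visual check (requires matplotlib)
nx.draw(G, with_labels=True)


def find_parents(child_name):
    # With child -> parent edges, the graph descendants of a node are all of its parents
    return list(nx.descendants(G, child_name))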


print(find_parents("Car"))
print(find_parents("Chemicals"))
print(find_parents("Leaves"))
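For comparison, the multi-parent lookup can also be sketched with only the standard library. This is not what networkx does internally, just a rough equivalent under the same assumption that the data forms a DAG. The names `build_parents_map` and `find_all_parents` are invented for this example, and the edge list mirrors the (parent_name, child_name) rows above, including the extra ("Nylon", "Leaves") row.

```python
from collections import defaultdict


def build_parents_map(edges):
    """Map each child to the set of its direct parents."""
    parents_of = defaultdict(set)
    for parent, child in edges:
        parents_of[child].add(parent)
    return parents_of


def find_all_parents(parents_of, child_name):
    """Collect every direct and indirect parent with an iterative traversal."""
    seen = set()
    stack = [child_name]
    while stack:
        node = stack.pop()
        # .get avoids inserting empty entries into the defaultdict
        for parent in parents_of.get(node, ()):
            if parent not in seen:  # visit each parent once, even if reachable twice
                seen.add(parent)
                stack.append(parent)
    return seen


# (parent_name, child_name) pairs matching the rows of the DataFrame above
edges = [
    ("Car", "Tyre"), ("Tyre", "Rubber"), ("Tyre", "Nylon"), ("Rubber", "Trees"),
    ("Nylon", "Chemicals"), ("Nylon", "Man-made"), ("Trees", "Leaves"),
    ("Trees", "Stems"), ("Nylon", "Leaves"),
]

parents_of = build_parents_map(edges)
print(sorted(find_all_parents(parents_of, "Leaves")))
```

The `seen` set is what keeps shared ancestors such as "Tyre" from being visited once per path, which is the repeated work the original recursive version performs.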