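I want to find all parents (ancestors) of a given child in a data frame. My current code takes more than 20 seconds to run on a dataset of 3,000 data points. I think this is because I use recursive function calls and loops. Can you help me optimize it?
I look up the parent of a child node, record it, and then treat that parent as the child. I then find its parents recursively, and so on until all parents are exhausted.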
import pandas as pd

df = pd.DataFrame(
    {
        'parent_name':
            ["Car","Tyre","Tyre","Rubber","Nylon","Nylon","Trees","Trees"],
        'child_name': ["Tyre","Rubber","Nylon","Trees","Chemicals","Man-made","Leaves","Stems"]
    }
)

def get_parent_list(node_id):
    list_of_parents = []

    # Recursively collect the parent_names of a child_name.
    # Each call scans the whole data frame, which is what makes this slow.
    def find_parent(node_id):
        parent_names = df.loc[df["child_name"].isin([node_id]), "parent_name"]
        for parent_name in parent_names:
            list_of_parents.append(parent_name)
            find_parent(parent_name)

    find_parent(node_id)
    return list_of_parents

df["list_of_parents"] = df["child_name"].apply(get_parent_list)
Expected output:
If the user enters "Trees" as input
Output: Trees: Rubber, Tyre, Car
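Answer 0: (score: 1)
The most natural choice here is a tree data structure, which gives a query time linear in the number of ancestors of the node. That said, I am surprised your approach is this slow, since 3,000 data points is not a lot.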
import pandas as pd
from treelib import Tree

df = pd.DataFrame(
    {
        "parent_name":
            ["Car", "Tyre", "Tyre", "Rubber", "Nylon", "Nylon", "Trees", "Trees"],
        "child_name": ["Tyre", "Rubber", "Nylon", "Trees", "Chemicals", "Man-made", "Leaves", "Stems"]
    }
)

tree = Tree()
tree.create_node(df["parent_name"][0], df["parent_name"][0])  # root
# Each parent must already be in the tree when its child is added,
# so the rows are assumed to be ordered from the root downwards.
for i, row in df.iterrows():
    tree.create_node(row["child_name"], row["child_name"], parent=row["parent_name"])
tree.show()

def find_parents(child_name):
    parent_names = []
    # Walk up towards the root; tree.parent() returns None at the root.
    # (tree.parent() is used rather than the node's bpointer attribute,
    # which newer treelib releases mark as deprecated.)
    parent = tree.parent(child_name)
    while parent is not None:
        parent_names.append(parent.identifier)
        parent = tree.parent(parent.identifier)
    return parent_names

print(find_parents("Trees"))
df["list_of_parents"] = df["child_name"].apply(find_parents)
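Note: if you modify the data frame, you must recreate the tree before calling the find_parents function again. If you modify the data frame regularly, you can instead recreate the tree inside find_parents.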
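The "rebuild on every call" option can be sketched without any extra library. This is a minimal illustration and not treelib's API: it swaps the tree for a plain child-to-parent dictionary, assumes every child has exactly one parent, and the name find_parents_fresh and the pairs argument are hypothetical.

```python
def find_parents_fresh(child_name, pairs):
    """Return the ancestors of child_name, nearest first.

    pairs is an iterable of (parent_name, child_name) tuples, for example
    zip(df["parent_name"], df["child_name"]). Assumes one parent per child.
    """
    # Rebuild the child -> parent lookup on every call so it always
    # reflects the current contents of the data.
    parent_of = {child: parent for parent, child in pairs}

    parents = []
    seen = {child_name}
    current = child_name
    while current in parent_of:
        current = parent_of[current]
        if current in seen:  # guard against an accidental cycle in the data
            break
        seen.add(current)
        parents.append(current)
    return parents


pairs = [
    ("Car", "Tyre"), ("Tyre", "Rubber"), ("Tyre", "Nylon"), ("Rubber", "Trees"),
    ("Nylon", "Chemicals"), ("Nylon", "Man-made"), ("Trees", "Leaves"), ("Trees", "Stems"),
]
print(find_parents_fresh("Trees", pairs))  # ['Rubber', 'Tyre', 'Car']
```

Rebuilding costs one pass over all rows per call, so this only pays off when the data changes between most lookups; otherwise build the lookup once and reuse it.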
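Edit: Hi @AkshayKannan, sorry about that! Since some nodes may have more than one parent, the appropriate structure here is not a tree but a directed acyclic graph (DAG). The following should work (I added a row ("Nylon", "Leaves") to test the multi-parent case):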
import pandas as pd
import networkx as nx

df = pd.DataFrame(
    {
        "parent_name":
            ["Car", "Tyre", "Tyre", "Rubber", "Nylon", "Nylon", "Trees", "Trees", "Nylon"],
        "child_name": ["Tyre", "Rubber", "Nylon", "Trees", "Chemicals", "Man-made", "Leaves", "Stems", "Leaves"]
    }
)

G = nx.DiGraph()
# Edges point from child to parent, so the graph "descendants" of a node
# are its ancestors in the parent/child sense.
for i, row in df.iterrows():
    G.add_edge(row["child_name"], row["parent_name"])
nx.draw(G, with_labels=True)  # requires matplotlib

def find_parents(child_name):
    # nx.descendants returns a set, so the order of this list is arbitrary.
    return list(nx.descendants(G, child_name))

print(find_parents("Car"))
print(find_parents("Chemicals"))
print(find_parents("Leaves"))
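Because nx.descendants returns a set, the list above comes back in no particular order, whereas the expected output in the question lists the nearest parent first. If that order matters, a breadth-first walk over a child-to-parents mapping is one way to get it. The following is a dependency-free sketch, not part of networkx; build_parent_map and find_parents_ordered are hypothetical names, and the pairs list mirrors the data frame above.

```python
from collections import defaultdict, deque


def build_parent_map(pairs):
    """Map each child to the list of its direct parents (built once)."""
    parents_of = defaultdict(list)
    for parent, child in pairs:
        parents_of[child].append(parent)
    return parents_of


def find_parents_ordered(child_name, parents_of):
    """Return all ancestors of child_name, nearest generation first."""
    found = []
    seen = {child_name}
    queue = deque([child_name])
    while queue:
        node = queue.popleft()
        for parent in parents_of.get(node, []):
            if parent not in seen:  # visit each ancestor only once
                seen.add(parent)
                found.append(parent)
                queue.append(parent)
    return found


# Same rows as the data frame, e.g. list(zip(df["parent_name"], df["child_name"]))
pairs = [
    ("Car", "Tyre"), ("Tyre", "Rubber"), ("Tyre", "Nylon"), ("Rubber", "Trees"),
    ("Nylon", "Chemicals"), ("Nylon", "Man-made"), ("Trees", "Leaves"),
    ("Trees", "Stems"), ("Nylon", "Leaves"),
]
parents_of = build_parent_map(pairs)
print(find_parents_ordered("Car", parents_of))        # []
print(find_parents_ordered("Chemicals", parents_of))  # ['Nylon', 'Tyre', 'Car']
print(find_parents_ordered("Leaves", parents_of))     # ['Trees', 'Nylon', 'Rubber', 'Tyre', 'Car']
```

The mapping is built in a single pass over the rows, and each lookup touches only the ancestors of the queried node, so it avoids the repeated full-frame scans of the original recursive version.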