我有一个两列数据集,描述了形成一棵大树的多个子父关系。我想用它来构建每个节点的每个后代的更新列表。
child parent
1 2010 1000
7 2100 1000
5 2110 1000
3 3000 2110
2 3011 2010
4 3033 2100
0 3102 2010
6 3111 2110
descendant ancestor
0 2010 1000
1 2100 1000
2 2110 1000
3 3000 1000
4 3011 1000
5 3033 1000
6 3102 1000
7 3111 1000
8 3011 2010
9 3102 2010
10 3033 2100
11 3000 2110
12 3111 2110
最初我决定使用DataFrames的递归解决方案。它按预期工作,但熊猫的效率非常低。我的研究让我相信使用NumPy数组(或其他简单数据结构)的实现在大型数据集(数千个记录中的10个)上要快得多。
import pandas as pd
df = pd.DataFrame(
{
'child': [3102, 2010, 3011, 3000, 3033, 2110, 3111, 2100],
'parent': [2010, 1000, 2010, 2110, 2100, 1000, 2110, 1000]
}, columns=['child', 'parent']
)
def get_ancestry_dataframe_flat(df):
def get_child_list(parent_id):
list_of_children = list()
list_of_children.append(df[df['parent'] == parent_id]['child'].values)
for i, r in df[df['parent'] == parent_id].iterrows():
if r['child'] != parent_id:
list_of_children.append(get_child_list(r['child']))
# flatten list
list_of_children = [item for sublist in list_of_children for item in sublist]
return list_of_children
new_df = pd.DataFrame(columns=['descendant', 'ancestor']).astype(int)
for index, row in df.iterrows():
temp_df = pd.DataFrame(columns=['descendant', 'ancestor'])
temp_df['descendant'] = pd.Series(get_child_list(row['parent']))
temp_df['ancestor'] = row['parent']
new_df = new_df.append(temp_df)
new_df = new_df\
.drop_duplicates()\
.sort_values(['ancestor', 'descendant'])\
.reset_index(drop=True)
return new_df
因为以这种方式使用pandas DataFrames对大型数据集效率非常低,我需要提高此操作的性能。我的理解是,这可以通过更好地使用更高效的数据结构来完成适合循环和递归。我想以最有效的方式执行相同的操作。
具体来说,我要求优化速度。
答案 0 :(得分:7)
这是一种构建dict的方法,以便更轻松地导航树。然后运行一次树,并将孩子添加到他们的祖父母及以上。最后将新数据添加到数据框中。
def add_children_of_children(dataframe, root_node):
# build a dict of lists to allow easy tree descent
tree = {}
for idx, (child, parent) in dataframe.iterrows():
tree.setdefault(parent, []).append(child)
data = []
def descend_tree(parent):
# get list of children of this parent
children = tree[parent]
# reverse order so that we can modify the list while looping
for child in reversed(children):
if child in tree:
# descend tree and find children which need to be added
lower_children = descend_tree(child)
# add children from below to parent at this level
data.extend([(c, parent) for c in lower_children])
# return lower children to parents above
children.extend(lower_children)
return children
descend_tree(root_node)
return dataframe.append(
pd.DataFrame(data, columns=dataframe.columns))
测试代码中有三种测试方法,即时间运行的秒数:
add_children_of_children()
来自上方。add_children_of_children()
,输出已排序。get_ancestry_dataframe_flat()
pandas实施。因此,原生数据结构方法比原始实现快得多。
import pandas as pd
df = pd.DataFrame(
{
'child': [3102, 2010, 3011, 3000, 3033, 2110, 3111, 2100],
'parent': [2010, 1000, 2010, 2110, 2100, 1000, 2110, 1000]
}, columns=['child', 'parent']
)
def method1():
# the root node is the node which is not a child
root = set(df.parent) - set(df.child)
assert len(root) == 1, "Number of roots != 1 '{}'".format(root)
return add_children_of_children(df, root.pop())
def method2():
dataframe = method1()
names = ['ancestor', 'descendant']
rename = {o: n for o, n in zip(dataframe.columns, reversed(names))}
return dataframe.rename(columns=rename) \
.sort_values(names).reset_index(drop=True)
def method3():
return get_ancestry_dataframe_flat(df)
def get_ancestry_dataframe_flat(df):
def get_child_list(parent_id):
list_of_children = list()
list_of_children.append(
df[df['parent'] == parent_id]['child'].values)
for i, r in df[df['parent'] == parent_id].iterrows():
if r['child'] != parent_id:
list_of_children.append(get_child_list(r['child']))
# flatten list
list_of_children = [
item for sublist in list_of_children for item in sublist]
return list_of_children
new_df = pd.DataFrame(columns=['descendant', 'ancestor']).astype(int)
for index, row in df.iterrows():
temp_df = pd.DataFrame(columns=['descendant', 'ancestor'])
temp_df['descendant'] = pd.Series(get_child_list(row['parent']))
temp_df['ancestor'] = row['parent']
new_df = new_df.append(temp_df)
new_df = new_df\
.drop_duplicates()\
.sort_values(['ancestor', 'descendant'])\
.reset_index(drop=True)
return new_df
print(method2())
print(method3())
from timeit import timeit
print(timeit(method1, number=50))
print(timeit(method2, number=50))
print(timeit(method3, number=50))
descendant ancestor
0 2010 1000
1 2100 1000
2 2110 1000
3 3000 1000
4 3011 1000
5 3033 1000
6 3102 1000
7 3111 1000
8 3011 2010
9 3102 2010
10 3033 2100
11 3000 2110
12 3111 2110
descendant ancestor
0 2010 1000
1 2100 1000
2 2110 1000
3 3000 1000
4 3011 1000
5 3033 1000
6 3102 1000
7 3111 1000
8 3011 2010
9 3102 2010
10 3033 2100
11 3000 2110
12 3111 2110
0.0737142168563
0.153700592966
3.38558308083
答案 1 :(得分:4)
这是一种使用numpy一次迭代树一代的方法。
import numpy as np
import pandas as pd # only used to return a dataframe
def list_ancestors(edges):
"""
Take edge list of a rooted tree as a numpy array with shape (E, 2),
child nodes in edges[:, 0], parent nodes in edges[:, 1]
Return pandas dataframe of all descendant/ancestor node pairs
Ex:
df = pd.DataFrame({'child': [200, 201, 300, 301, 302, 400],
'parent': [100, 100, 200, 200, 201, 300]})
df
child parent
0 200 100
1 201 100
2 300 200
3 301 200
4 302 201
5 400 300
list_ancestors(df.values)
returns
descendant ancestor
0 200 100
1 201 100
2 300 200
3 300 100
4 301 200
5 301 100
6 302 201
7 302 100
8 400 300
9 400 200
10 400 100
"""
ancestors = []
for ar in trace_nodes(edges):
ancestors.append(np.c_[np.repeat(ar[:, 0], ar.shape[1]-1),
ar[:, 1:].flatten()])
return pd.DataFrame(np.concatenate(ancestors),
columns=['descendant', 'ancestor'])
def trace_nodes(edges):
"""
Take edge list of a rooted tree as a numpy array with shape (E, 2),
child nodes in edges[:, 0], parent nodes in edges[:, 1]
Yield numpy array with cross-section of tree and associated
ancestor nodes
Ex:
df = pd.DataFrame({'child': [200, 201, 300, 301, 302, 400],
'parent': [100, 100, 200, 200, 201, 300]})
df
child parent
0 200 100
1 201 100
2 300 200
3 301 200
4 302 201
5 400 300
trace_nodes(df.values)
yields
array([[200, 100],
[201, 100]])
array([[300, 200, 100],
[301, 200, 100],
[302, 201, 100]])
array([[400, 300, 200, 100]])
"""
mask = np.in1d(edges[:, 1], edges[:, 0])
gen_branches = edges[~mask]
edges = edges[mask]
yield gen_branches
while edges.size != 0:
mask = np.in1d(edges[:, 1], edges[:, 0])
next_gen = edges[~mask]
gen_branches = numpy_col_inner_many_to_one_join(next_gen, gen_branches)
edges = edges[mask]
yield gen_branches
def numpy_col_inner_many_to_one_join(ar1, ar2):
"""
Take two 2-d numpy arrays ar1 and ar2,
with no duplicate values in first column of ar2
Return inner join of ar1 and ar2 on
last column of ar1, first column of ar2
Ex:
ar1 = np.array([[1, 2, 3],
[4, 5, 3],
[6, 7, 8],
[9, 10, 11]])
ar2 = np.array([[ 1, 2],
[ 3, 4],
[ 5, 6],
[ 7, 8],
[ 9, 10],
[11, 12]])
numpy_col_inner_many_to_one_join(ar1, ar2)
returns
array([[ 1, 2, 3, 4],
[ 4, 5, 3, 4],
[ 9, 10, 11, 12]])
"""
ar1 = ar1[np.in1d(ar1[:, -1], ar2[:, 0])]
ar2 = ar2[np.in1d(ar2[:, 0], ar1[:, -1])]
if 'int' in ar1.dtype.name and ar1[:, -1].min() >= 0:
bins = np.bincount(ar1[:, -1])
counts = bins[bins.nonzero()[0]]
else:
counts = np.unique(ar1[:, -1], False, False, True)[1]
left = ar1[ar1[:, -1].argsort()]
right = ar2[ar2[:, 0].argsort()]
return np.concatenate([left[:, :-1],
right[np.repeat(np.arange(right.shape[0]),
counts)]], 1)
测试用例1& 2由@ taky2提供,测试用例3& 4分别比较了高大树结构和宽大树结构的性能 - 大多数用例可能位于中间的某个位置。
df = pd.DataFrame(
{
'child': [3102, 2010, 3011, 3000, 3033, 2110, 3111, 2100],
'parent': [2010, 1000, 2010, 2110, 2100, 1000, 2110, 1000]
}
)
df2 = pd.DataFrame(
{
'child': [4321, 3102, 4023, 2010, 5321, 4200, 4113, 6525, 4010, 4001,
3011, 5010, 3000, 3033, 2110, 6100, 3111, 2100, 6016, 4311],
'parent': [3111, 2010, 3000, 1000, 4023, 3011, 3033, 5010, 3011, 3102,
2010, 4023, 2110, 2100, 1000, 5010, 2110, 1000, 5010, 3033]
}
)
df3 = pd.DataFrame(np.r_[np.c_[np.arange(1, 501), np.arange(500)],
np.c_[np.arange(501, 1001), np.arange(500)]],
columns=['child', 'parent'])
df4 = pd.DataFrame(np.r_[np.c_[np.arange(1, 101), np.repeat(0, 100)],
np.c_[np.arange(1001, 11001),
np.repeat(np.arange(1, 101), 100)]],
columns=['child', 'parent'])
%timeit get_ancestry_dataframe_flat(df)
10 loops, best of 3: 53.4 ms per loop
%timeit add_children_of_children(df)
1000 loops, best of 3: 1.13 ms per loop
%timeit all_descendants_nx(df)
1000 loops, best of 3: 675 µs per loop
%timeit list_ancestors(df.values)
1000 loops, best of 3: 391 µs per loop
%timeit get_ancestry_dataframe_flat(df2)
10 loops, best of 3: 168 ms per loop
%timeit add_children_of_children(df2)
1000 loops, best of 3: 1.8 ms per loop
%timeit all_descendants_nx(df2)
1000 loops, best of 3: 1.06 ms per loop
%timeit list_ancestors(df2.values)
1000 loops, best of 3: 933 µs per loop
%timeit add_children_of_children(df3)
10 loops, best of 3: 156 ms per loop
%timeit all_descendants_nx(df3)
1 loop, best of 3: 952 ms per loop
%timeit list_ancestors(df3.values)
10 loops, best of 3: 104 ms per loop
%timeit add_children_of_children(df4)
1 loop, best of 3: 503 ms per loop
%timeit all_descendants_nx(df4)
1 loop, best of 3: 238 ms per loop
%timeit list_ancestors(df4.values)
100 loops, best of 3: 2.96 ms per loop
备注:强>
get_ancestry_dataframe_flat
没有及时处理案件3& 4由于时间和记忆问题。
add_children_of_children
以在内部识别根节点,但允许假设一个唯一的根。第一行root_node = (set(dataframe.parent) - set(dataframe.child)).pop()
已添加。
all_descendants_nx
以接受数据帧作为参数,而不是从外部名称空间中提取。
演示正确行为的示例:
np.all(get_ancestry_dataframe_flat(df2).sort_values(['descendant', 'ancestor'])\
.reset_index(drop=True) ==\
list_ancestors(df2.values).sort_values(['descendant', 'ancestor'])\
.reset_index(drop=True))
Out[20]: True
答案 2 :(得分:2)
使用networkx的解决方案,文档中可能有一个更有效的方法,但这个嵌套循环可以解决问题。
import pandas as pd
from timeit import timeit
df = pd.DataFrame(
{
'child': [3102, 2010, 3011, 3000, 3033, 2110, 3111, 2100],
'parent': [2010, 1000, 2010, 2110, 2100, 1000, 2110, 1000]
}, columns=['child', 'parent']
)
在networkx 2.0中,使用from_pandas_edgelist创建directed graph:
import networkx as nx
DiG = nx.from_pandas_edgelist(df, 'parent', 'child', create_using=nx.DiGraph())
for n1 in DiG.nodes():
for n2 in nx.ancestors(DiG, n1):
print(n1,n2)
3000 1000
3000 2110
3011 1000
3011 2010
2100 1000
2110 1000
3111 1000
3111 2110
3033 1000
3033 2100
2010 1000
3102 1000
3102 2010
包装成一个功能:
def all_descendants_nx():
DiG = nx.from_pandas_edgelist(df,'parent','child',create_using=nx.DiGraph())
return pd.DataFrame.from_records([(n1,n2) for n1 in DiG.nodes() for n2 in nx.ancestors(DiG, n1)], columns=['descendant','ancestor'])
print(timeit(all_descendants_nx, number=50)) #to compare to Stephen's nice answer
0.05033063516020775
all_descendants_nx()
descendant ancestor
0 3000 1000
1 3000 2110
2 3011 1000
3 3011 2010
4 2100 1000
5 2110 1000
6 3111 1000
7 3111 2110
8 3033 1000
9 3033 2100
10 2010 1000
11 3102 1000
12 3102 2010
答案 3 :(得分:0)
这是使用isin()和map
的一种方法df_new = df.append(df[df['parent'].isin(df['child'].values.tolist())])\
.reset_index(drop = True)
df_new.loc[df_new.duplicated(), 'parent'] = df_new.loc[df_new.duplicated(), 'parent']\
.map(df.set_index('child')['parent'])
df_new = df_new.sort_values('parent').reset_index(drop=True)
df_new.columns = [' descendant' , 'ancestor']
你得到了
descendant ancestor
0 2010 1000
1 2100 1000
2 2110 1000
3 3000 1000
4 3011 1000
5 3033 1000
6 3102 1000
7 3111 1000
8 3011 2010
9 3102 2010
10 3033 2100
11 3000 2110
12 3111 2110