How to create a multi-relationship edge list from a pandas DataFrame?

Time: 2020-07-02 15:48:29

Tags: python pandas performance dataframe

I have a pandas DataFrame like this:

 from itertools import combinations
 import pandas as pd
 d = {'col1': ['a', 'b','c','d','a','b','d'], 'col2': ['XX','XX','XY','XX','YY','YY','XY']}
 df_rel = pd.DataFrame(data=d)
 df_rel
       col1 col2
    0   a   XX
    1   b   XX
    2   c   XY
    3   d   XX
    4   a   YY
    5   b   YY
    6   d   XY

The unique nodes are:

uniq_nodes = df_rel['col1'].unique()
uniq_nodes
array(['a', 'b', 'c', 'd'], dtype=object)

The source (Src) and destination (Dst) of each Relationship can be generated like this:

df1 = pd.DataFrame(
    data=list(combinations(uniq_nodes, 2)), 
    columns=['Src', 'Dst'])
df1
  Src   Dst
0   a   b
1   a   c
2   a   d
3   b   c
4   b   d
5   c   d

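I need a new DataFrame, newdf, built from the elements that the nodes share in col2 of df_rel, with the Relationship column taken from col2. The desired result is a DataFrame holding the edge list: one row per pair of nodes that share a col2 value, with the columns Src, Dst and Relationship.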

Is there a faster way to achieve this? The original DataFrame has 30,000 rows.

2 Answers:

Answer 0: (score: 0)

I took this approach. It works, but it is still not very fast for large DataFrames.

 from itertools import combinations
 import pandas as pd
 d = {'col1': ['a', 'b','c','d','a','b','d'], 'col2': ['XX','XX','XY','XX','YY','YY','XY']}
 df_rel = pd.DataFrame(data=d)
 df_rel
       col1 col2
    0   a   XX
    1   b   XX
    2   c   XY
    3   d   XX
    4   a   YY
    5   b   YY
    6   d   XY   

uniq_nodes = df_rel['col1'].unique()
uniq_nodes
array(['a', 'b', 'c', 'd'], dtype=object)
df1 = pd.DataFrame(
            data=list(combinations(uniq_nodes, 2)),
            columns=['Src', 'Dst'])
     
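# Rows whose node appears as a source; rename on the filtered copy
# instead of in place to avoid a SettingWithCopyWarning
filter1 = df_rel['col1'].isin(df1['Src'])
src_df = df_rel[filter1].rename(columns={'col1': 'Src'})
# Rows whose node appears as a destination
filter2 = df_rel['col1'].isin(df1['Dst'])
dst_df = df_rel[filter2].rename(columns={'col1': 'Dst'})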
new_df = pd.merge(src_df,dst_df, on = "col2",how="inner")
print ("after removing the duplicates")
new_df = new_df.drop_duplicates()
print(new_df.shape)
print ("after removing self loop")
new_df = new_df[new_df['Src'] != new_df['Dst']]
new_df.rename(columns={'col2':'Relationship'}, inplace=True)
print(new_df.shape)
print (new_df)
           Src Relationship Dst
        0   a           XX   b
        1   a           XX   d
        3   b           XX   d
        5   c           XY   d
        6   a           YY   b
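
The same merge idea can also be written as a self-join of df_rel on col2, keeping only the rows where Src sorts before Dst. That single filter removes self-loops and mirrored pairs together. This is a minimal sketch rather than a benchmarked solution: the function name edge_list_by_merge is made up for illustration, and it assumes the node labels can be compared with `<`.

```python
import pandas as pd

def edge_list_by_merge(df_rel):
    # Hypothetical helper, not part of pandas.
    # Drop repeated (node, relationship) rows so the join does not multiply them.
    pairs = df_rel[['col1', 'col2']].drop_duplicates()
    # Self-join on the relationship column: every ordered pair of nodes
    # that share a col2 value, including (x, x) pairs.
    merged = pairs.merge(pairs, on='col2', suffixes=('_src', '_dst'))
    # Keep one orientation per pair; this also drops the self-loops.
    merged = merged[merged['col1_src'] < merged['col1_dst']]
    merged = merged.rename(columns={'col1_src': 'Src',
                                    'col1_dst': 'Dst',
                                    'col2': 'Relationship'})
    return merged[['Src', 'Dst', 'Relationship']].reset_index(drop=True)

d = {'col1': ['a', 'b', 'c', 'd', 'a', 'b', 'd'],
     'col2': ['XX', 'XX', 'XY', 'XX', 'YY', 'YY', 'XY']}
df_rel = pd.DataFrame(data=d)
new_df = edge_list_by_merge(df_rel)
print(new_df)
```

The intermediate join grows with the square of the number of nodes that share one col2 value, so memory use is worth checking on the 30,000-row frame before relying on this.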

Answer 1: (score: 0)

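You need to iterate over the rows of df1 and, for each one, look up the rows of df_rel whose col1 matches df1['Src'] and df1['Dst']. Once you have the col2 values for Src and for Dst, compare them, and for every value they have in common create a row in newdf. Give it a try and check whether it is fast enough on the large dataset.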

Data setup (same as yours):

from itertools import combinations
import pandas as pd

d = {'col1': ['a', 'b', 'c', 'd', 'a', 'b', 'd'], 'col2': ['XX', 'XX', 'XY', 'XX', 'YY', 'YY', 'XY']}
df_rel = pd.DataFrame(data=d)

uniq_nodes = df_rel['col1'].unique()

df1 = pd.DataFrame(data=list(combinations(uniq_nodes, 2)),  columns=['Src', 'Dst'])

Code:

rows = []
for i, row in df1.iterrows():
    # col2 values recorded for the source and destination node of this pair
    src = df_rel[df_rel['col1'] == row['Src']]['col2'].to_list()
    dst = df_rel[df_rel['col1'] == row['Dst']]['col2'].to_list()
    for x in src:
        if x in dst:
            rows.append({'Src': row['Src'], 'Dst': row['Dst'], 'Relationship': x})

# Build the frame once at the end; DataFrame.append was removed in pandas 2.0
newdf = pd.DataFrame(rows, columns=['Src', 'Dst', 'Relationship'])
print(newdf)
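
The loop above filters df_rel twice for every pair of nodes. One way to avoid the repeated filtering, sketched here on the assumption that the col2 values of all nodes fit in memory, is to collect them once into a dictionary of sets and intersect the two sets for each pair. The function name edge_list_by_sets is hypothetical, and the sketch has not been timed on the 30,000-row frame.

```python
from itertools import combinations

import pandas as pd

def edge_list_by_sets(df_rel):
    # Hypothetical helper, not part of pandas.
    # One pass over the frame: node -> set of relationships it appears in.
    rels = {}
    for node, rel in zip(df_rel['col1'], df_rel['col2']):
        rels.setdefault(node, set()).add(rel)

    rows = []
    # Dict keys keep first-seen order, so pairs come out like combinations(uniq_nodes, 2).
    for src, dst in combinations(rels, 2):
        # Relationships shared by both nodes, sorted for a stable output order.
        for rel in sorted(rels[src] & rels[dst]):
            rows.append((src, dst, rel))
    return pd.DataFrame(rows, columns=['Src', 'Dst', 'Relationship'])

d = {'col1': ['a', 'b', 'c', 'd', 'a', 'b', 'd'],
     'col2': ['XX', 'XX', 'XY', 'XX', 'YY', 'YY', 'XY']}
df_rel = pd.DataFrame(data=d)
edge_df = edge_list_by_sets(df_rel)
print(edge_df)
```

This still visits every pair of nodes, so the number of set intersections grows quadratically with the number of unique nodes; only the per-pair cost is reduced.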