I have a pandas DataFrame like this:
from itertools import combinations
import pandas as pd
d = {'col1': ['a', 'b','c','d','a','b','d'], 'col2': ['XX','XX','XY','XX','YY','YY','XY']}
df_rel = pd.DataFrame(data=d)
df_rel
col1 col2
0 a XX
1 b XX
2 c XY
3 d XX
4 a YY
5 b YY
6 d XY
The unique nodes are:
uniq_nodes = df_rel['col1'].unique()
uniq_nodes
array(['a', 'b', 'c', 'd'], dtype=object)
The source (Src) and destination (Dst) for each Relationship can be generated like this:
df1 = pd.DataFrame(
    data=list(combinations(uniq_nodes, 2)),
    columns=['Src', 'Dst'])
df1
Src Dst
0 a b
1 a c
2 a d
3 b c
4 b d
5 c d
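I need a new DataFrame, newdf, built from the elements that nodes share in col2 of df_rel. The Relationship column comes from col2. The desired result is therefore an edge list with the columns Src, Dst, and Relationship.
Is there a fast way to achieve this? The original DataFrame has 30,000 rows.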
Answer 0 (score: 0)
I took this approach. It works, but it is still not very fast for large DataFrames.
from itertools import combinations
import pandas as pd
d = {'col1': ['a', 'b','c','d','a','b','d'], 'col2': ['XX','XX','XY','XX','YY','YY','XY']}
df_rel = pd.DataFrame(data=d)
df_rel
col1 col2
0 a XX
1 b XX
2 c XY
3 d XX
4 a YY
5 b YY
6 d XY
uniq_nodes = df_rel['col1'].unique()
uniq_nodes
array(['a', 'b', 'c', 'd'], dtype=object)
df1 = pd.DataFrame(
    data=list(combinations(uniq_nodes, 2)),
    columns=['Src', 'Dst'])
# Rows whose node can act as a source; rename returns a copy, which avoids SettingWithCopyWarning
filter1 = df_rel['col1'].isin(df1['Src'])
src_df = df_rel[filter1].rename(columns={'col1': 'Src'})
# Rows whose node can act as a destination
filter2 = df_rel['col1'].isin(df1['Dst'])
dst_df = df_rel[filter2].rename(columns={'col1': 'Dst'})
# Pair sources and destinations that share the same col2 value
new_df = pd.merge(src_df, dst_df, on="col2", how="inner")
print("after removing the duplicates")
new_df = new_df.drop_duplicates()
print(new_df.shape)
print("after removing self loop")
new_df = new_df[new_df['Src'] != new_df['Dst']]
new_df = new_df.rename(columns={'col2': 'Relationship'})
print(new_df.shape)
print(new_df)
Src Relationship Dst
0 a XX b
1 a XX d
3 b XX d
5 c XY d
6 a YY b
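The same idea can probably be expressed more compactly as a self-merge of df_rel on col2. The following is only a sketch, not a benchmarked solution: the names pairs and edges and the suffixes _src/_dst are made up for illustration, and it assumes that ordering each pair lexicographically (Src < Dst) is acceptable, which matches the combinations() order only when the nodes first appear in sorted order, as they do in the sample data.

```python
import pandas as pd

d = {'col1': ['a', 'b', 'c', 'd', 'a', 'b', 'd'],
     'col2': ['XX', 'XX', 'XY', 'XX', 'YY', 'YY', 'XY']}
df_rel = pd.DataFrame(data=d)

# Pair every row with every row that has the same col2 value.
pairs = df_rel.merge(df_rel, on='col2', suffixes=('_src', '_dst'))

# Keeping only Src < Dst drops self loops and keeps each unordered pair once.
pairs = pairs[pairs['col1_src'] < pairs['col1_dst']]

edges = (pairs.rename(columns={'col1_src': 'Src',
                               'col1_dst': 'Dst',
                               'col2': 'Relationship'})
              [['Src', 'Relationship', 'Dst']]
              .drop_duplicates()
              .reset_index(drop=True))
print(edges)
```

The self-merge still materializes every pair inside each col2 group, so memory use grows quadratically with group size; whether it is faster on the 30,000-row frame would need to be measured.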
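Answer 1 (score: 0)
You need to iterate over the rows of df1 and, for each one, find the rows in df_rel whose col1 matches df1['Src'] and df1['Dst']. Once you have the col2 values for Src and Dst, compare them, and for every value they share, create a row in newdf. Give it a try and check whether it works for your large dataset.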
Data setup (same as yours):
from itertools import combinations
import pandas as pd
d = {'col1': ['a', 'b', 'c', 'd', 'a', 'b', 'd'], 'col2': ['XX', 'XX', 'XY', 'XX', 'YY', 'YY', 'XY']}
df_rel = pd.DataFrame(data=d)
uniq_nodes = df_rel['col1'].unique()
df1 = pd.DataFrame(data=list(combinations(uniq_nodes, 2)), columns=['Src', 'Dst'])
Code:
rows = []
for i, row in df1.iterrows():
    # col2 values attached to each endpoint of the candidate pair
    src = df_rel[df_rel['col1'] == row['Src']]['col2'].to_list()
    dst = df_rel[df_rel['col1'] == row['Dst']]['col2'].to_list()
    for x in src:
        if x in dst:
            rows.append({'Src': row['Src'], 'Dst': row['Dst'], 'Relationship': x})
# DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows
newdf = pd.DataFrame(rows, columns=['Src', 'Dst', 'Relationship'])
print(newdf)
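Because the loop above filters df_rel twice for every candidate pair, it may become slow as the number of nodes grows. One possible alternative, sketched here under the assumption that the order of pairs within each relationship does not matter, is to group by col2 once and generate combinations only among the nodes inside each group. The variable names rows, rel, and nodes are illustrative.

```python
from itertools import combinations
import pandas as pd

d = {'col1': ['a', 'b', 'c', 'd', 'a', 'b', 'd'],
     'col2': ['XX', 'XX', 'XY', 'XX', 'YY', 'YY', 'XY']}
df_rel = pd.DataFrame(data=d)

rows = []
# Visit each relationship once; sort=False keeps groups in first-appearance order.
for rel, nodes in df_rel.groupby('col2', sort=False)['col1']:
    # unique() drops repeated nodes within a group while keeping their order.
    for src, dst in combinations(nodes.unique(), 2):
        rows.append((src, dst, rel))

newdf = pd.DataFrame(rows, columns=['Src', 'Dst', 'Relationship'])
print(newdf)
```

This only ever forms pairs that actually share a col2 value, so pairs with no relationship cost nothing. Note that Src/Dst orientation follows the order of appearance within each group, which may differ from the order in df1 on other data.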