I am trying to create a unique synthetic key after identifying the relationships between the original keys.
My DataFrame:
Key Value
K1 1
K2 2
K2 3
K1 3
K2 4
K1 5
K3 6
K4 6
K5 7
Expected result:
Key Value New_Key
K1 1 NK1
K2 2 NK1
K2 3 NK1
K1 3 NK1
K2 4 NK1
K1 5 NK1
K3 6 NK2
K4 6 NK2
K5 7 NK3
I am looking for an answer in Python 3 or PySpark.
I tried the following code:
# Import libraries
import networkx as nx
import pandas as pd
# Create the DataFrame from the data shown above
d1 = pd.DataFrame({'Key': ['K1', 'K2', 'K2', 'K1', 'K2', 'K1', 'K3', 'K4', 'K5'],
                   'Value': [1, 2, 3, 3, 4, 5, 6, 6, 7]})
# Create an empty graph
G = nx.Graph()
# Create a list of (Key, Value) edge tuples
e = list(d1.itertuples(index=False, name=None))
# Create a list of nodes/vertices
v = list(set(d1['Key']).union(set(d1['Value'])))
# Add nodes and edges to the graph
G.add_edges_from(e)
G.add_nodes_from(v)
# Get the list of connected components
c = list(nx.connected_components(G))
print(c)
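Thanks.
Answer 0 (score: 0)
The problem you are trying to solve is a graph problem known as connected components. All you have to do is treat your Keys and Values as vertices and run a connected components algorithm. The following shows a solution using PySpark and GraphFrames.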
# Assumes a pyspark shell, where `sc` and `spark` already exist
from graphframes import GraphFrame
# connectedComponents() needs a checkpoint directory
sc.setCheckpointDir('/tmp/graphframes')
l = [('K1' , 1),
('K2' , 2),
('K2' , 3),
('K1' , 3),
('K2' , 4),
('K1' , 5),
('K3' , 6),
('K4' , 6),
('K5' , 7)]
columns = ['Key', 'Value']
df=spark.createDataFrame(l, columns)
# Creating a GraphFrame
# An edge DataFrame requires a src and a dst column
edges = df.withColumnRenamed('Key', 'src')\
    .withColumnRenamed('Value', 'dst')
# A vertex DataFrame requires an id column
vertices = df.select('Key').union(df.select('Value')).withColumnRenamed('Key', 'id')
# This creates a GraphFrame...
g = GraphFrame(vertices, edges)
# ...which already provides a connectedComponents method
cC = g.connectedComponents().withColumnRenamed('id', 'Key')
# Join the connected-components DataFrame with the original DataFrame to add the new keys.
# distinct() drops the duplicate rows that the join produces; a likely cause is that
# `vertices` repeats each key once per input row, since it is never de-duplicated.
df = df.join(cC, 'Key', 'inner').distinct()
df.show()
Output:
+---+-----+------------+
|Key|Value|   component|
+---+-----+------------+
| K3|    6|335007449088|
| K1|    5|154618822656|
| K1|    1|154618822656|
| K1|    3|154618822656|
| K2|    2|154618822656|
| K2|    3|154618822656|
| K2|    4|154618822656|
| K4|    6|335007449088|
| K5|    7| 25769803776|
+---+-----+------------+
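The component ids above are arbitrary long integers rather than the NK1, NK2, NK3 labels from the question. If the data fits in memory, the same connected-components idea can be sketched in plain Python 3 with a small union-find structure. This is only an illustration, not part of the GraphFrames solution: the function name `assign_new_keys` and the `prefix` parameter are made up for this example, and labels are numbered in order of first appearance, which is an assumption about the desired naming.

```python
# Sketch: label connected components of a Key/Value table with union-find.
# Standard library only; `assign_new_keys` and `prefix` are hypothetical names.

def assign_new_keys(rows, prefix="NK"):
    """Return one (key, value, new_key) tuple per input (key, value) row."""
    parent = {}

    def find(node):
        # Walk up to the root, shortening the path along the way.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Tag each node with its column so that a key and a value
    # that happen to be equal are never merged by accident.
    for key, value in rows:
        union(("key", key), ("value", value))

    labels = {}  # component root -> label, numbered by first appearance
    result = []
    for key, value in rows:
        root = find(("key", key))
        if root not in labels:
            labels[root] = f"{prefix}{len(labels) + 1}"
        result.append((key, value, labels[root]))
    return result


rows = [("K1", 1), ("K2", 2), ("K2", 3), ("K1", 3), ("K2", 4),
        ("K1", 5), ("K3", 6), ("K4", 6), ("K5", 7)]

for row in assign_new_keys(rows):
    print(row)
```

With the sample data, K1 and K2 share the value 3 and K3 and K4 share the value 6, so the rows fall into three groups. For data that does not fit on one machine, the GraphFrames approach above is the better fit.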