My input dataframe is df:
valx valy
1: 600060 09283744
2: 600131 96733110
3: 600194 01700001
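I want to create a graph that treats the two columns above as an edge list, and my output should then list all vertices of the graph together with their membership.
I also tried GraphFrames in PySpark and the networkx library, but did not get the desired result.
My output should look like the following (basically all valx and valy values, as vertices, under V1, and their membership information under V2):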
V1 V2
600060 1
96733110 1
01700001 3
I tried the following:
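import networkx as nx
import pandas as pd
filelocation = r'Pathtodataframe df csv'
# read the IDs as strings so leading zeros (e.g. 01700001) are preserved
Panda_edgelist = pd.read_csv(filelocation, dtype=str)
# from_pandas_edgelist builds an undirected nx.Graph by default
g = nx.from_pandas_edgelist(Panda_edgelist,'valx','valy')
# to_undirected() is called without passing the graph itself
g2 = g.to_undirected()
list(g.nodes)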
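Answer 0 (score: 1)
I am not sure whether you are violating any rules here by asking the same question two times.
To detect communities with GraphFrames, you first have to create a GraphFrame object. For your example dataframe, the following code snippet shows the necessary transformations: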
# assumes an active SparkSession `spark` and SparkContext `sc` (e.g. the pyspark shell)
from graphframes import GraphFrame
# connectedComponents() needs a checkpoint directory
sc.setCheckpointDir("/tmp/connectedComponents")
l = [
( '600060' , '09283744'),
( '600131' , '96733110'),
( '600194' , '01700001')
]
columns = ['valx', 'valy']
# this is your input dataframe
edges = spark.createDataFrame(l, columns)
# GraphFrames requires two dataframes: an edge dataframe and a vertex dataframe.
# The edge dataframe must have at least two columns, named src and dst.
edges = edges.withColumnRenamed('valx', 'src').withColumnRenamed('valy', 'dst')
edges.show()
# The vertex dataframe requires at least one column, named id.
# union() keeps duplicates; add .distinct() if an ID can occur in more than one edge.
vertices = edges.select('src').union(edges.select('dst')).withColumnRenamed('src', 'id')
vertices.show()
g = GraphFrame(vertices, edges)
Output:
+------+--------+
|   src|     dst|
+------+--------+
|600060|09283744|
|600131|96733110|
|600194|01700001|
+------+--------+
+--------+
|      id|
+--------+
|  600060|
|  600131|
|  600194|
|09283744|
|96733110|
|01700001|
+--------+
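One caveat about the vertex step: Spark's `union` does not remove duplicate rows, so an ID that appears in more than one edge would be listed more than once unless `.distinct()` is applied. The sample data has no such repeats, which is why the output above is already clean. A minimal plain-Python sketch of the deduplication idea follows; the helper name `unique_vertices` is hypothetical and not part of GraphFrames.

```python
def unique_vertices(edges):
    """Return each endpoint of the edge list once, in first-seen order."""
    seen = set()
    ordered = []
    for src, dst in edges:
        for vertex in (src, dst):
            if vertex not in seen:
                seen.add(vertex)
                ordered.append(vertex)
    return ordered


# Sample edges from the question, kept as strings to preserve leading zeros.
edges = [
    ("600060", "09283744"),
    ("600131", "96733110"),
    ("600194", "01700001"),
]
vertices = unique_vertices(edges)
```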
In a comment on your other question you wrote that the community detection algorithm is currently not important to you, so I will go with connected components:
result = g.connectedComponents()
result.show()
Output:
+--------+------------+
|      id|   component|
+--------+------------+
|  600060|163208757248|
|  600131| 34359738368|
|  600194|884763262976|
|09283744|163208757248|
|96733110| 34359738368|
|01700001|884763262976|
+--------+------------+
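The component values are arbitrary long IDs rather than the small numbers in the V2 column of the question. To illustrate what connected-component membership means, and how the groups could be numbered 1, 2, 3, ..., here is a minimal plain-Python union-find sketch. It is not how GraphFrames computes components; the function name `component_labels` and the first-seen numbering are assumptions made for illustration.

```python
def component_labels(edges):
    """Map every vertex to a small component number (1, 2, 3, ...)."""
    parent = {}

    def find(v):
        # Register unseen vertices as their own root.
        parent.setdefault(v, v)
        # Walk up to the root, shortening the path along the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for src, dst in edges:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[root_dst] = root_src  # merge the two components

    labels = {}
    root_to_label = {}
    for v in list(parent):
        root = find(v)
        if root not in root_to_label:
            # Number components in the order they are first seen.
            root_to_label[root] = len(root_to_label) + 1
        labels[v] = root_to_label[root]
    return labels


edges = [
    ("600060", "09283744"),
    ("600131", "96733110"),
    ("600194", "01700001"),
]
labels = component_labels(edges)
```

With the sample edges, each pair of connected IDs shares one label: 600060 and 09283744 get 1, 600131 and 96733110 get 2, and 600194 and 01700001 get 3.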
Other community detection algorithms (such as LPA) can be found in the user guide.