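What is the best way to find the IDs of isolated vertices using GraphFrames? In the latest version we can remove them from the graph with dropIsolatedVertices(), but I would also like to know their IDs.
Thanks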
Answer 0 (score: 0)
An ugly solution is:
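# ids of all vertices that have at least one edge (collected to the driver)
notiso = [row.id for row in g.dropIsolatedVertices().vertices.select("id").collect()]
# keep only the vertices whose id is not in that list
iso = g.vertices.filter(~g.vertices.id.isin(notiso)).select("id")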
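The same result can probably be obtained without collecting ids to the driver. The sketch below assumes g is a GraphFrame and relies on g.degrees listing only the vertices that appear in at least one edge, so an anti-join against it leaves the isolated ones. The helper name isolated_vertices is hypothetical, not part of GraphFrames.

```python
def isolated_vertices(g):
    """Return a DataFrame with the ids of vertices that have no incident edge.

    Assumption: `g` is a GraphFrame, so `g.vertices` has an "id" column and
    `g.degrees` has one row ("id", "degree") per vertex that touches an edge.
    """
    # Vertices that occur as the source or destination of some edge.
    connected = g.degrees.select("id")
    # left_anti keeps the rows of the left side that have no match on the right.
    return g.vertices.select("id").join(connected, on="id", how="left_anti")
```

Calling isolated_vertices(g).show() would then print the ids. Everything stays a distributed DataFrame, which matters when the graph has many connected vertices.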
Answer 1 (score: 0)
Another approach is to use the connectedComponents function, filter on the component size, and finally select what you need:
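from graphframes import GraphFrame

# connectedComponents() with the default algorithm requires a Spark
# checkpoint directory to be set (SparkContext.setCheckpointDir)
g = GraphFrame(vertices, edges)
get_connected_components = g.connectedComponents()
# number of vertices in each component
my_conncected_vertices = get_connected_components \
    .select("id", "component") \
    .groupBy("component") \
    .count() \
    .withColumnRenamed("count", "number_vertices") \
    .withColumnRenamed("component", "component_id")
# a component of size 1 contains exactly one vertex, i.e. an isolated one
my_isolated_vertices = my_conncected_vertices \
    .join(get_connected_components,
          get_connected_components.component == my_conncected_vertices.component_id,
          how="left") \
    .filter("number_vertices == 1") \
    .select("id")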
Results for my graph:
get_connected_components.show()
+-------+-------------+
| id| component|
+-------+-------------+
| b6| 240518168576|
| a3| 343597383680|
| a4| 343597383680|
| c7| 498216206336|
| b2| 240518168576|
| c9| 498216206336|
| c5| 498216206336|
| c1| 498216206336|
| c6| 498216206336|
| a2| 343597383680|
| b3| 240518168576|
| b1| 240518168576|
| c8| 498216206336|
|alone11|1116691496960|
| a1| 343597383680|
| c4| 498216206336|
| c3| 498216206336|
|alone12|1340029796352|
| b4| 240518168576|
| c2| 498216206336|
+-------+-------------+
only showing top 20 rows
my_conncected_vertices.show()
+-------------+---------------+
| component_id|number_vertices|
+-------------+---------------+
| 240518168576| 6|
|1116691496960| 1|
|1340029796352| 1|
| 498216206336| 10|
| 343597383680| 4|
+-------------+---------------+
my_isolated_vertices.show()
+-------+
| id|
+-------+
|alone11|
|alone12|
+-------+
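The component-size filter above can also be written without renaming the join column. This is a minimal sketch, assuming components is the DataFrame returned by g.connectedComponents() with columns "id" and "component"; the function name singleton_component_ids is hypothetical.

```python
def singleton_component_ids(components):
    """Return ids of vertices that are alone in their connected component.

    Assumption: `components` is the output of g.connectedComponents(),
    with the columns "id" and "component".
    """
    # Count the vertices in each component; rename "count" to a plain name.
    sizes = components.groupBy("component").count().withColumnRenamed("count", "n")
    # Keep only the components that contain a single vertex.
    singletons = sizes.filter("n = 1").select("component")
    # A semi-join keeps the rows of `components` whose component is a
    # singleton, without adding any columns from the right side.
    return components.join(singletons, on="component", how="left_semi").select("id")
```

One caveat, as far as I can tell: a vertex whose only edge is a self-loop also forms a component of size 1, so this can return vertices that dropIsolatedVertices() would keep.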