Finding the IDs of isolated vertices with GraphFrames

Date: 2019-05-28 23:03:23

Tags: apache-spark graphframes

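What is the best way to find the IDs of isolated vertices using GraphFrames? In the latest version we can remove them from the graph with "dropIsolatedVertices()", but I would also like to know their IDs.

Thanks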

2 answers:

Answer 0: (score: 0)

An ugly solution is:

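# IDs of the vertices that have at least one edge (collected to the driver)
notiso = [row.id for row in g.dropIsolatedVertices().vertices.select("id").collect()]
# The isolated vertices are those whose ID is not in that list
iso = g.vertices.filter(~g.vertices.id.isin(notiso)).select("id")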
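
The collect() above pulls every non-isolated ID to the driver, which may not scale to large graphs. The underlying idea is a set difference: a vertex is isolated when its ID never appears as the src or dst of any edge. Below is a minimal plain-Python sketch of that logic. The function name and the sample data are hypothetical, and this is not GraphFrames code. The trailing comment shows how the same idea might be expressed with DataFrame joins, assuming the usual GraphFrames column names id, src and dst.

```python
def isolated_vertex_ids(vertex_ids, edges):
    """Return the IDs that never appear as an endpoint of any edge."""
    endpoints = set()
    for src, dst in edges:
        endpoints.add(src)
        endpoints.add(dst)
    # Keep the input order, dropping IDs that touch at least one edge
    return [v for v in vertex_ids if v not in endpoints]


# Hypothetical sample graph, for illustration only
sample_vertices = ["a1", "a2", "a3", "alone11", "alone12"]
sample_edges = [("a1", "a2"), ("a2", "a3")]
print(isolated_vertex_ids(sample_vertices, sample_edges))

# A possible PySpark equivalent (untested sketch, avoids collect()):
#   from pyspark.sql import functions as F
#   endpoints = (g.edges.select(F.col("src").alias("id"))
#                .union(g.edges.select(F.col("dst").alias("id")))
#                .distinct())
#   iso = g.vertices.join(endpoints, on="id", how="left_anti").select("id")
```

A left anti join keeps only the rows of g.vertices that have no match in the endpoint list, so the result stays a distributed DataFrame.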

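Answer 1: (score: 0)

Another approach is to use the connectedComponents function, count the vertices in each component, and then select the vertices whose component contains only one vertex.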

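from graphframes import GraphFrame

# Note: the default connectedComponents() algorithm requires a checkpoint
# directory, set beforehand with the SparkContext's setCheckpointDir().
g = GraphFrame(vertices, edges)
get_connected_components = g.connectedComponents()

# Number of vertices in each component
my_conncected_vertices = get_connected_components \
    .select("id", "component") \
    .groupBy("component") \
    .count() \
    .withColumnRenamed("count", "n") \
    .withColumnRenamed("component", "component2")

# Join the counts back to the vertices and keep the components of size 1
my_isolated_vertices = my_conncected_vertices \
    .join(get_connected_components,
          get_connected_components.component == my_conncected_vertices.component2,
          how="left") \
    .filter("n == 1") \
    .select("id")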
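
To sanity-check the component-size idea on a small graph without Spark, here is a minimal plain-Python sketch using union-find. The function names and the sample graph are hypothetical, and the sketch assumes every edge endpoint is present in the vertex list. It is a stand-in for the logic only, not for how connectedComponents is implemented.

```python
from collections import Counter


def component_sizes(vertex_ids, edges):
    """Map each vertex ID to the size of its connected component."""
    # Assumes every edge endpoint also appears in vertex_ids
    parent = {v: v for v in vertex_ids}

    def find(v):
        # Walk up to the root, halving the path along the way
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for src, dst in edges:
        parent[find(src)] = find(dst)

    roots = {v: find(v) for v in vertex_ids}
    counts = Counter(roots.values())
    return {v: counts[roots[v]] for v in vertex_ids}


def singleton_vertices(vertex_ids, edges):
    """Return the vertices whose component contains only themselves."""
    sizes = component_sizes(vertex_ids, edges)
    return [v for v in vertex_ids if sizes[v] == 1]


# Hypothetical sample graph, for illustration only
sample_vertices = ["a1", "a2", "b1", "b2", "b3", "alone11", "loop1"]
sample_edges = [("a1", "a2"), ("b1", "b2"), ("b2", "b3"), ("loop1", "loop1")]
print(singleton_vertices(sample_vertices, sample_edges))
```

One thing the sketch makes visible: a vertex whose only edge is a self-loop still forms a component of size 1, so a component-size filter reports it, whereas a definition based on "appears in no edge" would not. If your graph can contain self-loops, it is worth checking which behavior you actually want.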

Results on my graph:

get_connected_components.show()
+-------+-------------+
|     id|    component|
+-------+-------------+
|     b6| 240518168576|
|     a3| 343597383680|
|     a4| 343597383680|
|     c7| 498216206336|
|     b2| 240518168576|
|     c9| 498216206336|
|     c5| 498216206336|
|     c1| 498216206336|
|     c6| 498216206336|
|     a2| 343597383680|
|     b3| 240518168576|
|     b1| 240518168576|
|     c8| 498216206336|
|alone11|1116691496960|
|     a1| 343597383680|
|     c4| 498216206336|
|     c3| 498216206336|
|alone12|1340029796352|
|     b4| 240518168576|
|     c2| 498216206336|
+-------+-------------+
only showing top 20 rows


my_conncected_vertices.show()
+-------------+---------------+
| component_id|number_vertices|
+-------------+---------------+
| 240518168576|              6|
|1116691496960|              1|
|1340029796352|              1|
| 498216206336|             10|
| 343597383680|              4|
+-------------+---------------+

my_isolated_vertices.show()
+-------+
|     id|
+-------+
|alone11|
|alone12|
+-------+