GraphFrames / GraphX connected components skipping numbers

Time: 2018-12-17 15:45:43

Tags: python apache-spark spark-graphx connected-components graphframes

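I am building an identity resolution system using the Spark GraphFrames library. I have already been able to find matches using Spark. My plan is to use a graph to find transitive links between people and assign each person an ID for further analysis.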

I used the following data (from the public febrl database):

Vertex data sample:

+----------+--------+-------------+-------------------+--------------------+----------------+--------+-----+-------------+----------+---+-----+
|given_name| surname|street_number|          address_1|           address_2|          suburb|postcode|state|date_of_birth|soc_sec_id| id|block|
+----------+--------+-------------+-------------------+--------------------+----------------+--------+-----+-------------+----------+---+-----+
|  michaela| neumann|            8|     stanley street|               miami|   winston hills|    4223|  nsw|     19151111|   5304218|  0| mneu|
|  courtney| painter|           12|  pinkerton circuit|          bega flats|       richlands|    4560|  vic|     19161214|   4066625|  1| cpai|
|   charles|   green|           38|salkauskas crescent|                kela|           dapto|    4566|  nsw|     19480930|   4365168|  2| cgre|
|   vanessa|    parr|          905|     macquoid place|   broadbridge manor|   south grafton|    2135|   sa|     19951119|   9239102|  3| vpar|
|   mikayla|malloney|           37|      randwick road|             avalind|hoppers crossing|    4552|  vic|     19860208|   7207688|  4| mmal|
|     blake|   howie|            1|     cutlack street|belmont park belt...|        budgewoi|    6017|  vic|     19250301|   5180548|  5| bhow|
| blakeston| broadby|           53|     traeger street|   valley of springs|      north ward|    3083|  qld|     19120907|   4308555|  7| bbro|
|    edward| denholm|           10|        corin place|           gold tyne|       clayfield|    4221|  vic|     19660306|   7119771|  9| eden|
|   charlie|alderson|          266|hawkesbury crescent|deergarden caravn...|           cooma|    4128|  vic|     19440908|   1256748| 10| cald|
|     molly|   roche|           59|willoughby crescent|        donna valley|         carrara|    4825|  nsw|     19200712|   1847058| 11| mroc|
+----------+--------+-------------+-------------------+--------------------+----------------+--------+-----+-------------+----------+---+-----+

Edge data sample:

+---+-----+-----+
|src|  dst|match|
+---+-----+-----+
|  0|10000|    1|
|  1|17750|    1|
|  1|10001|    1|
|  1| 7750|    1|
|  2|19656|    1|
|  2|10002|    1|
|  2| 9656|    1|
|  3|19119|    1|
|  3|10003|    1|
|  3| 9119|    1|
+---+-----+-----+

Created the graph:

from graphframes import GraphFrame
g = GraphFrame(vertix_data, edge_data)

Ran connected components:

connected = g.connectedComponents(algorithm='graphframes')

The result:

+----------+--------+-------------+-------------------+--------------------+----------------+--------+-----+-------------+----------+---+-----+---------+
|given_name| surname|street_number|          address_1|           address_2|          suburb|postcode|state|date_of_birth|soc_sec_id| id|block|component|
+----------+--------+-------------+-------------------+--------------------+----------------+--------+-----+-------------+----------+---+-----+---------+
|  michaela| neumann|            8|     stanley street|               miami|   winston hills|    4223|  nsw|     19151111|   5304218|  0| mneu|        0|
|  courtney| painter|           12|  pinkerton circuit|          bega flats|       richlands|    4560|  vic|     19161214|   4066625|  1| cpai|        1|
|   charles|   green|           38|salkauskas crescent|                kela|           dapto|    4566|  nsw|     19480930|   4365168|  2| cgre|        2|
|   vanessa|    parr|          905|     macquoid place|   broadbridge manor|   south grafton|    2135|   sa|     19951119|   9239102|  3| vpar|        3|
|   mikayla|malloney|           37|      randwick road|             avalind|hoppers crossing|    4552|  vic|     19860208|   7207688|  4| mmal|        4|
|     blake|   howie|            1|     cutlack street|belmont park belt...|        budgewoi|    6017|  vic|     19250301|   5180548|  5| bhow|        5|
| blakeston| broadby|           53|     traeger street|   valley of springs|      north ward|    3083|  qld|     19120907|   4308555|  7| bbro|        7|
|    edward| denholm|           10|        corin place|           gold tyne|       clayfield|    4221|  vic|     19660306|   7119771|  9| eden|        9|
|   charlie|alderson|          266|hawkesbury crescent|deergarden caravn...|           cooma|    4128|  vic|     19440908|   1256748| 10| cald|       10|
|     molly|   roche|           59|willoughby crescent|        donna valley|         carrara|    4825|  nsw|     19200712|   1847058| 11| mroc|       11|
+----------+--------+-------------+-------------------+--------------------+----------------+--------+-----+-------------+----------+---+-----+---------+

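The component column does not always increase in increments of 1; it seems to skip numbers at random. I want it to increase in increments of 1, because this number is used to assign each person an ID. Does anyone know why GraphFrames does this?

When I looked into this further, in a development DataFrame of about 20,000 rows, roughly 17% of the entries have a skip. In extreme cases the gap can be as large as 20-30, i.e. one row has ID 5846 and the next has ID 5868. My concern is that when I scale this up by a factor of thousands, the gaps between IDs will become very large and may cause problems.

TL;DR: Why do Spark's connected components appear to skip values at random instead of always increasing by 1?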

1 Answer:

Answer 0: (score: 0)

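The GraphFrames documentation never guarantees consecutive IDs. The only guarantee it provides is:

    The resulting DataFrame contains all the vertex information and one additional column:

    component (LongType): unique ID for this component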

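In practice, the GraphX implementation uses the lowest vertex ID in the component ("return a graph with the vertex value containing the lowest vertex id in the connected component containing that vertex"), and GraphFrames seems to do the same thing. A component's ID is therefore always one of its members' vertex IDs, so the values skip whenever a component contains more than one vertex or the vertex IDs themselves have gaps.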
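
To make the effect concrete, here is a minimal pure-Python sketch (not the GraphFrames or GraphX implementation) that labels each vertex with the lowest vertex ID in its component and then remaps those labels to consecutive integers. The function names `min_id_components` and `dense_ids` and the toy data are hypothetical, chosen only for illustration, and every edge endpoint is assumed to appear in the vertex list.

```python
def min_id_components(vertex_ids, edges):
    """Label every vertex with the smallest vertex id in its connected component."""
    parent = {v: v for v in vertex_ids}

    def find(v):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for src, dst in edges:
        a, b = find(src), find(dst)
        if a != b:
            # Attach the larger root under the smaller one, so each root
            # stays the minimum id of its component.
            lo, hi = (a, b) if a < b else (b, a)
            parent[hi] = lo

    return {v: find(v) for v in vertex_ids}


def dense_ids(component_of):
    """Map arbitrary component labels onto 0, 1, 2, ... in sorted label order."""
    labels = sorted(set(component_of.values()))
    rank = {label: i for i, label in enumerate(labels)}
    return {v: rank[label] for v, label in component_of.items()}


# Hypothetical toy data, not taken from the febrl sample above.
vertices = [0, 1, 2, 3, 4, 5, 7]
edges = [(0, 1), (1, 2), (3, 4), (5, 7)]

components = min_id_components(vertices, edges)
person_ids = dense_ids(components)

print(components)   # component labels are 0, 3 and 5: unique, but not consecutive
print(person_ids)   # remapped labels are 0, 1 and 2
```

If consecutive IDs are really needed on the Spark side, one possible approach is to apply `pyspark.sql.functions.dense_rank` over a window ordered by the `component` column. Note that a window without a partition key moves all rows to a single partition, so this may not scale well; since the component value is already a unique identifier, it may be simpler to use it as the person ID directly.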