我正在使用Spark Graphframes库创建一个身份解析系统。我已经能够使用spark查找匹配项。我的计划是使用图表查找人与人之间的瞬时链接,并为他们分配一个ID,以供进一步分析等。
我使用了以下数据(来自公共febrl数据库):
顶点数据示例:
+----------+--------+-------------+-------------------+--------------------+----------------+--------+-----+-------------+----------+---+-----+
|given_name| surname|street_number| address_1| address_2| suburb|postcode|state|date_of_birth|soc_sec_id| id|block|
+----------+--------+-------------+-------------------+--------------------+----------------+--------+-----+-------------+----------+---+-----+
| michaela| neumann| 8| stanley street| miami| winston hills| 4223| nsw| 19151111| 5304218| 0| mneu|
| courtney| painter| 12| pinkerton circuit| bega flats| richlands| 4560| vic| 19161214| 4066625| 1| cpai|
| charles| green| 38|salkauskas crescent| kela| dapto| 4566| nsw| 19480930| 4365168| 2| cgre|
| vanessa| parr| 905| macquoid place| broadbridge manor| south grafton| 2135| sa| 19951119| 9239102| 3| vpar|
| mikayla|malloney| 37| randwick road| avalind|hoppers crossing| 4552| vic| 19860208| 7207688| 4| mmal|
| blake| howie| 1| cutlack street|belmont park belt...| budgewoi| 6017| vic| 19250301| 5180548| 5| bhow|
| blakeston| broadby| 53| traeger street| valley of springs| north ward| 3083| qld| 19120907| 4308555| 7| bbro|
| edward| denholm| 10| corin place| gold tyne| clayfield| 4221| vic| 19660306| 7119771| 9| eden|
| charlie|alderson| 266|hawkesbury crescent|deergarden caravn...| cooma| 4128| vic| 19440908| 1256748| 10| cald|
| molly| roche| 59|willoughby crescent| donna valley| carrara| 4825| nsw| 19200712| 1847058| 11| mroc|
+----------+--------+-------------+-------------------+--------------------+----------------+--------+-----+-------------+----------+---+-----+
边缘数据样本:
+---+-----+-----+
|src| dst|match|
+---+-----+-----+
| 0|10000| 1|
| 1|17750| 1|
| 1|10001| 1|
| 1| 7750| 1|
| 2|19656| 1|
| 2|10002| 1|
| 2| 9656| 1|
| 3|19119| 1|
| 3|10003| 1|
| 3| 9119| 1|
+---+-----+-----+
创建的图形:
g = GraphFrame(vertix_data, edge_data)
已使用的已连接组件:
connected = g.connectedComponents(algorithm='graphframes')
结果为:
+----------+--------+-------------+-------------------+--------------------+----------------+--------+-----+-------------+----------+---+-----+---------+
|given_name| surname|street_number| address_1| address_2| suburb|postcode|state|date_of_birth|soc_sec_id| id|block|component|
+----------+--------+-------------+-------------------+--------------------+----------------+--------+-----+-------------+----------+---+-----+---------+
| michaela| neumann| 8| stanley street| miami| winston hills| 4223| nsw| 19151111| 5304218| 0| mneu| 0|
| courtney| painter| 12| pinkerton circuit| bega flats| richlands| 4560| vic| 19161214| 4066625| 1| cpai| 1|
| charles| green| 38|salkauskas crescent| kela| dapto| 4566| nsw| 19480930| 4365168| 2| cgre| 2|
| vanessa| parr| 905| macquoid place| broadbridge manor| south grafton| 2135| sa| 19951119| 9239102| 3| vpar| 3|
| mikayla|malloney| 37| randwick road| avalind|hoppers crossing| 4552| vic| 19860208| 7207688| 4| mmal| 4|
| blake| howie| 1| cutlack street|belmont park belt...| budgewoi| 6017| vic| 19250301| 5180548| 5| bhow| 5|
| blakeston| broadby| 53| traeger street| valley of springs| north ward| 3083| qld| 19120907| 4308555| 7| bbro| 7|
| edward| denholm| 10| corin place| gold tyne| clayfield| 4221| vic| 19660306| 7119771| 9| eden| 9|
| charlie|alderson| 266|hawkesbury crescent|deergarden caravn...| cooma| 4128| vic| 19440908| 1256748| 10| cald| 10|
| molly| roche| 59|willoughby crescent| donna valley| carrara| 4825| nsw| 19200712| 1847058| 11| mroc| 11|
+----------+--------+-------------+-------------------+--------------------+----------------+--------+-----+-------------+----------+---+-----+---------+
component列并不总是以1的增量增加,而是似乎随机跳过数字,我想确保以1的增量增加,因为使用此数字为每个人分配了一个ID。 有人知道为什么使用Graphframes吗?
当我进一步研究时,对于开发数据框中的大约20,000行,其中大约17%的条目中有一个跳过项。在极端情况下,差距可能高达20-30,即,一行ID为5846,而下一行ID为5868。我担心的是,当我以成千上万的比例进行缩放时,ID之间的差距会变得非常大可能会产生问题。
TL; DR:为什么Sparks连接的组件似乎会随机跳过值而并不总是增加1?
答案 0 :(得分:0)
Graphframes文档从不保证连续的ID,相反,它提供的唯一保证是:
生成的DataFrame包含所有顶点信息和另外一列:
component(LongType):此组件的唯一ID
实际上,GraphX实现使用组件("return a graph with the vertex value containing the lowest vertex id in the connected component containing that vertex")和Graphframes seems to do the same thing中最小的ID。