Finding the longest cyclic path in a graph with Gremlin

Time: 2018-03-02 12:07:27

Tags: graph datastax gremlin tinkerpop

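I am trying to build a Gremlin query for DSE Graph with geo search enabled (indexed in Solr). The problem is that the graph is so densely interconnected that the cyclic-path traversal times out. The prototype graph I am working with has ~1,600 vertices and ~35K edges. The number of triangles passing through each vertex is summarized below: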

+--------------------+-----+                                                    
|                 gps|count|
+--------------------+-----+
|POINT (-0.0462032...| 1502|
|POINT (-0.0458048...|  405|
|POINT (-0.0460680...|  488|
|POINT (-0.0478356...| 1176|
|POINT (-0.0479465...| 5566|
|POINT (-0.0481031...| 9896|
|POINT (-0.0484724...|  433|
|POINT (-0.0469379...|  302|
|POINT (-0.0456595...|  394|
|POINT (-0.0450722...|  614|
|POINT (-0.0475904...| 3080|
|POINT (-0.0479464...| 5566|
|POINT (-0.0483400...|  470|
|POINT (-0.0511753...|  370|
|POINT (-0.0521901...| 1746|
|POINT (-0.0519999...| 1026|
|POINT (-0.0468071...| 1247|
|POINT (-0.0469636...| 1165|
|POINT (-0.0463685...|  526|
|POINT (-0.0465805...| 1310|
+--------------------+-----+
only showing top 20 rows
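
To make the triangle counts above concrete, here is a minimal Python sketch of counting the triangles through each vertex. It is not how the table was produced; the adjacency-dict representation, the function name `triangles_per_vertex`, and the toy graph are all assumptions for illustration.

```python
from itertools import combinations

def triangles_per_vertex(adj):
    """Count, for every vertex, the triangles that pass through it.

    adj maps each vertex to the set of its neighbours
    (undirected graph, no self-loops).
    """
    counts = {}
    for v, nbrs in adj.items():
        # A triangle through v is a pair of v's neighbours
        # that are themselves adjacent.
        counts[v] = sum(
            1 for a, b in combinations(sorted(nbrs), 2) if b in adj[a]
        )
    return counts

# Hypothetical toy graph: a 4-clique a-b-c-d plus a pendant vertex e on d.
toy = {
    "a": {"b", "c", "d"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c", "e"},
    "e": {"d"},
}
print(triangles_per_vertex(toy))
```

Vertices with thousands of triangles, as in the table, are the ones around which a path enumeration branches most heavily.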

I expect the graph to grow large eventually, but I will restrict the cycle search to a geographic area (say, a radius of ~300 meters).
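
As a rough illustration of what restricting the search to a radius means, the following Python sketch filters (lon, lat) points by great-circle distance from a centre. In DSE Graph this filtering is done by the geo index, so the haversine formula, the helper names, and the sample offsets here are assumptions, not the mechanism the database uses.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two (lon, lat) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def within_radius(points, centre, radius_m):
    """Keep only the (lon, lat) points no farther than radius_m from centre."""
    clon, clat = centre
    return [p for p in points if haversine_m(clon, clat, p[0], p[1]) <= radius_m]

# Hypothetical candidates: the centre itself, a point ~111 m north, a point ~1.1 km north.
centre = (-0.04813968113126384, 51.531259899256995)
candidates = [centre, (centre[0], centre[1] + 0.001), (centre[0], centre[1] + 0.01)]
nearby = within_radius(candidates, centre, 300)
print(len(nearby))
```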

My best attempt so far is some variation of the following:

g.V().has('gps',Geo.point(lon, lat)).as('P')
.repeat(both()).until(cyclicPath()).path().by('gps')

Script evaluation exceeded the configured threshold of realtime_evaluation_timeout at 180000 ms for the request

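For illustration, the figure below shows the start vertex in green and the end vertex in red. Assume all vertices are interconnected. I am interested in the longest path between green and red, which would be the loop around the block.

[Figure: start vertex in green, end vertex in red]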

Some links I have read that did not help:

1)http://tinkerpop.apache.org/docs/current/recipes/#cycle-detection

2)Longest acyclic path in a directed unweighted graph

3)https://groups.google.com/forum/#!msg/gremlin-users/tc8zsoEWb5k/9X9LW-7bCgAJ

Edit

Following Daniel's suggestion to create a subgraph, it still times out:

gremlin> hood = g.V().hasLabel('image').has('gps', Geo.inside(point(-0.04813968113126384, 51.531259899256995), 100, Unit.METERS)).bothE().subgraph('hood').cap('hood').next()
==>tinkergraph[vertices:640 edges:28078]
gremlin> hg = hood.traversal()
==>graphtraversalsource[tinkergraph[vertices:640 edges:28078], standard]
gremlin> hg.V().has('gps', Geo.point(-0.04813968113126384, 51.531259899256995)).as('x')
==>v[{~label=image, partition_key=2507574903070261248, cluster_key=RFAHA095CLK-2017-09-14 12:52:31.613}]
gremlin> hg.V().has('gps', Geo.point(-0.04813968113126384, 51.531259899256995)).as('x').repeat(both().simplePath()).emit(where(both().as('x'))).both().where(eq('x')).tail(1).path()
Script evaluation exceeded the configured threshold of realtime_evaluation_timeout at 180000 ms for the request: [91b6f1fa-0626-40a3-9466-5d28c7b5c27c - hg.V().has('gps', Geo.point(-0.04813968113126384, 51.531259899256995)).as('x').repeat(both().simplePath()).emit(where(both().as('x'))).both().where(eq('x')).tail(1).path()]
Type ':help' or ':h' for help.
Display stack trace? [yN]n

2 answers:

Answer 0 (score: 2)

The longest path, measured in hops, will be the last path you find.

g.V().has('gps', Geo.point(x, y)).as('x').
  repeat(both().simplePath()).
    emit(where(both().as('x'))).
  both().where(eq('x')).tail(1).
  path()

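Unless you have a very small (sub)graph, there is no way to make this query perform well in OLTP. So, depending on what you consider a "city block" in your graph, you should first extract it as a subgraph and then apply the longest-path query (in memory).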
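
For a sense of what the in-memory search has to do, here is a minimal Python sketch of a brute-force longest-simple-cycle search through one start vertex. It is not the Gremlin traversal above: the adjacency-list dict, the function name, and the toy graph are assumptions, and unlike the traversal it ignores the trivial out-and-back "cycle" over a single edge. The running time is exponential in the worst case, which is why this is only feasible on a small subgraph.

```python
def longest_cycle_through(adj, start):
    """Return the longest simple cycle through start as a vertex list.

    adj maps each vertex to a list of neighbours (undirected graph).
    The result starts and ends with start; it is [] if no cycle exists.
    """
    best = []

    def dfs(v, path, visited):
        nonlocal best
        for w in adj[v]:
            if w == start and len(path) >= 3:
                # Closing the loop: keep it if it beats the best so far.
                if len(path) + 1 > len(best):
                    best = path + [start]
            elif w not in visited:
                # Extend the simple path, recurse, then backtrack.
                visited.add(w)
                path.append(w)
                dfs(w, path, visited)
                path.pop()
                visited.remove(w)

    dfs(start, [start], {start})
    return best

# Hypothetical toy graph: a square a-b-c-d with one diagonal a-c.
square = {
    "a": ["b", "c", "d"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["a", "c"],
}
print(longest_cycle_through(square, "a"))  # ['a', 'b', 'c', 'd', 'a']
```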

Answer 1 (score: 0)

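One solution I came up with uses Spark GraphFrames and its label propagation algorithm (GraphFrames LPA). You can then compute the average GPS position of each community (in fact you do not even need the average; a single member of each community is enough), along with all the edges that exist between the community representatives (averaged or otherwise).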
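
The idea behind label propagation can be sketched in a few lines of Python. This is a simplified, deterministic variant written for illustration, not the GraphFrames implementation: the synchronous update, the smallest-label tie-break, and the function name are assumptions, and GraphFrames may resolve ties differently.

```python
from collections import Counter

def label_propagation(adj, max_iter=5):
    """Assign a community label to every vertex.

    adj maps each vertex to the set of its neighbours. Every vertex starts
    with its own id as label and, in each round, adopts the most frequent
    label among its neighbours (ties go to the smallest label).
    """
    labels = {v: v for v in adj}
    for _ in range(max_iter):
        new_labels = {}
        for v, nbrs in adj.items():
            if not nbrs:
                new_labels[v] = labels[v]  # isolated vertex keeps its label
                continue
            freq = Counter(labels[n] for n in nbrs)
            top = max(freq.values())
            new_labels[v] = min(lbl for lbl, c in freq.items() if c == top)
        if new_labels == labels:
            break  # converged early
        labels = new_labels
    return labels

# Hypothetical toy graph: two triangles joined by the single edge 2-3.
two_triangles = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
communities = label_propagation(two_triangles)
print(communities)
```

On this toy graph each triangle ends up with a single shared label, which is the kind of grouping the community step relies on.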

Select a region of the graph and save its vertices and edges:

g.V().has('gps', Geo.inside(Geo.point(x,y), radius, Unit.METERS))
.bothE().subgraph('g').cap('g')

Spark snippet:

import org.graphframes.GraphFrame

val V = spark.read.json("v.json")
val E = spark.read.json("e.json")
val g = GraphFrame(V,E)
val result = g.labelPropagation.maxIter(5).run()

val rdd = result.select("fullgps", "label").map(row => {
    val coords = row.getString(0).split(",")
    val x = coords(0).toDouble
    val y = coords(1).toDouble
    val z = coords(2).toDouble
    val id = row.getLong(1)
    (x,y,z,id)
    }).rdd

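// Helper assumed here because the original does not show toCsv: join fields with commas.
def toCsv(fields: Array[String]): String = fields.mkString(",")

// One row per member, keyed by community label (GraphFrame expects the vertex key column to be named "id").
val newVertexes = rdd.map{ case (x:Double,y:Double,z:Double,id:Long) => (id, (x,y,z)) }.toDF("id","gps")

// Average GPS per community: sum the coordinates and member counts, then divide.
rdd.map{ case (x:Double,y:Double,z:Double,id:Long) => (id, (x,y,z)) }
  .mapValues(value => (value,1))
  .reduceByKey{ case (((xL:Double,yL:Double,zL:Double), countL:Int), ((xR:Double,yR:Double,zR:Double), countR:Int)) =>
    ((xR+xL,yR+yL,zR+zL),countR+countL) }
  .map{ case (id,((x,y,z),c)) => (id, ((x/c,y/c,z/c),c)) }
  .map{ case (id:Long, ((x:Double, y:Double, z:Double), count:Int)) =>
    Array(x.toString,y.toString,z.toString,id.toString,count.toString) }
  .map(a => toCsv(a))
  .saveAsTextFile("avg_gps.csv")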

// Keep IDs
val rdd2 = result.select("id", "label").map(row => {
       val id = row.getString(0)
       val lbl = row.getLong(1)
       (lbl, id) }).rdd

val edgeDF = E.select("dst","src").map(row => (row.getString(0),row.getString(1))).toDF("dst","src")

// Src
val tmp0 = result.select("id","label").join(edgeDF, result("id") === edgeDF("src")).withColumnRenamed("lbl","src_lbl")
val srcDF = tmp0.select("src","dst","label").map(row => { (row.getString(0)+"###"+row.getString(1),row.getLong(2)) }).withColumnRenamed("_1","src_lbl").withColumnRenamed("_2","src_edge")

// Dst
val tmp1 = result.select("id","label").join(edgeDF, result("id") === edgeDF("dst")).withColumnRenamed("lbl","dst_lbl")
val dstDF = tmp1.select("src","dst","label").map(row => { (row.getString(0)+"###"+row.getString(1),row.getLong(2)) }).withColumnRenamed("_1","dst_lbl").withColumnRenamed("_2","dst_edge")

val newE = srcDF.join(dstDF, srcDF("src_lbl")===dstDF("dst_lbl"))
val newEdges = newE.filter(newE("src_edge")=!=newE("dst_edge")).select("src_edge","dst_edge").map(row => { (row.getLong(0).toString + "###" + row.getLong(1).toString, row.getLong(0), row.getLong(1)) }).withColumnRenamed("_1","edge").withColumnRenamed("_2","src").withColumnRenamed("_3","dst").dropDuplicates("edge").select("src","dst")

val newGraph = GraphFrame(newVertexes, newEdges)

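The averaged positions are then connected by edges; in this case the problem shrinks from ~1,600 vertices and ~35K edges to 25 vertices and 54 edges: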

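[Figure: communities shown as colored segments, with green circles at the average GPS positions]

The non-green segments here (red, white, black, etc.) represent the individual communities. The green circles are the average GPS positions, sized in proportion to the number of members in each community. Running an OLTP algorithm, such as the one Daniel proposed in the comments above, is now much easier.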
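
The reduction step itself (one averaged vertex per community, plus an edge wherever two communities are linked) can be sketched in plain Python. This mirrors the intent of the Spark code rather than its exact output: the function name, the input shapes, and the choice to treat edges as undirected are assumptions.

```python
from collections import defaultdict

def condense(labels, coords, edges):
    """Collapse a labelled graph into one vertex per community.

    labels: vertex -> community label
    coords: vertex -> (x, y, z)
    edges:  iterable of (src, dst) vertex pairs

    Returns (centroids, community_edges), where centroids maps each label to
    (mean_x, mean_y, mean_z, member_count) and community_edges is a set of
    sorted label pairs (edges are treated as undirected in this sketch).
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0])
    for v, lbl in labels.items():
        x, y, z = coords[v]
        s = sums[lbl]
        s[0] += x
        s[1] += y
        s[2] += z
        s[3] += 1
    centroids = {
        lbl: (s[0] / s[3], s[1] / s[3], s[2] / s[3], s[3])
        for lbl, s in sums.items()
    }
    # Keep only edges that cross a community boundary, one per label pair.
    community_edges = {
        tuple(sorted((labels[s], labels[d])))
        for s, d in edges
        if labels[s] != labels[d]
    }
    return centroids, community_edges

# Hypothetical input: two communities (labels 0 and 2) joined by one edge.
labels = {0: 0, 1: 0, 2: 0, 3: 2, 4: 2, 5: 2}
coords = {0: (0, 0, 0), 1: (3, 0, 0), 2: (0, 3, 0),
          3: (10, 10, 0), 4: (13, 10, 0), 5: (10, 13, 0)}
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
centroids, community_edges = condense(labels, coords, edges)
print(centroids)
print(community_edges)
```

The condensed graph has one vertex per community, so an exhaustive path search over it is far cheaper than over the original vertices.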