Spark GraphX: filtering triplets by a given vertex value

Time: 2017-03-30 12:16:46

Tags: scala apache-spark spark-graphx

I am using Spark 2.1.0 on Windows 10. Since I am new to Spark, I am following this tutorial.

In the tutorial, the author prints all of the graph's triplets with the following code:

graph.triplets.sortBy(_.attr, ascending=false).map(triplet =>
"There were " + triplet.attr.toString + " flights from " + triplet.srcAttr + " to " + triplet.dstAttr + ".").take(10)

Question: I want to supply an input (for example ATL) and see all outbound flights from ATL together with their counts, like this:

res60: Array[String] = Array(There were 1388 flights from ATL to LAX.,
There were 1330 flights from ATL to SFO., There were 1283 flights from ATL to HNL., 
There were 1205 flights from ATL to BOS., There were 1229 flights from ATL to LGA., 
There were 1214 flights from ATL to OGG., There were 1173 flights from ATL to LAS., 
There were 1157 flights from ATL to SAN.)

1 Answer:

Answer 0: (score: 0)

Here is the code:

import org.apache.spark.graphx._
import scala.util.hashing.MurmurHash3

// Select the desired airport
val input = "ATL"
// Hash it once; this assumes the graph's vertex IDs are MurmurHash3.stringHash of the airport code
val inputId: VertexId = MurmurHash3.stringHash(input).toLong
// Keep only the edges of `graph` (which is built on the full data) that leave the desired airport (here "ATL")
val TEMPEdge = graph.edges.filter { case Edge(src, _, _) => src == inputId }
// Create a new graph from the filtered edges
val TEMPGraph = Graph(airportVertices, TEMPEdge, defaultAirport)
// Print the top 10
TEMPGraph.triplets.sortBy(_.attr, ascending = false).map(triplet => "There were " + triplet.attr.toString + " flights from " + triplet.srcAttr + " to " + triplet.dstAttr + "\n").take(10)

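Alternatively, we can filter the triplets directly on the source airport, since outbound flights are the ones whose srcAttr matches the input:

graph.triplets.filter { _.srcAttr == input }.sortBy(_.attr, ascending = false).map(triplet => "There were " + triplet.attr.toString + " flights from " + triplet.srcAttr + " to " + triplet.dstAttr + "\n").take(3)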
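
If you want to try the triplet filter without the tutorial's data set, here is a minimal self-contained sketch. Everything in it is an assumption made for illustration: the toy routes and counts are made up, `OutboundFlights`, `id`, `toyGraph` and `outbound` are hypothetical names, and vertex IDs are taken to be MurmurHash3 hashes of the airport code, as the first snippet implies. It needs spark-core and spark-graphx on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import scala.util.hashing.MurmurHash3

object OutboundFlights {
  // Hypothetical helper: derive a vertex ID from an airport code
  // (assumed to mirror the hashing scheme used when the graph was built).
  def id(code: String): VertexId = MurmurHash3.stringHash(code).toLong

  // Made-up toy graph: vertex attribute = airport code, edge attribute = flight count.
  def toyGraph(sc: SparkContext): Graph[String, Int] = {
    val vertices = sc.parallelize(Seq("ATL", "LAX", "SFO", "BOS").map(c => (id(c), c)))
    val edges = sc.parallelize(Seq(
      Edge(id("ATL"), id("LAX"), 30),
      Edge(id("ATL"), id("SFO"), 20),
      Edge(id("LAX"), id("ATL"), 25),
      Edge(id("BOS"), id("SFO"), 10)
    ))
    Graph(vertices, edges, "unknown")
  }

  // Up to n outbound routes of `airport`, busiest first.
  def outbound(graph: Graph[String, Int], airport: String, n: Int): Array[String] =
    graph.triplets
      .filter(_.srcAttr == airport)              // keep only edges leaving `airport`
      .map(t => (t.attr, t.srcAttr, t.dstAttr))  // copy out plain values before sorting
      .sortBy(_._1, ascending = false)           // highest flight count first
      .take(n)
      .map { case (count, src, dst) => s"There were $count flights from $src to $dst." }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("OutboundFlights")
    val sc = new SparkContext(conf)
    try outbound(toyGraph(sc), "ATL", 10).foreach(println)
    finally sc.stop()
  }
}
```

Filtering before sorting means only the matching triplets are shuffled by `sortBy`. If you need the result as a graph rather than an array of strings, `graph.subgraph(epred = _.srcAttr == input)` should give the same restriction without rebuilding the graph from edges by hand.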