Tuples

Date: 2017-10-03 19:53:13

Tags: scala apache-spark spark-streaming spark-graphx

After processing some input data, I end up with an RDD[(String, String, Long)], let's call it input:

input: org.apache.spark.rdd.RDD[(String, String, Long)] = MapPartitionsRDD[9] at flatMap at <console>:54

The String fields here represent the vertices of the graph, and the Long field is the weight of the edge.

To create a graph, I first insert each vertex into a map with a unique id if it has not been seen before; if it has already been encountered, I reuse the previously assigned vertex id. In other words, every vertex gets a unique id of type Long, and then I want to create the edges.

Here is what I am doing:

var vertexMap = collection.mutable.Map[String, Long]()
var vid : Long = 0          // global vertex id counter
var srcVid : Long = 0       // source vertex id
var dstVid : Long = 0       // destination vertex id

val graphEdges = input.map {
    case Row(src: String, dst: String, weight: Long) => (
        if (vertexMap.contains(src)) {
            srcVid = vertexMap(src)
            if (vertexMap.contains(dst)) {
                dstVid = vertexMap(dst)
            } else {
                vid += 1   // pick a new vertex id
                vertexMap += (dst -> vid)
                dstVid = vid
            }
            Edge(srcVid, dstVid, weight)
        } else {
            vid += 1
            vertexMap(src) = vid
            srcVid = vid
            if (vertexMap.contains(dst)) {
                dstVid = vertexMap(dst)
            } else {
                vid += 1
                vertexMap(dst) = vid
                dstVid = vid
            }
            Edge(srcVid, dstVid, weight)
        }
    }

val graph = Graph.fromEdges(graphEdges, 0)
println("num edges = " + graph.numEdges);
println("num vertices = " + graph.numVertices);

What I see is that

graphEdges has the type RDD[org.apache.spark.graphx.Edge[Long]] and graph has the type Graph[Int,Long]:

graphEdges: org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[Long]] = MapPartitionsRDD[10] at map at <console>:64
graph: org.apache.spark.graphx.Graph[Int,Long] = org.apache.spark.graphx.impl.GraphImpl@1b48170a

But I get the following error when printing the graph's edge and vertex counts:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 8.0 failed 1 times, most recent failure: Lost task 1.0 in stage 8.0 (TID 9, localhost, executor driver): scala.MatchError: (vertexA, vertexN, 2000) (of class scala.Tuple3)
        at $anonfun$1.apply(<console>:64)
        at $anonfun$1.apply(<console>:64)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)
        at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
        at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
        at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

I don't understand where the mismatch is here.

Thanks to @Joe K for the helpful tip. I switched to zipWithIndex and the code now looks compact, but the graph instantiation still fails. Here is the updated code:

val vertices = input.map(r => r._1).union(input.map(r => r._2)).distinct.zipWithIndex
val graphEdges = input.map {
    case (src, dst, weight) =>
        Edge(vertices.lookup(src)(0), vertices.lookup(dst)(0), weight)
}
val graph = Graph.fromEdges(graphEdges, 0)
println("num edges = " + graph.numEdges);

So, starting from the original 3-tuples, I take the union of the 1st and 2nd fields (which are the vertices), de-duplicate them with distinct, and assign each one a unique id via zipWithIndex. I then use those ids while creating the edges. However, it fails with the following exception:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 15.0 failed 1 times, most recent failure: Lost task 0.0 in stage 15.0 (TID 23, localhost, executor driver): org.apache.spark.SparkException: This RDD lacks a SparkContext. It could happen in the following cases:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.
        at org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$sc(RDD.scala:89)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
        at org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:937)
        at $anonfun$1.apply(<console>:55)
        at $anonfun$1.apply(<console>:53)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
        at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)

Any ideas?

1 Answer:

Answer 0 (score: 1):

This particular error comes from trying to match a tuple against a Row, which it is not.

Change:

case Row(src: String, dst: String, weight: Long) => {

to simply:

case (src, dst, weight) => {

Also, your larger plan for generating vertex ids will not work. All of the logic inside the map runs in parallel on different executors, each of which will have its own copy of the mutable map.

You should flatMap to get a list of all the vertices, then call .distinct.zipWithIndex to assign each vertex a single unique Long value, and then re-join those ids back onto the original edges.
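
A minimal sketch of that approach, assuming the same input: RDD[(String, String, Long)] from the question and an active Spark session (the vertexIds name is just illustrative, not something from the original code):

import org.apache.spark.rdd.RDD
import org.apache.spark.graphx.{Edge, Graph}

// Collect every vertex name from both tuple positions, de-duplicate, and assign unique Long ids.
val vertexIds: RDD[(String, Long)] =
  input.flatMap { case (src, dst, _) => Seq(src, dst) }
       .distinct
       .zipWithIndex

// Resolve the ids with two joins (source, then destination) instead of calling
// lookup() inside a transformation, which is what triggers the SparkContext error.
val graphEdges: RDD[Edge[Long]] =
  input.map { case (src, dst, weight) => (src, (dst, weight)) }
       .join(vertexIds)                                   // (src, ((dst, weight), srcId))
       .map { case (_, ((dst, weight), srcId)) => (dst, (srcId, weight)) }
       .join(vertexIds)                                   // (dst, ((srcId, weight), dstId))
       .map { case (_, ((srcId, weight), dstId)) => Edge(srcId, dstId, weight) }

val graph = Graph.fromEdges(graphEdges, 0)

Because the id lookups happen as joins on the cluster, no RDD is referenced from inside another RDD's transformation, and no mutable state needs to be shared across executors.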