Using a Map in Spark cluster mode

Date: 2016-09-05 10:52:31

Tags: dictionary apache-spark cluster-computing broadcast

I have an immutable Map in my class. When I run my code in local mode there is no problem and I can access every key in the map. However, when I run the same code in cluster mode, the nodes throw errors about keys not being found in the map.

What I have done so far is:

- Broadcast the immutable map across the cluster:

broadcast = sc.broadcast(my_immutable_map)

- Parallelize the map as a pair RDD:

my_map_rdd = sc.parallelize( my_immutable_map.toSeq) 
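
For reference, a minimal self-contained sketch of this broadcast pattern; the object name, sample data, and keys below are illustrative, not taken from the original code:

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastMapSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-map-sketch"))

    // A small driver-side immutable map standing in for my_immutable_map
    val lookup = Map("905053199731" -> "a-value", "905053199732" -> "another-value")

    // Ship one read-only copy of the map to each executor
    val bc = sc.broadcast(lookup)

    // Executors read the map through bc.value inside the closure;
    // .get returns an Option instead of throwing on a missing key
    val resolved = sc.parallelize(Seq("905053199731", "unknown-key"))
      .map(k => (k, bc.value.get(k)))
      .collect()

    resolved.foreach(println)
    sc.stop()
  }
}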

When I check the logs, I see a "key not found" exception. My error stack trace is as follows:

Driver stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 15.0 failed 4 times, most recent failure: Lost task 1.3 in stage 15.0 (TID 25, datanode1.big.com): java.util.NoSuchElementException: key not found: 905053199731
    at scala.collection.MapLike$class.default(MapLike.scala:228)
    at scala.collection.AbstractMap.default(Map.scala:58)
    at scala.collection.MapLike$class.apply(MapLike.scala:141)
    at scala.collection.AbstractMap.apply(Map.scala:58)
    at havelsan.CDRGenerator$.generate_random_target(CDRGenerator.scala:95)
    at havelsan.CDRGenerator$$anonfun$main$2$$anonfun$6.apply(CDRGenerator.scala:167)
    at havelsan.CDRGenerator$$anonfun$main$2$$anonfun$6.apply(CDRGenerator.scala:165)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply$mcV$sp(PairRDDFunctions.scala:1197)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1197)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1197)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1251)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1205)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1185)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Can you explain how Spark distributes the map, and how it is possible that some nodes cannot find certain keys in it? By the way, my Spark version is 1.6.0.

What am I missing?

UPDATE

This part initializes the map on the driver:

...
    var pd = sc.textFile( "hdfs://...")
    my_immutable_map = pd.map( line => line.split(":") ).map{ line => (line(0), line(1).split(","))}.collectAsMap
... 

    broadcast = sc.broadcast(my_immutable_map)
    my_map_rdd = sc.parallelize( my_immutable_map.toSeq) 

This is the part where I get the error:

def my_func(key: String): String = {
...
    // Map.apply throws NoSuchElementException when the key is absent
    val my_value = broadcast.value(key)
...
}

my_func is called inside a map like this:

my_another_rdd.map { line =>
    val key = line.split(",")(0)
    my_func(key)
}

1 Answer:

Answer 0 (score: 0)

The solution I found is to pass the broadcast value to the function as a parameter, as sketched below. Even so, I could not find a solution for the parallelize approach.
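
A minimal sketch of that workaround, assuming the Map[String, Array[String]] produced by the initialization above (everything except broadcast, my_func, and my_another_rdd is illustrative):

import org.apache.spark.broadcast.Broadcast

// The function takes the broadcast handle explicitly instead of reading
// it from a field of the enclosing object.
def my_func(key: String, bc: Broadcast[scala.collection.Map[String, Array[String]]]): String = {
  // Option-based lookup; bc.value(key) would throw NoSuchElementException
  // if the key is missing on the executor.
  bc.value.get(key).map(_.mkString(",")).getOrElse("")
}

my_another_rdd.map { line =>
  val key = line.split(",")(0)
  my_func(key, broadcast)
}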

https://stackoverflow.com/a/34912887/4668959