I am running Spark version 1.3.1 as a cluster. I want to pipe data into a Python script from the Scala Spark context. I found a way to do this, namely adding the Python file to the Spark context. This works in standalone Spark mode, but it does not work on the cluster.

The Python file should be distributed to the other nodes as well as the master, but that is not happening. I checked the tmp folder to see the added file; only the master has it. What is wrong here? Is this a bug, or is there something I have missed?
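For context, my understanding is that sc.addFile should copy the file to every node, and that SparkFiles.get then resolves the file name against each node's own download directory, so the name passed to get must match the name of the added file. Roughly the semantics I expect (a sketch of my mental model, not the actual Spark source):

import java.io.File
import org.apache.spark.SparkFiles

// What I expect SparkFiles.get to do on each node: resolve the added
// file's name against that node's local download directory.
def resolveAddedFile(fileName: String): String =
  new File(SparkFiles.getRootDirectory(), fileName).getAbsolutePath

Here is the session: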
scala> import org.apache.spark.SparkFiles
scala> val script_file = "/home/hdfs/my_script.py"
scala> val script_name = "my_script.py"
scala>
scala> sc.addFile(script_file)
16/09/26 07:04:39 INFO Utils: Copying /home/hdfs/my_script.py to /tmp/spark-d4dd61c2-f422-4846-8486-5705ee54e320/userFiles-c8a810d7-3ef8-43bc-a346-4822c8b817e7/my_script.py
16/09/26 07:04:39 INFO SparkContext: Added file /home/hdfs/my_script.py at http://x.x.x.x:39017/files/my_script.py with timestamp 1474873479690
scala>
scala> val ipData = sc.parallelize(List("test data 1","test data 2"))
scala> val opData = ipData.pipe(SparkFiles.get(script_name))
scala> opData.foreach(println)
The output says there is no such file on the other nodes:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 30 in stage 0.0 failed 4 times, most recent failure: Lost task 30.3 in stage 0.0 (TID 60, my_other_node_1): java.io.IOException: Cannot run program "/tmp/spark-d4dd61c2-f422-4846-8486-5705ee54e320/userFiles-c8a810d7-3ef8-43bc-a346-4822c8b817e7/my_script.py": error=2, No such file or directory
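In case the REPL setup matters, here is a self-contained version of the same steps as a submittable app (assuming the hypothetical /home/hdfs/my_script.py exists on the driver and simply echoes its stdin):

import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

// Minimal reproduction of the steps above as a standalone application.
object PipeRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PipeRepro"))
    sc.addFile("/home/hdfs/my_script.py")

    val ipData = sc.parallelize(List("test data 1", "test data 2"))
    // Note: SparkFiles.get is evaluated here on the driver, so the path
    // handed to pipe() is the driver's local copy of the script.
    val opData = ipData.pipe(SparkFiles.get("my_script.py"))
    opData.collect().foreach(println)

    sc.stop()
  }
}

As far as I can tell, pipe() takes the command string as-is, so whatever path SparkFiles.get returns on the driver is the path every executor then tries to execute.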