Background / Goal: submit a Spark job through an API call, from any machine, to a Spark EC2 cluster. The job runs perfectly fine when the Python file is run against Apache Spark on localhost; however, it cannot be run on Apache Spark EC2.
Clarification: the existing question "Submitting jobs to Spark EC2 cluster remotely" covers submitting jobs to Spark EC2 remotely, but not via an API call.
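For contrast, the non-API route referred to there is a plain spark-submit invocation against the cluster (a sketch; 7077 is the standalone master's default submission port for client mode):

# Client-mode submission of the same Python job, run from the submitting machine
./bin/spark-submit \
  --master spark://ec2-54-209-108-127.compute-1.amazonaws.com:7077 \
  wordcount.py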
API call:
curl -X POST http://ec2-54-209-108-127.compute-1.amazonaws.com:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data '{
  "action" : "CreateSubmissionRequest",
  "appArgs" : [ "" ],
  "appResource" : "wordcount.py",
  "clientSparkVersion" : "1.5.0",
  "environmentVariables" : {
    "SPARK_ENV_LOADED" : "1"
  },
  "mainClass" : "",
  "sparkProperties" : {
    "spark.jars" : "wordcount.py",
    "spark.driver.supervise" : "true",
    "spark.app.name" : "MyJob",
    "spark.eventLog.enabled" : "true",
    "spark.submit.deployMode" : "cluster",
    "spark.master" : "spark://ec2-54-209-108-127.compute-1.amazonaws.com:6066"
  }
}'
{
"action" : "CreateSubmissionResponse",
"message" : "Driver successfully submitted as driver-20160712145703-0003",
"serverSparkVersion" : "1.6.1",
"submissionId" : "driver-20160712145703-0003",
"success" : true
}
On querying the status of the submission, the API returns a File Not Found error:
curl http://ec2-54-209-108-127.compute-1.amazonaws.com:6066/v1/submissions/status/driver-20160712145703-0003
{
"action" : "SubmissionStatusResponse",
"driverState" : "ERROR",
"message" : "Exception from the cluster:\njava.io.FileNotFoundException: wordcount.py (No such file or directory)\n\tjava.io.FileInputStream.open(Native Method)\n\tjava.io.FileInputStream.<init>(FileInputStream.java:146)\n\torg.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:124)\n\torg.spark-project.guava.io.Files$FileByteSource.openStream(Files.java:114)\n\torg.spark-project.guava.io.ByteSource.copyTo(ByteSource.java:202)\n\torg.spark-project.guava.io.Files.copy(Files.java:436)\n\torg.apache.spark.util.Utils$.org$apache$spark$util$Utils$$copyRecursive(Utils.scala:539)\n\torg.apache.spark.util.Utils$.copyFile(Utils.scala:510)\n\torg.apache.spark.util.Utils$.doFetchFile(Utils.scala:595)\n\torg.apache.spark.util.Utils$.fetchFile(Utils.scala:394)\n\torg.apache.spark.deploy.worker.DriverRunner.org$apache$spark$deploy$worker$DriverRunner$$downloadUserJar(DriverRunner.scala:150)\n\torg.apache.spark.deploy.worker.DriverRunner$$anon$1.run(DriverRunner.scala:79)",
"serverSparkVersion" : "1.6.1",
"submissionId" : "driver-20160712145703-0003",
"success" : true,
"workerHostPort" : "172.31.17.189:59433",
"workerId" : "worker-20160712083825-172.31.17.189-59433"
}
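The stack trace shows the worker itself trying to open wordcount.py, which suggests that in cluster deploy mode the appResource path is resolved on the worker that launches the driver, not on the machine making the API call. A quick way to confirm the file is missing there is to check from the master node (a sketch; the private IP comes from workerHostPort above, and passwordless root SSH from master to slaves is the usual spark-ec2 setup):

# Run on the EC2 master node; inspects the worker that launched the driver
ssh 172.31.17.189 'ls -l wordcount.py /root/wordcount.py'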
Suggestions and improvements are welcome. P.S. I am new to Apache Spark.
Updated API call (mainClass, appArgs, appResource, and clientSparkVersion set to new values):
curl -X POST http://ec2-54-209-108-127.compute-1.amazonaws.com:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data '{
  "action" : "CreateSubmissionRequest",
  "appArgs" : [ "/wordcount.py" ],
  "appResource" : "file:/wordcount.py",
  "clientSparkVersion" : "1.6.1",
  "environmentVariables" : {
    "SPARK_ENV_LOADED" : "1"
  },
  "mainClass" : "org.apache.spark.deploy.SparkSubmit",
  "sparkProperties" : {
    "spark.driver.supervise" : "false",
    "spark.app.name" : "Simple App",
    "spark.eventLog.enabled" : "true",
    "spark.submit.deployMode" : "cluster",
    "spark.master" : "spark://ec2-54-209-108-127.compute-1.amazonaws.com:6066"
  }
}'
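Note that file:/wordcount.py still refers to the local filesystem of whichever worker launches the driver, so the file has to exist there. If copying it to every node is not an option, an alternative is to serve the file from a location every node can fetch, such as HDFS, S3, or a plain HTTP server, and point appResource at that URL (a sketch; the port and path are illustrative):

# On the machine that holds wordcount.py (Python 2 module name, matching the Spark 1.6 era)
cd /path/to/job
python -m SimpleHTTPServer 8000
# then submit with "appResource" : "http://<submitting-machine-ip>:8000/wordcount.py"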
Answer 0 (score: 0)
Copy the file to all the slave nodes with the spark-ec2 copy-dir script:

sudo /root/spark-ec2/copy-dir /root/wordcount.py
RSYNC'ing /root/wordcount.py to slaves...
ec2-54-175-163-32.compute-1.amazonaws.com
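To double-check that the copy reached the slaves, the file can be listed over SSH (the hostname is the one printed by copy-dir above):

ssh ec2-54-175-163-32.compute-1.amazonaws.com ls -l /root/wordcount.py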
As a result, the File Not Found error went away. However, after resubmitting the Spark job, the status of the submission was:
{
"action": "SubmissionStatusResponse",
"driverState": "FAILED",
"serverSparkVersion": "1.6.1",
"submissionId": "driver-20160713094138-0010",
"success": true,
"workerHostPort": "172.31.17.189:59433",
"workerId": "worker-20160712083825-172.31.17.189-59433"
}
So it is unclear what the exact error now is, whether the original problem has been fully resolved, and what the new failure means.
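A FAILED driver state with no message usually means the driver process started and then exited with a non-zero code, so the next place to look is the driver's own log on the worker. A sketch of where to find it, assuming the spark-ec2 default layout (Spark installed at /root/spark, driver logs under the worker's work directory):

# Run from the master node; the driver ID is the one returned above
ssh 172.31.17.189 'cat /root/spark/work/driver-20160713094138-0010/stderr'

The same logs are linked from the standalone master's web UI at http://ec2-54-209-108-127.compute-1.amazonaws.com:8080. It may also be worth noting that, at least through Spark 1.6, the documentation states that standalone mode does not support cluster deploy mode for Python applications, which by itself could explain a driver that fails immediately.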