I'm trying to run an application from my workstation (inside IntelliJ) and connect to a remote Spark cluster (2.3.1) running on EC2. I know this isn't best practice, but if I can get it working for development it will make my life a lot easier.
I've managed to get fairly far and can run actions on RDDs and get results back, until I hit a step that uses .zipWithIndex() and get the following exception:
ERROR 2018-07-19 11:16:21,137 o.a.spark.network.shuffle.RetryingBlockFetcher Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to /172.x.x.x:33898
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245) ~[spark-combined-shaded-2.3.1-evg1.jar:na]
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187) ~[spark-combined-shaded-2.3.1-evg1.jar:na]
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:113) ~[spark-combined-shaded-2.3.1-evg1.jar:na]
at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141) [spark-combined-shaded-2.3.1-evg1.jar:na]
at org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:121) [spark-combined-shaded-2.3.1-evg1.jar:na]
at org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:123) [spark-combined-shaded-2.3.1-evg1.jar:na]
at org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:98) [spark-combined-shaded-2.3.1-evg1.jar:na]
at org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:691) [spark-combined-shaded-2.3.1-evg1.jar:na]
at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:82) [spark-combined-shaded-2.3.1-evg1.jar:na]
at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63) [spark-combined-shaded-2.3.1-evg1.jar:na]
at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63) [spark-combined-shaded-2.3.1-evg1.jar:na]
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1991) [spark-combined-shaded-2.3.1-evg1.jar:na]
at org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:62) [spark-combined-shaded-2.3.1-evg1.jar:na]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_172]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_172]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_172]
Where 172.x.x.x is the (redacted) local IP, inside the AWS VPC, of the Spark instance that hosts the master and the workers. I've configured the EC2 Spark instance so that it should use its public DNS via SPARK_PUBLIC_DNS, and I build my SparkContext with the following configuration:
SparkConf sparkConf = new SparkConf()
.setAppName("myapp")
.setMaster(System.getProperty("spark.master", "spark://ec2-x-x-x-x.compute-1.amazonaws.com:7077"))
.set("spark.cores.max", String.valueOf(4))
.set("spark.scheduler.mode", "FAIR")
.set("spark.driver.maxResultSize", String.valueOf(maxResultSize))
.set("spark.executor.memory", "2G")
.set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
.set("spark.ui.retainedStages", String.valueOf(250))
.set("spark.ui.retainedJobs", String.valueOf(250))
.set("spark.network.timeout", String.valueOf(800))
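// Driver advertises localhost with fixed driver and block-manager ports, so the executors can reach it back through the reverse SSH tunnel shown below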
.set("spark.driver.host", "localhost")
.set("spark.driver.port", String.valueOf(23584))
.set("spark.driver.blockManager.port", String.valueOf(6578))
.set("spark.files.overwrite", "true")
;
SparkSession spark = SparkSession.builder().config(sparkConf).getOrCreate();
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
jsc.addJar("my_application.jar");
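For reference, the step that fails is roughly of the shape sketched below. This is only an illustration, not the real job; the RDD name and its contents are placeholders. The relevant point is that an action like this sends task results back to the driver, and results above a size threshold are fetched from the executors' block managers, which is the fetch that dies with "Failed to connect to /172.x.x.x:33898":

// Minimal sketch; "records" and its contents are made up.
// (imports: java.util.Arrays, org.apache.spark.api.java.JavaRDD, org.apache.spark.api.java.JavaPairRDD)
JavaRDD<String> records = jsc.parallelize(Arrays.asList("a", "b", "c"), 4);
// zipWithIndex() runs its own job to count elements per partition before assigning indexes
JavaPairRDD<String, Long> indexed = records.zipWithIndex();
// collecting the result is what pulls task results back to the driver
System.out.println(indexed.collect());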
Then I set up an SSH tunnel with

ssh -R 23584:localhost:23584 -L 44895:localhost:44895 -R 27017:localhost:27017 -R 6578:localhost:6578 ubuntu@ec2-x-x-x-x.compute-1.amazonaws.com

so that the workers can see my machine. What am I missing? Why is it still trying to connect to something through an AWS IP that isn't reachable from my machine?

Edit: when I look at the web UI, I can see that the port referenced in the exception does indeed belong to an executor. How do I tell my driver to connect via the public IP rather than the private IP?
Answer 0 (score: 0)
I was eventually able to fix this by setting the undocumented variable SPARK_LOCAL_HOSTNAME to the public DNS in the spark-env.sh file.
My environment config now looks like this:
export JAVA_HOME="/usr/lib/jvm/java-8-oracle"
export SPARK_PUBLIC_DNS="ec2-xx-xxx-xxx-x.compute-1.amazonaws.com"
export SPARK_MASTER_HOST=""
export SPARK_LOCAL_HOSTNAME="ec2-xx-xxx-xxx-x.compute-1.amazonaws.com"
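Not part of the original answer, just a hedged sanity check using Spark's public status-tracker API: once SPARK_LOCAL_HOSTNAME is set, the host each executor advertises to the driver should be the public DNS rather than the 172.x.x.x VPC address, and it can be printed from the driver like this:

// Assumes "jsc" is the JavaSparkContext built in the question above.
for (org.apache.spark.SparkExecutorInfo info : jsc.statusTracker().getExecutorInfos()) {
    // host()/port() are the block-manager endpoints the driver connects to for fetches
    System.out.println(info.host() + ":" + info.port());
}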