Running zipWithIndex() on a Spark cluster in EC2 fails with "Network is unreachable"

Date: 2018-07-19 15:30:29

Tags: amazon-web-services apache-spark amazon-ec2

I am trying to run an application from my workstation (inside IntelliJ) and connect to a remote Spark cluster (2.3.1) running on EC2. I know this is not best practice, but being able to use this setup for development work would make my life much easier.

I have managed to get quite far and can run actions on RDDs and get results back, until I reach a step that uses .zipWithIndex() and hit the following exception:

ERROR 2018-07-19 11:16:21,137 o.a.spark.network.shuffle.RetryingBlockFetcher Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to /172.x.x.x:33898
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245) ~[spark-combined-shaded-2.3.1-evg1.jar:na]
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187) ~[spark-combined-shaded-2.3.1-evg1.jar:na]
    at org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:113) ~[spark-combined-shaded-2.3.1-evg1.jar:na]
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141) [spark-combined-shaded-2.3.1-evg1.jar:na]
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:121) [spark-combined-shaded-2.3.1-evg1.jar:na]
    at org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:123) [spark-combined-shaded-2.3.1-evg1.jar:na]
    at org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:98) [spark-combined-shaded-2.3.1-evg1.jar:na]
    at org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:691) [spark-combined-shaded-2.3.1-evg1.jar:na]
    at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:82) [spark-combined-shaded-2.3.1-evg1.jar:na]
    at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63) [spark-combined-shaded-2.3.1-evg1.jar:na]
    at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63) [spark-combined-shaded-2.3.1-evg1.jar:na]
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1991) [spark-combined-shaded-2.3.1-evg1.jar:na]
    at org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:62) [spark-combined-shaded-2.3.1-evg1.jar:na]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_172]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_172]
    at java.lang.Thread.run(Thread.java:748) [na:1.8.0_172]

where 172.x.x.x is the (redacted) local IP, inside the AWS VPC, of the Spark instance that hosts both the master and the worker. I have configured the EC2 Spark instance so that it should use its public DNS via SPARK_PUBLIC_DNS, and I build my SparkContext with the following configuration:

SparkConf sparkConf = new SparkConf()
              .setAppName("myapp")
              .setMaster(System.getProperty("spark.master", "spark://ec2-x-x-x-x.compute-1.amazonaws.com:7077"))
              .set("spark.cores.max", String.valueOf(4))
              .set("spark.scheduler.mode", "FAIR")
              .set("spark.driver.maxResultSize", String.valueOf(maxResultSize))
              .set("spark.executor.memory", "2G")
              .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
              .set("spark.ui.retainedStages", String.valueOf(250))
              .set("spark.ui.retainedJobs", String.valueOf(250))
              .set("spark.network.timeout", String.valueOf(800))
              .set("spark.driver.host", "localhost")
              .set("spark.driver.port", String.valueOf(23584))
              .set("spark.driver.blockManager.port", String.valueOf(6578))
              .set("spark.files.overwrite", "true")
              ;
SparkSession spark = SparkSession.builder().config(sparkConf).getOrCreate();
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
jsc.addJar("my_application.jar");

Then I set up an SSH tunnel with

ssh -R 23584:localhost:23584 -L 44895:localhost:44895 -R 27017:localhost:27017 -R 6578:localhost:6578 ubuntu@ec2-x-x-x-x.compute-1.amazonaws.com

so that the workers can see my machine. What am I missing? Why is it still trying to connect to something via an AWS IP that is not reachable from my machine?
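
For reference, here is a minimal sketch of the kind of step that trips it up (the input path is a placeholder, not my actual job); plain actions return results fine, and it is the zipWithIndex() step that fails:

// imports assumed at the top of the file:
//   import java.util.List;
//   import org.apache.spark.api.java.JavaRDD;
//   import scala.Tuple2;

JavaRDD<String> lines = jsc.textFile("hdfs:///placeholder/input.txt");   // placeholder input
long total = lines.count();                                              // actions like this succeed
List<Tuple2<String, Long>> indexed = lines.zipWithIndex().collect();     // this step hits the exception above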

Edit: When I look at the web UI, I can see that the port referenced in the exception does indeed belong to an executor. How do I tell my driver to connect via the public IP rather than the private IP?

1 answer:

Answer 0 (score: 0)

I was eventually able to resolve this by setting the undocumented variable SPARK_LOCAL_HOSTNAME to the public DNS in the spark-env.sh file.

My environment configuration now looks like this:

export JAVA_HOME="/usr/lib/jvm/java-8-oracle"
# hostname advertised externally, e.g. in the standalone web UI
export SPARK_PUBLIC_DNS="ec2-xx-xxx-xxx-x.compute-1.amazonaws.com"
export SPARK_MASTER_HOST=""
# undocumented: overrides the hostname Spark resolves for this node
export SPARK_LOCAL_HOSTNAME="ec2-xx-xxx-xxx-x.compute-1.amazonaws.com"
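
As a sanity check (a sketch, not part of the fix itself; it reuses the jsc from the question), you can ask the driver which block-manager addresses it knows about. After restarting the cluster with the settings above, the executor entries should show the public DNS name rather than a 172.x.x.x address:

// print the block-manager addresses currently registered with the driver
scala.collection.Iterator<String> blockManagers =
        jsc.sc().getExecutorMemoryStatus().keysIterator();
while (blockManagers.hasNext()) {
    // expect e.g. "ec2-xx-xxx-xxx-x.compute-1.amazonaws.com:33898", not "172.x.x.x:33898"
    System.out.println(blockManagers.next());
}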