I created a 3-node (1 master, 2 workers) Apache Spark cluster on Google Cloud Dataproc. I can submit jobs to the cluster when connecting to the master over SSH, but I can't get it to work remotely. The only documentation I could find was for a similar issue on AWS, and that approach didn't work for me.
Here is what I'm trying:
import pyspark
conf = pyspark.SparkConf().setAppName('Test').setMaster('spark://<master-node-ip>:7077')
sc = pyspark.SparkContext(conf=conf)
and I get this error:
19/11/13 13:33:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/11/13 13:33:53 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master <master-node-ip>:7077
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:109)
at org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1$$anon$1.run(StandaloneAppClient.scala:106)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Failed to connect to /<master-node-ip>:7077
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
... 4 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /<master-node-ip>:7077
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
... 1 more
Caused by: java.net.ConnectException: Connection refused
I added a firewall rule to allow ingress traffic on tcp:7077, but that didn't solve the problem.
Ultimately, I want to set up a VM on Compute Engine that can run this code and connect over internal IP addresses (within a VPC I created) to run jobs on Dataproc, without going through gcloud dataproc jobs submit (a command of the kind shown below). I tried both the internal and the external IP, but neither worked.
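For reference, the submission path I'd like to avoid looks roughly like this (script, cluster, and region names are placeholders):

gcloud dataproc jobs submit pyspark my_job.py \
    --cluster=<cluster-name> \
    --region=<region>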
Does anyone know how I can get this working?
Answer 0 (score: 2):
So there are a few things to unpack here.
The first thing I want to make sure you understand is that you should be very careful when exposing a distributed computing framework to ingress traffic. If Dataproc exposed a Spark standalone cluster on port 7077, you would want to lock that ingress traffic down. It sounds like you know this, given that you want a VM on a shared VPC, but it matters even when you only open firewalls to test.
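As an illustration (the rule name is hypothetical, and the source range shown is GCP's default auto-mode subnet range; substitute your VPC's actual internal range), an ingress rule scoped to internal traffic only would look like:

gcloud compute firewall-rules create allow-spark-internal \
    --network=<your-vpc> \
    --direction=INGRESS \
    --allow=tcp:7077 \
    --source-ranges=10.128.0.0/9

That keeps the port reachable from inside the VPC while staying closed to the internet, unlike a rule with --source-ranges=0.0.0.0/0.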
The main problem, though, appears to be that you are trying to connect as if this were a Spark standalone cluster. Dataproc actually runs Spark on YARN. To connect, you need to set the Spark cluster manager type to "yarn" and correctly configure your local machine to talk to the remote YARN cluster, either by setting up a yarn-site.xml and pointing HADOOP_CONF_DIR at it, or by directly setting YARN properties such as yarn.resourcemanager.address.
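Here is a minimal sketch of both approaches in PySpark. The paths and hostname are placeholders to fill in; 8032 is YARN's default ResourceManager port, and on Dataproc the cluster's Hadoop configs live under /etc/hadoop/conf on the master:

import os
import pyspark

# Option 1: copy yarn-site.xml and core-site.xml from the master's
# /etc/hadoop/conf to this machine and point HADOOP_CONF_DIR at the copy.
# Spark typically refuses master 'yarn' unless HADOOP_CONF_DIR or
# YARN_CONF_DIR is set, so this step is effectively required.
os.environ['HADOOP_CONF_DIR'] = '/path/to/copied/hadoop/conf'

conf = pyspark.SparkConf().setAppName('Test').setMaster('yarn')

# Option 2: additionally (or instead of editing the copied files) set the
# YARN properties directly; spark.hadoop.* settings are forwarded into the
# underlying Hadoop Configuration.
conf.set('spark.hadoop.yarn.resourcemanager.hostname', '<master-node-ip>')
conf.set('spark.hadoop.yarn.resourcemanager.address', '<master-node-ip>:8032')

sc = pyspark.SparkContext(conf=conf)

Whichever route you take, note that the client machine now needs network access to the ResourceManager port (8032 by default) rather than 7077.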
Also note that, once you know Dataproc uses YARN, this question is similar to this one: Scala Spark connect to remote cluster.