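I can't get my program to run on my Spark cluster. I set up the cluster with 1 master and 4 slaves. I started the master first and then the slaves, and they showed up in the master's web UI.
Then I launched a small Python script to check whether jobs can be executed: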
from pyspark import * #SparkContext, SparkConf, spark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql import SQLContext
from files import files
import sys
if __name__ == "__main__":
    appName = 'SparkExample'
    masterUrl = 'spark://10.0.2.55:7077'

    conf = SparkConf()
    conf.setAppName(appName)
    conf.setMaster(masterUrl)
    conf.set("spark.driver.cores","1")
    conf.set("spark.driver.memory","1g")
    conf.set("spark.executor.cores","1")
    conf.set("spark.executor.memory","4g")
    conf.set("spark.python.worker.memory","256m")
    conf.set("spark.cores.max","4")
    conf.set("spark.shuffle.service.enabled","true")
    conf.set("spark.dynamicAllocation.enabled","true")
    conf.set("spark.dynamicAllocation.maxExecutors","1")

    # Print the effective configuration
    for k,v in conf.getAll():
        print(k+":"+v)

    spark = SparkSession.builder.config(conf=conf).getOrCreate()
    #spark = SparkSession.builder.master(masterUrl).appName(appName).config("spark.executor.memory","1g").getOrCreate()

    l = [('Alice', 1)]
    spark.createDataFrame(l).collect()
    spark.createDataFrame(l, ['name', 'age']).collect()

    print("#############")
    print("Test finished")
    print("#############")
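But as soon as I reach the collect (line 45 of my script: "spark.createDataFrame(l).collect()"), Spark seems to hang. After a while I see this message:
"WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources"
So I checked the cluster UI: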
Worker Id                                Address          State  Cores        Memory
worker-20171027105227-xx.x.x.x6-35309    10.0.2.56:35309  ALIVE  4 (0 Used)   6.8 GB (0.0 B Used)
worker-20171027110202-xx.x.x.x0-43433    10.0.2.10:43433  ALIVE  16 (1 Used)  30.4 GB (4.0 GB Used)
worker-20171027110746-xx.x.x.x5-45126    10.0.2.65:45126  ALIVE  8 (0 Used)   30.4 GB (0.0 B Used)
worker-20171027110939-xx.x.x.x4-42477    10.0.2.64:42477  ALIVE  16 (0 Used)  30.4 GB (0.0 B Used)
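It looks like there are enough resources for the small job I created. I can also see that the job is actually running there. When I click on it, I see that it was launched on 5 executors, all but one of which have EXITED. When I open the log of one of the exited executors, I see the following error messages: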
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/10/27 16:45:23 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 14443@CODA
17/10/27 16:45:23 INFO SignalUtils: Registered signal handler for TERM
17/10/27 16:45:23 INFO SignalUtils: Registered signal handler for HUP
17/10/27 16:45:23 INFO SignalUtils: Registered signal handler for INT
17/10/27 16:45:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/10/27 16:45:24 INFO SecurityManager: Changing view acls to: root,geissler
17/10/27 16:45:24 INFO SecurityManager: Changing modify acls to: root,geissler
17/10/27 16:45:24 INFO SecurityManager: Changing view acls groups to:
17/10/27 16:45:24 INFO SecurityManager: Changing modify acls groups to:
17/10/27 16:45:24 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root, geissler); groups with view permissions: Set(); users with modify permissions: Set(root, geissler); groups with modify permissions: Set()
17/10/27 16:47:25 ERROR RpcOutboxMessage: Ask timeout before connecting successfully
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:216)
at scala.util.Try$.apply(Try.scala:192)
at scala.util.Failure.recover(Try.scala:216)
at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at org.spark_project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:136)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at scala.concurrent.Promise$class.complete(Promise.scala:55)
at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:63)
at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:78)
at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:54)
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:106)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:153)
at org.apache.spark.rpc.netty.NettyRpcEnv.org$apache$spark$rpc$netty$NettyRpcEnv$$onFailure$1(NettyRpcEnv.scala:205)
at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:239)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply in 120 seconds
... 8 more
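This looks as if the slaves cannot report their results back to me, but I don't know what to do about it. The slaves are on the same network level as the master, but on different virtual machines (not Docker containers). Is there a way to check whether they can or cannot reach the master? Are there any configuration settings I overlooked when setting up the cluster?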
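One way to test reachability is a plain TCP connection attempt, run on each slave against the master (and against the machine running the driver). The following is only a minimal sketch using the Python standard library; the helper name `can_connect` and the file name `check_port.py` are made up for illustration, and the host and port in the usage line are taken from the master URL above. A successful connection only shows that the port is open, not that Spark itself is healthy.

```python
import socket
import sys


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical usage: python check_port.py 10.0.2.55 7077
    if len(sys.argv) >= 3:
        host, port = sys.argv[1], int(sys.argv[2])
        state = "reachable" if can_connect(host, port) else "NOT reachable"
        print("{}:{} is {}".format(host, port, state))
    else:
        print("usage: python check_port.py <host> <port>")
```

Running it in the other direction as well (from a slave towards the host and port the driver listens on) would show whether the executors are able to call back to the driver, which is what the "Ask timeout before connecting successfully" line in the executor log hints at.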
Spark version: 2.1.2 (on the master, the nodes, and pyspark)
Answer 0 (score: 0)
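The mistake here was that the Python script was executed locally. Always launch your Spark scripts via spark-submit rather than running them as an ordinary program. The same applies to Java Spark programs.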
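As a rough illustration of what launching via spark-submit looks like, here is a small sketch that assembles the command line from the settings used in the question. The helper `build_spark_submit` and the script name `spark_example.py` are assumptions made for this example; `--master` and `--conf key=value` are standard spark-submit options. The sketch only prints the command; it does not run it.

```python
import shlex


def build_spark_submit(script, master, conf=None, app_args=()):
    """Assemble a spark-submit argument list (hypothetical helper)."""
    cmd = ["spark-submit", "--master", master]
    # Each Spark property is passed as a separate --conf key=value pair.
    for key, value in sorted((conf or {}).items()):
        cmd += ["--conf", "{}={}".format(key, value)]
    cmd.append(script)
    cmd.extend(app_args)
    return cmd


if __name__ == "__main__":
    cmd = build_spark_submit(
        "spark_example.py",              # assumed file name of the test script
        "spark://10.0.2.55:7077",        # master URL from the question
        {"spark.executor.memory": "4g", "spark.cores.max": "4"},
    )
    # Print a shell-safe version of the command for copy and paste.
    print(" ".join(shlex.quote(part) for part in cmd))
```

The printed line can then be run on a machine that has the Spark distribution installed and can reach the cluster, instead of starting the script with the plain Python interpreter.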