PySpark application hangs forever in client mode

Asked: 2019-07-16 10:53:53

Tags: python apache-spark pyspark cluster-computing

The script runs fine when I run it locally with spark-submit. The cluster currently has 4 worker nodes. When I try to run the PySpark script as an application in client mode as follows:

./bin/spark-submit \
--master spark://<master ip>:7077 \
--deploy-mode client \
--files /home/scalability_scripts/amazon_reviews/amazon_review_full_csv/train.csv  \
/home/scalability_scripts/NaiveBayes.py --train /home/scalability_scripts/amazon_reviews/amazon_review_full_csv/train.csv --limit 1000000

the code hangs forever, and this is the last message displayed:

19/07/16 03:42:12 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
19/07/16 03:42:12 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/home/spark-2.4.3-bin-hadoop2.7/spark-warehouse').
19/07/16 03:42:12 INFO SharedState: Warehouse path is 'file:/home/spark-2.4.3-bin-hadoop2.7/spark-warehouse'.
19/07/16 03:42:12 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint

The cluster manager is the Spark standalone cluster manager, and Python with the necessary modules is installed on all worker nodes. Does this problem have anything to do with the way the file is specified, or am I missing some other configuration?
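As an aside on the --files question: my understanding is that a file shipped with --files is copied into the application's working directory on every node and is resolved by its bare file name through SparkFiles, not through the absolute driver-side path passed on the command line. A minimal sketch of what I mean (I have not seen the argument handling inside NaiveBayes.py, so this is an assumption):

from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("NaiveBayes").getOrCreate()

# A file distributed with --files is looked up by its bare file name;
# SparkFiles.get returns the local path where Spark materialized it.
train_path = SparkFiles.get("train.csv")
print(train_path)

That said, a bad path would normally fail loudly rather than hang, so this alone may not explain the behaviour.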

P.S. While the application was hanging, I found the following error in an executor's stderr on the Spark UI:

Spark Executor Command: "/usr/lib/jvm/jdk1.8.0_211//bin/java" "-cp" "/home/spark-2.4.3-bin-hadoop2.7/conf/:/home/spark-2.4.3-bin-hadoop2.7/jars/*" "-Xmx1024M" "-Dspark.driver.port=37281" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@server:37281" "--executor-id" "5142" "--hostname" "153.64.25.60" "--cores" "8" "--app-id" "app-20190716015703-0000" "--worker-url" "spark://Worker@153.64.25.60:32921"
========================================

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/07/16 02:33:06 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 6970@39054
19/07/16 02:33:06 INFO SignalUtils: Registered signal handler for TERM
19/07/16 02:33:06 INFO SignalUtils: Registered signal handler for HUP
19/07/16 02:33:06 INFO SignalUtils: Registered signal handler for INT
19/07/16 02:33:06 WARN Utils: Your hostname, 39054 resolves to a loopback address: 127.0.1.1; using 153.64.25.60 instead (on interface ens160)
19/07/16 02:33:06 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
19/07/16 02:33:06 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/07/16 02:33:06 INFO SecurityManager: Changing view acls to: root
19/07/16 02:33:06 INFO SecurityManager: Changing modify acls to: root
19/07/16 02:33:06 INFO SecurityManager: Changing view acls groups to: 
19/07/16 02:33:06 INFO SecurityManager: Changing modify acls groups to: 
19/07/16 02:33:06 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:281)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult: 
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
    at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:201)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:65)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    ... 4 more
Caused by: java.io.IOException: Failed to connect to server/153.64.25.113:37281
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
    at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
    at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: server/153.64.25.113:37281
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
    ... 1 more
Caused by: java.net.ConnectException: Connection refused
    ... 11 more
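The final "Caused by" entries look like the actual problem: the executor starts, tries to connect back to the driver at server/153.64.25.113:37281, and gets "Connection refused". In client mode every worker must be able to reach the machine running spark-submit on spark.driver.port, so a driver hostname that does not resolve to an address routable from the workers, a driver bound to a loopback interface, or a firewall blocking the driver's ephemeral port would all produce exactly this hang. A minimal sketch of how the driver address might be pinned explicitly (the <driver ip> placeholder and the session setup are my assumptions, not tested against this cluster):

from pyspark.sql import SparkSession

# "<driver ip>" stands for the address of the machine running spark-submit
# as the workers see it (153.64.25.113 in the trace above, if that address
# is actually reachable from them).
spark = (SparkSession.builder
    .appName("NaiveBayes")
    .config("spark.driver.host", "<driver ip>")     # address advertised to executors
    .config("spark.driver.bindAddress", "0.0.0.0")  # listen on all interfaces
    .getOrCreate())

The same settings can also be passed on the command line with --conf spark.driver.host=<driver ip>, or via the SPARK_LOCAL_IP environment variable mentioned in the Utils warning above.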

0 Answers
