Strange behaviour of spark-submit

Asked: 2015-07-01 21:13:31

Tags: apache-spark hive cloudera-cdh apache-spark-sql

I am running the following code in pyspark:

In [14]: conf = SparkConf()

In [15]: conf.getAll()

[(u'spark.eventLog.enabled', u'true'),
 (u'spark.eventLog.dir',
  u'hdfs://ip-10-0-0-220.ec2.internal:8020/user/spark/applicationHistory'),
 (u'spark.master', u'local[*]'),
 (u'spark.yarn.historyServer.address',
  u'http://ip-10-0-0-220.ec2.internal:18088'),
 (u'spark.executor.extraLibraryPath',
  u'/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/lib/native'),
 (u'spark.app.name', u'pyspark-shell'),
 (u'spark.driver.extraLibraryPath',
  u'/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/lib/native')]

In [16]: sc

<pyspark.context.SparkContext at 0x7fab9dd8a750>

In [17]: sc.version

u'1.4.0'

In [19]: sqlContext

<pyspark.sql.context.HiveContext at 0x7fab9de785d0>

In [20]: access = sqlContext.read.json("hdfs://10.0.0.220/raw/logs/arquimedes/access/*.json")

Everything works fine (I can create tables in the Hive metastore, etc.).

But when I try to run this code with spark-submit:
# -*- coding: utf-8 -*-

from __future__ import print_function

import re

from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import Row
from pyspark.conf import SparkConf

if __name__ == "__main__":

    sc = SparkContext(appName="Minimal Example 2")

    conf = SparkConf()

    print(conf.getAll())

    print(sc)

    print(sc.version)

    sqlContext = HiveContext(sc)

    print(sqlContext)

    # ## Read the access log file
    access = sqlContext.read.json("hdfs://10.0.0.220/raw/logs/arquimedes/access/*.json")

    sc.stop()

I run it with:

$ spark-submit --master yarn-cluster  --deploy-mode cluster minimal-example2.py

and it runs without (apparent) errors, but if you check the logs:

$ yarn logs -applicationId application_1435696841856_0027       

you see:

15/07/01 16:55:10 INFO client.RMProxy: Connecting to ResourceManager at ip-10-0-0-220.ec2.internal/10.0.0.220:8032


Container: container_1435696841856_0027_01_000001 on ip-10-0-0-36.ec2.internal_8041
=====================================================================================
LogType: stderr
LogLength: 21077
Log Contents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/yarn/nm/usercache/nanounanue/filecache/133/spark-assembly-1.4.0-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/07/01 16:54:00 INFO yarn.ApplicationMaster: Registered signal handlers for [TERM, HUP, INT]
15/07/01 16:54:01 INFO yarn.ApplicationMaster: ApplicationAttemptId: appattempt_1435696841856_0027_000001
15/07/01 16:54:02 INFO spark.SecurityManager: Changing view acls to: yarn,nanounanue
15/07/01 16:54:02 INFO spark.SecurityManager: Changing modify acls to: yarn,nanounanue
15/07/01 16:54:02 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, nanounanue); users with modify permissions: Set(yarn, nanounanue)
15/07/01 16:54:02 INFO yarn.ApplicationMaster: Starting the user application in a separate Thread
15/07/01 16:54:02 INFO yarn.ApplicationMaster: Waiting for spark context initialization
15/07/01 16:54:02 INFO yarn.ApplicationMaster: Waiting for spark context initialization ... 
15/07/01 16:54:03 INFO spark.SparkContext: Running Spark version 1.4.0
15/07/01 16:54:03 INFO spark.SecurityManager: Changing view acls to: yarn,nanounanue
15/07/01 16:54:03 INFO spark.SecurityManager: Changing modify acls to: yarn,nanounanue
15/07/01 16:54:03 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yarn, nanounanue); users with modify permissions: Set(yarn, nanounanue)
15/07/01 16:54:03 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/07/01 16:54:03 INFO Remoting: Starting remoting
15/07/01 16:54:03 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@10.0.0.36:41190]
15/07/01 16:54:03 INFO util.Utils: Successfully started service 'sparkDriver' on port 41190.
15/07/01 16:54:04 INFO spark.SparkEnv: Registering MapOutputTracker
15/07/01 16:54:04 INFO spark.SparkEnv: Registering BlockManagerMaster
15/07/01 16:54:04 INFO storage.DiskBlockManager: Created local directory at /yarn/nm/usercache/nanounanue/appcache/application_1435696841856_0027/blockmgr-14127054-19b1-4cfe-80c3-2c5fc917c9cf
15/07/01 16:54:04 INFO storage.DiskBlockManager: Created local directory at /data0/yarn/nm/usercache/nanounanue/appcache/application_1435696841856_0027/blockmgr-c8119846-7f6f-45eb-911b-443cb4d7e9c9
15/07/01 16:54:04 INFO storage.MemoryStore: MemoryStore started with capacity 245.7 MB
15/07/01 16:54:04 INFO spark.HttpFileServer: HTTP File server directory is /yarn/nm/usercache/nanounanue/appcache/application_1435696841856_0027/httpd-c4abf72b-2ee4-45d7-8252-c68f925bef58
15/07/01 16:54:04 INFO spark.HttpServer: Starting HTTP Server
15/07/01 16:54:04 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/07/01 16:54:04 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:56437
15/07/01 16:54:04 INFO util.Utils: Successfully started service 'HTTP file server' on port 56437.
15/07/01 16:54:04 INFO spark.SparkEnv: Registering OutputCommitCoordinator
15/07/01 16:54:04 INFO ui.JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
15/07/01 16:54:04 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/07/01 16:54:04 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:37958
15/07/01 16:54:04 INFO util.Utils: Successfully started service 'SparkUI' on port 37958.
15/07/01 16:54:04 INFO ui.SparkUI: Started SparkUI at http://10.0.0.36:37958
15/07/01 16:54:04 INFO cluster.YarnClusterScheduler: Created YarnClusterScheduler
15/07/01 16:54:04 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 49759.
15/07/01 16:54:04 INFO netty.NettyBlockTransferService: Server created on 49759
15/07/01 16:54:05 INFO storage.BlockManagerMaster: Trying to register BlockManager
15/07/01 16:54:05 INFO storage.BlockManagerMasterEndpoint: Registering block manager 10.0.0.36:49759 with 245.7 MB RAM, BlockManagerId(driver, 10.0.0.36, 49759)
15/07/01 16:54:05 INFO storage.BlockManagerMaster: Registered BlockManager
15/07/01 16:54:05 INFO scheduler.EventLoggingListener: Logging events to hdfs://ip-10-0-0-220.ec2.internal:8020/user/spark/applicationHistory/application_1435696841856_0027_1
15/07/01 16:54:05 INFO cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as AkkaRpcEndpointRef(Actor[akka://sparkDriver/user/YarnAM#-1566924249])
15/07/01 16:54:05 INFO client.RMProxy: Connecting to ResourceManager at ip-10-0-0-220.ec2.internal/10.0.0.220:8030
15/07/01 16:54:05 INFO yarn.YarnRMClient: Registering the ApplicationMaster
15/07/01 16:54:05 INFO yarn.YarnAllocator: Will request 2 executor containers, each with 1 cores and 1408 MB memory including 384 MB overhead
15/07/01 16:54:05 INFO yarn.YarnAllocator: Container request (host: Any, capability: <memory:1408, vCores:1>)
15/07/01 16:54:05 INFO yarn.YarnAllocator: Container request (host: Any, capability: <memory:1408, vCores:1>)
15/07/01 16:54:05 INFO yarn.ApplicationMaster: Started progress reporter thread - sleep time : 5000
15/07/01 16:54:11 INFO impl.AMRMClientImpl: Received new token for : ip-10-0-0-99.ec2.internal:8041
15/07/01 16:54:11 INFO impl.AMRMClientImpl: Received new token for : ip-10-0-0-37.ec2.internal:8041
15/07/01 16:54:11 INFO yarn.YarnAllocator: Launching container container_1435696841856_0027_01_000002 for on host ip-10-0-0-99.ec2.internal
15/07/01 16:54:11 INFO yarn.YarnAllocator: Launching ExecutorRunnable. driverUrl: akka.tcp://sparkDriver@10.0.0.36:41190/user/CoarseGrainedScheduler,  executorHostname: ip-10-0-0-99.ec2.internal
15/07/01 16:54:11 INFO yarn.YarnAllocator: Launching container container_1435696841856_0027_01_000003 for on host ip-10-0-0-37.ec2.internal
15/07/01 16:54:11 INFO yarn.ExecutorRunnable: Starting Executor Container
15/07/01 16:54:11 INFO yarn.YarnAllocator: Launching ExecutorRunnable. driverUrl: akka.tcp://sparkDriver@10.0.0.36:41190/user/CoarseGrainedScheduler,  executorHostname: ip-10-0-0-37.ec2.internal
15/07/01 16:54:11 INFO yarn.YarnAllocator: Received 2 containers from YARN, launching executors on 2 of them.
15/07/01 16:54:11 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
15/07/01 16:54:11 INFO yarn.ExecutorRunnable: Starting Executor Container
15/07/01 16:54:11 INFO yarn.ExecutorRunnable: Setting up ContainerLaunchContext
15/07/01 16:54:11 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
15/07/01 16:54:11 INFO yarn.ExecutorRunnable: Setting up ContainerLaunchContext
15/07/01 16:54:11 INFO yarn.ExecutorRunnable: Preparing Local resources
15/07/01 16:54:11 INFO yarn.ExecutorRunnable: Preparing Local resources
15/07/01 16:54:11 INFO yarn.ExecutorRunnable: Prepared Local resources Map(__spark__.jar -> resource { scheme: "hdfs" host: "ip-10-0-0-220.ec2.internal" port: 8020 file: "/user/nanounanue/.sparkStaging/application_1435696841856_0027/spark-assembly-1.4.0-hadoop2.6.0.jar" } s
ize: 162896305 timestamp: 1435784032445 type: FILE visibility: PRIVATE, pyspark.zip -> resource { scheme: "hdfs" host: "ip-10-0-0-220.ec2.internal" port: 8020 file: "/user/nanounanue/.sparkStaging/application_1435696841856_0027/pyspark.zip" } size: 281333 timestamp: 1435784
032613 type: FILE visibility: PRIVATE, py4j-0.8.2.1-src.zip -> resource { scheme: "hdfs" host: "ip-10-0-0-220.ec2.internal" port: 8020 file: "/user/nanounanue/.sparkStaging/application_1435696841856_0027/py4j-0.8.2.1-src.zip" } size: 37562 timestamp: 1435784032652 type: FIL
E visibility: PRIVATE, minimal-example2.py -> resource { scheme: "hdfs" host: "ip-10-0-0-220.ec2.internal" port: 8020 file: "/user/nanounanue/.sparkStaging/application_1435696841856_0027/minimal-example2.py" } size: 2448 timestamp: 1435784032692 type: FILE visibility: PRIVA
TE)
15/07/01 16:54:11 INFO yarn.ExecutorRunnable: Prepared Local resources Map(__spark__.jar -> resource { scheme: "hdfs" host: "ip-10-0-0-220.ec2.internal" port: 8020 file: "/user/nanounanue/.sparkStaging/application_1435696841856_0027/spark-assembly-1.4.0-hadoop2.6.0.jar" } s
ize: 162896305 timestamp: 1435784032445 type: FILE visibility: PRIVATE, pyspark.zip -> resource { scheme: "hdfs" host: "ip-10-0-0-220.ec2.internal" port: 8020 file: "/user/nanounanue/.sparkStaging/application_1435696841856_0027/pyspark.zip" } size: 281333 timestamp: 1435784
032613 type: FILE visibility: PRIVATE, py4j-0.8.2.1-src.zip -> resource { scheme: "hdfs" host: "ip-10-0-0-220.ec2.internal" port: 8020 file: "/user/nanounanue/.sparkStaging/application_1435696841856_0027/py4j-0.8.2.1-src.zip" } size: 37562 timestamp: 1435784032652 type: FIL
E visibility: PRIVATE, minimal-example2.py -> resource { scheme: "hdfs" host: "ip-10-0-0-220.ec2.internal" port: 8020 file: "/user/nanounanue/.sparkStaging/application_1435696841856_0027/minimal-example2.py" } size: 2448 timestamp: 1435784032692 type: FILE visibility: PRIVA
TE)
15/07/01 16:54:11 INFO yarn.ExecutorRunnable: Setting up executor with environment: Map(CLASSPATH -> {{PWD}}<CPS>{{PWD}}/__spark__.jar<CPS>$HADOOP_CLIENT_CONF_DIR<CPS>$HADOOP_CONF_DIR<CPS>$HADOOP_COMMON_HOME/*<CPS>$HADOOP_COMMON_HOME/lib/*<CPS>$HADOOP_HDFS_HOME/*<CPS>$HADOO
P_HDFS_HOME/lib/*<CPS>$HADOOP_YARN_HOME/*<CPS>$HADOOP_YARN_HOME/lib/*<CPS>$HADOOP_MAPRED_HOME/*<CPS>$HADOOP_MAPRED_HOME/lib/*<CPS>$MR2_CLASSPATH, SPARK_LOG_URL_STDERR -> http://ip-10-0-0-37.ec2.internal:8042/node/containerlogs/container_1435696841856_0027_01_000003/nanounan
ue/stderr?start=0, SPARK_YARN_STAGING_DIR -> .sparkStaging/application_1435696841856_0027, SPARK_YARN_CACHE_FILES_FILE_SIZES -> 162896305,281333,37562,2448, SPARK_USER -> nanounanue, SPARK_YARN_CACHE_FILES_VISIBILITIES -> PRIVATE,PRIVATE,PRIVATE,PRIVATE, SPARK_YARN_MODE -> 
true, SPARK_YARN_CACHE_FILES_TIME_STAMPS -> 1435784032445,1435784032613,1435784032652,1435784032692, PYTHONPATH -> pyspark.zip:py4j-0.8.2.1-src.zip, SPARK_LOG_URL_STDOUT -> http://ip-10-0-0-37.ec2.internal:8042/node/containerlogs/container_1435696841856_0027_01_000003/nanou
nanue/stdout?start=0, SPARK_YARN_CACHE_FILES -> hdfs://ip-10-0-0-220.ec2.internal:8020/user/nanounanue/.sparkStaging/application_1435696841856_0027/spark-assembly-1.4.0-hadoop2.6.0.jar#__spark__.jar,hdfs://ip-10-0-0-220.ec2.internal:8020/user/nanounanue/.sparkStaging/applic
ation_1435696841856_0027/pyspark.zip#pyspark.zip,hdfs://ip-10-0-0-220.ec2.internal:8020/user/nanounanue/.sparkStaging/application_1435696841856_0027/py4j-0.8.2.1-src.zip#py4j-0.8.2.1-src.zip,hdfs://ip-10-0-0-220.ec2.internal:8020/user/nanounanue/.sparkStaging/application_14
35696841856_0027/minimal-example2.py#minimal-example2.py)
15/07/01 16:54:11 INFO yarn.ExecutorRunnable: Setting up executor with environment: Map(CLASSPATH -> {{PWD}}<CPS>{{PWD}}/__spark__.jar<CPS>$HADOOP_CLIENT_CONF_DIR<CPS>$HADOOP_CONF_DIR<CPS>$HADOOP_COMMON_HOME/*<CPS>$HADOOP_COMMON_HOME/lib/*<CPS>$HADOOP_HDFS_HOME/*<CPS>$HADOO
P_HDFS_HOME/lib/*<CPS>$HADOOP_YARN_HOME/*<CPS>$HADOOP_YARN_HOME/lib/*<CPS>$HADOOP_MAPRED_HOME/*<CPS>$HADOOP_MAPRED_HOME/lib/*<CPS>$MR2_CLASSPATH, SPARK_LOG_URL_STDERR -> http://ip-10-0-0-99.ec2.internal:8042/node/containerlogs/container_1435696841856_0027_01_000002/nanounan
ue/stderr?start=0, SPARK_YARN_STAGING_DIR -> .sparkStaging/application_1435696841856_0027, SPARK_YARN_CACHE_FILES_FILE_SIZES -> 162896305,281333,37562,2448, SPARK_USER -> nanounanue, SPARK_YARN_CACHE_FILES_VISIBILITIES -> PRIVATE,PRIVATE,PRIVATE,PRIVATE, SPARK_YARN_MODE -> 
true, SPARK_YARN_CACHE_FILES_TIME_STAMPS -> 1435784032445,1435784032613,1435784032652,1435784032692, PYTHONPATH -> pyspark.zip:py4j-0.8.2.1-src.zip, SPARK_LOG_URL_STDOUT -> http://ip-10-0-0-99.ec2.internal:8042/node/containerlogs/container_1435696841856_0027_01_000002/nanou
nanue/stdout?start=0, SPARK_YARN_CACHE_FILES -> hdfs://ip-10-0-0-220.ec2.internal:8020/user/nanounanue/.sparkStaging/application_1435696841856_0027/spark-assembly-1.4.0-hadoop2.6.0.jar#__spark__.jar,hdfs://ip-10-0-0-220.ec2.internal:8020/user/nanounanue/.sparkStaging/applic
ation_1435696841856_0027/pyspark.zip#pyspark.zip,hdfs://ip-10-0-0-220.ec2.internal:8020/user/nanounanue/.sparkStaging/application_1435696841856_0027/py4j-0.8.2.1-src.zip#py4j-0.8.2.1-src.zip,hdfs://ip-10-0-0-220.ec2.internal:8020/user/nanounanue/.sparkStaging/application_14
35696841856_0027/minimal-example2.py#minimal-example2.py)
15/07/01 16:54:11 INFO yarn.ExecutorRunnable: Setting up executor with commands: List(LD_LIBRARY_PATH="/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/lib/native:$LD_LIBRARY_PATH", {{JAVA_HOME}}/bin/java, -server, -XX:OnOutOfMemoryError='kill %p', -Xms1024m, -Xmx
1024m, -Djava.io.tmpdir={{PWD}}/tmp, '-Dspark.ui.port=0', '-Dspark.driver.port=41190', -Dspark.yarn.app.container.log.dir=<LOG_DIR>, org.apache.spark.executor.CoarseGrainedExecutorBackend, --driver-url, akka.tcp://sparkDriver@10.0.0.36:41190/user/CoarseGrainedScheduler, --e
xecutor-id, 1, --hostname, ip-10-0-0-99.ec2.internal, --cores, 1, --app-id, application_1435696841856_0027, --user-class-path, file:$PWD/__app__.jar, 1>, <LOG_DIR>/stdout, 2>, <LOG_DIR>/stderr)
15/07/01 16:54:11 INFO yarn.ExecutorRunnable: Setting up executor with commands: List(LD_LIBRARY_PATH="/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/lib/native:$LD_LIBRARY_PATH", {{JAVA_HOME}}/bin/java, -server, -XX:OnOutOfMemoryError='kill %p', -Xms1024m, -Xmx
1024m, -Djava.io.tmpdir={{PWD}}/tmp, '-Dspark.ui.port=0', '-Dspark.driver.port=41190', -Dspark.yarn.app.container.log.dir=<LOG_DIR>, org.apache.spark.executor.CoarseGrainedExecutorBackend, --driver-url, akka.tcp://sparkDriver@10.0.0.36:41190/user/CoarseGrainedScheduler, --e
xecutor-id, 2, --hostname, ip-10-0-0-37.ec2.internal, --cores, 1, --app-id, application_1435696841856_0027, --user-class-path, file:$PWD/__app__.jar, 1>, <LOG_DIR>/stdout, 2>, <LOG_DIR>/stderr)
15/07/01 16:54:11 INFO impl.ContainerManagementProtocolProxy: Opening proxy : ip-10-0-0-37.ec2.internal:8041
15/07/01 16:54:14 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. ip-10-0-0-99.ec2.internal:43176
15/07/01 16:54:15 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. ip-10-0-0-37.ec2.internal:58472
15/07/01 16:54:15 INFO cluster.YarnClusterSchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@ip-10-0-0-99.ec2.internal:49047/user/Executor#563862009]) with ID 1
15/07/01 16:54:15 INFO cluster.YarnClusterSchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@ip-10-0-0-37.ec2.internal:36122/user/Executor#1370723906]) with ID 2
15/07/01 16:54:15 INFO cluster.YarnClusterSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
15/07/01 16:54:15 INFO cluster.YarnClusterScheduler: YarnClusterScheduler.postStartHook done
15/07/01 16:54:15 INFO storage.BlockManagerMasterEndpoint: Registering block manager ip-10-0-0-99.ec2.internal:59769 with 530.3 MB RAM, BlockManagerId(1, ip-10-0-0-99.ec2.internal, 59769)
15/07/01 16:54:16 INFO storage.BlockManagerMasterEndpoint: Registering block manager ip-10-0-0-37.ec2.internal:48859 with 530.3 MB RAM, BlockManagerId(2, ip-10-0-0-37.ec2.internal, 48859)
15/07/01 16:54:16 INFO hive.HiveContext: Initializing execution hive, version 0.13.1
15/07/01 16:54:17 INFO metastore.HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/07/01 16:54:17 INFO metastore.ObjectStore: ObjectStore, initialize called
15/07/01 16:54:17 INFO spark.SparkContext: Invoking stop() from shutdown hook
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/api,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
15/07/01 16:54:17 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
15/07/01 16:54:17 INFO ui.SparkUI: Stopped Spark web UI at http://10.0.0.36:37958
15/07/01 16:54:17 INFO scheduler.DAGScheduler: Stopping DAGScheduler
15/07/01 16:54:17 INFO cluster.YarnClusterSchedulerBackend: Shutting down all executors
15/07/01 16:54:17 INFO cluster.YarnClusterSchedulerBackend: Asking each executor to shut down
15/07/01 16:54:17 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. ip-10-0-0-99.ec2.internal:49047
15/07/01 16:54:17 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. ip-10-0-0-37.ec2.internal:36122
15/07/01 16:54:17 INFO ui.SparkUI: Stopped Spark web UI at http://10.0.0.36:37958
15/07/01 16:54:17 INFO scheduler.DAGScheduler: Stopping DAGScheduler
15/07/01 16:54:17 INFO cluster.YarnClusterSchedulerBackend: Shutting down all executors
15/07/01 16:54:17 INFO cluster.YarnClusterSchedulerBackend: Asking each executor to shut down
15/07/01 16:54:17 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. ip-10-0-0-99.ec2.internal:49047
15/07/01 16:54:17 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. ip-10-0-0-37.ec2.internal:36122
15/07/01 16:54:17 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
15/07/01 16:54:17 INFO storage.MemoryStore: MemoryStore cleared
15/07/01 16:54:17 INFO storage.BlockManager: BlockManager stopped
15/07/01 16:54:17 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
15/07/01 16:54:17 INFO spark.SparkContext: Successfully stopped SparkContext
15/07/01 16:54:17 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
15/07/01 16:54:17 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
15/07/01 16:54:17 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
15/07/01 16:54:17 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0, (reason: Shutdown hook called before final status was reported.)
15/07/01 16:54:17 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED (diag message: Shutdown hook called before final status was reported.)
15/07/01 16:54:17 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
15/07/01 16:54:17 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
15/07/01 16:54:17 INFO yarn.ApplicationMaster: Deleting staging directory .sparkStaging/application_1435696841856_0027
15/07/01 16:54:17 INFO util.Utils: Shutdown hook called
15/07/01 16:54:17 INFO util.Utils: Deleting directory /yarn/nm/usercache/nanounanue/appcache/application_1435696841856_0027/pyspark-215f5c19-b1cb-47df-ad43-79da4244de61
15/07/01 16:54:17 INFO util.Utils: Deleting directory /yarn/nm/usercache/nanounanue/appcache/application_1435696841856_0027/container_1435696841856_0027_01_000001/tmp/spark-c96dc9dc-e6ee-451b-b09e-637f5d4ca990

LogType: stdout
LogLength: 2404
Log Contents:
[(u'spark.eventLog.enabled', u'true'), (u'spark.submit.pyArchives', u'pyspark.zip:py4j-0.8.2.1-src.zip'), (u'spark.yarn.app.container.log.dir', u'/var/log/hadoop-yarn/container/application_1435696841856_0027/container_1435696841856_0027_01_000001'), (u'spark.eventLog.dir', 
u'hdfs://ip-10-0-0-220.ec2.internal:8020/user/spark/applicationHistory'), (u'spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS', u'ip-10-0-0-220.ec2.internal'), (u'spark.yarn.historyServer.address', u'http://ip-10-0-0-220.ec2.internal:18088'
), (u'spark.ui.port', u'0'), (u'spark.yarn.app.id', u'application_1435696841856_0027'), (u'spark.app.name', u'minimal-example2.py'), (u'spark.executor.instances', u'2'), (u'spark.executorEnv.PYTHONPATH', u'pyspark.zip:py4j-0.8.2.1-src.zip'), (u'spark.submit.pyFiles', u''), 
(u'spark.executor.extraLibraryPath', u'/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/lib/native'), (u'spark.master', u'yarn-cluster'), (u'spark.ui.filters', u'org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter'), (u'spark.org.apache.hadoop.yarn.server.w
ebproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES', u'http://ip-10-0-0-220.ec2.internal:8088/proxy/application_1435696841856_0027'), (u'spark.driver.extraLibraryPath', u'/opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/lib/hadoop/lib/native'), (u'spark.yarn.app.attemptId', u
'1')]
<pyspark.context.SparkContext object at 0x3fd53d0>
1.4.0
<pyspark.sql.context.HiveContext object at 0x40a9110>
Traceback (most recent call last):
  File "minimal-example2.py", line 53, in <module>
    access = sqlContext.read.json("hdfs://10.0.0.220/raw/logs/arquimedes/access/*.json")
  File "/yarn/nm/usercache/nanounanue/appcache/application_1435696841856_0027/container_1435696841856_0027_01_000001/pyspark.zip/pyspark/sql/context.py", line 591, in read
  File "/yarn/nm/usercache/nanounanue/appcache/application_1435696841856_0027/container_1435696841856_0027_01_000001/pyspark.zip/pyspark/sql/readwriter.py", line 39, in __init__
  File "/yarn/nm/usercache/nanounanue/appcache/application_1435696841856_0027/container_1435696841856_0027_01_000001/pyspark.zip/pyspark/sql/context.py", line 619, in _ssql_ctx
Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o53))

The important part is the last line: "You must build Spark with Hive." Why? What am I doing wrong?

3 Answers:

Answer 0 (score: 6):

I ran into the same problem recently. It turned out that the message from Spark is misleading; there are no missing jars. In my case the problem was that the Java class HiveContext, which PySpark instantiates, parses hive-site.xml during construction, and that construction was throwing an exception. (PySpark catches this exception and wrongly implies that it is caused by missing jars.) It ultimately came down to the property hive.metastore.client.connect.retry.delay, which was set to 2s; the HiveContext class tries to parse it as an integer and fails. Change it to 2, and also remove the 's' suffix from hive.metastore.client.socket.timeout and hive.metastore.client.socket.lifetime.
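
For illustration only, the offending entries in hive-site.xml would end up looking something like the fragment below. The property names come from this answer; the surrounding XML structure and the concrete values are assumptions for the sketch:

<!-- Illustrative hive-site.xml fragment: HiveContext (Spark 1.4 / Hive 0.13)
     fails to parse duration suffixes such as "2s" in these properties. -->
<property>
  <name>hive.metastore.client.connect.retry.delay</name>
  <value>2</value>    <!-- was "2s": keep the number, drop the "s" -->
</property>
<property>
  <name>hive.metastore.client.socket.timeout</name>
  <value>20</value>   <!-- drop the trailing "s" here as well (value is an assumption) -->
</property>
<property>
  <name>hive.metastore.client.socket.lifetime</name>
  <value>0</value>    <!-- drop the trailing "s" here as well (value is an assumption) -->
</property>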

Note that you can call sqlContext._get_hive_ctx() directly to get a more descriptive error.
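
For example, a minimal sketch of using that call to surface the underlying error (note that _get_hive_ctx() is a private pyspark API, so this is for debugging only):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="HiveContext debug")
sqlContext = HiveContext(sc)

try:
    # Constructs the underlying Java HiveContext immediately, so the original
    # Py4JJavaError (with the Java stack trace) surfaces instead of the
    # misleading "You must build Spark with Hive" wrapper.
    sqlContext._get_hive_ctx()
except Exception as e:
    print(e)

sc.stop()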

Answer 1 (score: 2):

You should create a SQLContext instead of a HiveContext:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
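
Applied to the script from the question, a minimal sketch would look like this (a plain SQLContext can read the JSON files, but it does not give access to the Hive metastore):

# -*- coding: utf-8 -*-
from __future__ import print_function

from pyspark import SparkContext
from pyspark.sql import SQLContext

if __name__ == "__main__":
    sc = SparkContext(appName="Minimal Example 2")

    # SQLContext does not parse hive-site.xml, so it avoids the HiveContext
    # construction failure -- at the cost of losing Hive metastore access.
    sqlContext = SQLContext(sc)

    access = sqlContext.read.json("hdfs://10.0.0.220/raw/logs/arquimedes/access/*.json")
    access.printSchema()

    sc.stop()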

Answer 2 (score: -1):

It also says: 'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n'

So the problem seems to be that the Hive part is not provided to the spark-submit command, and the cluster cannot find the Hive dependencies. Just as the message says:

Export 'SPARK_HIVE=true'

In theory, this should let you build a jar that includes the Hive dependencies, so Spark can find the libs it is missing.
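
For reference, a hedged sketch of such a build (the exact profiles depend on the Spark and Hadoop versions, and on a CDH cluster the shipped Spark parcel is normally already built with Hive, so rebuilding is usually unnecessary):

$ # Spark 1.4 source build with Hive support (illustrative profile selection)
$ build/mvn -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver -DskipTests clean package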