Spark 1.6.0 on Mesos cannot resolve the Hadoop cluster name

Date: 2016-01-26 07:31:27

Tags: hadoop apache-spark mesos

I get an UnknownHostException when running simple code with Spark on Mesos + Hadoop. On the first run the job throws UnknownHostException and fails, but when I run the same job a second time it completes successfully.

The same test code runs fine on Spark 1.6.0 + YARN + Hadoop.

Test code:

file = sc.textFile("hdfs://cluster1/user/root/readme.txt")
file.count()

The first run fails (the Hadoop cluster name cannot be resolved) while a second run succeeds. Here is the session showing what happens:

 root@hadoopnn1 /opt/spark-1.6.0 > ./bin/pyspark --master mesos://zk://hadoopnn1.nogle.com:2181,hadoopnn2.nogle.com:2181,hadoopslave1.nogle.com:2181/mesos
Python 3.5.1 (default, Dec  8 2015, 10:40:49) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-16)] on linux
Type "help", "copyright", "credits" or "license" for more information.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/spark-1.6.0/lib/spark-assembly-1.6.0-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/flume-ng/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/01/26 11:52:36 INFO spark.SparkContext: Running Spark version 1.6.0
I0126 11:52:36.671980 10758 slave.cpp:3896] Framework 53fdde51-729b-4aa0-b0b1-2fc93b59de61-0002 seems to have exited. Ignoring shutdown timeout for executor '0'
16/01/26 11:52:36 INFO spark.SecurityManager: Changing view acls to: root
16/01/26 11:52:36 INFO spark.SecurityManager: Changing modify acls to: root
16/01/26 11:52:36 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
16/01/26 11:52:37 INFO util.Utils: Successfully started service 'sparkDriver' on port 37732.
16/01/26 11:52:37 INFO slf4j.Slf4jLogger: Slf4jLogger started
16/01/26 11:52:37 INFO Remoting: Starting remoting
16/01/26 11:52:37 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@10.1.30.112:45752]
16/01/26 11:52:37 INFO util.Utils: Successfully started service 'sparkDriverActorSystem' on port 45752.
16/01/26 11:52:37 INFO spark.SparkEnv: Registering MapOutputTracker
16/01/26 11:52:37 INFO spark.SparkEnv: Registering BlockManagerMaster
16/01/26 11:52:37 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-46550571-e2c2-4c88-b0ce-78fc159fd5d8
16/01/26 11:52:37 INFO storage.MemoryStore: MemoryStore started with capacity 511.5 MB
16/01/26 11:52:37 INFO spark.SparkEnv: Registering OutputCommitCoordinator
16/01/26 11:52:37 INFO server.Server: jetty-8.y.z-SNAPSHOT
16/01/26 11:52:37 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
16/01/26 11:52:37 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
16/01/26 11:52:37 INFO ui.SparkUI: Started SparkUI at http://10.1.30.112:4040
2016-01-26 11:52:37,964:11756(0x7fbce21fc700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
2016-01-26 11:52:37,964:11756(0x7fbce21fc700):ZOO_INFO@log_env@716: Client environment:host.name=hadoopnn1
2016-01-26 11:52:37,964:11756(0x7fbce21fc700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
2016-01-26 11:52:37,964:11756(0x7fbce21fc700):ZOO_INFO@log_env@724: Client environment:os.arch=2.6.32-504.el6.x86_64
2016-01-26 11:52:37,964:11756(0x7fbce21fc700):ZOO_INFO@log_env@725: Client environment:os.version=#1 SMP Wed Oct 15 04:27:16 UTC 2014
2016-01-26 11:52:37,964:11756(0x7fbce21fc700):ZOO_INFO@log_env@733: Client environment:user.name=root
2016-01-26 11:52:37,964:11756(0x7fbce21fc700):ZOO_INFO@log_env@741: Client environment:user.home=/root
2016-01-26 11:52:37,964:11756(0x7fbce21fc700):ZOO_INFO@log_env@753: Client environment:user.dir=/opt/spark-1.6.0
2016-01-26 11:52:37,964:11756(0x7fbce21fc700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=hadoopnn1.nogle.com:2181,hadoopnn2.nogle.com:2181,hadoopslave1.nogle.com:2181 sessionTimeout=10000 watcher=0x32b1d77c40 sessionId=0 sessionPasswd=<null> context=0x7fbd4c0017a0 flags=0
I0126 11:52:37.964711 11846 sched.cpp:166] Version: 0.26.0
2016-01-26 11:52:37,966:11756(0x7fbcde5f6700):ZOO_INFO@check_events@1703: initiated connection to server [10.1.30.113:2181]
2016-01-26 11:52:38,165:11756(0x7fbcde5f6700):ZOO_INFO@check_events@1750: session establishment complete on server [10.1.30.113:2181], sessionId=0x252529a8a57001e, negotiated timeout=10000
I0126 11:52:38.166128 11837 group.cpp:331] Group process (group(1)@10.1.30.112:57650) connected to ZooKeeper
I0126 11:52:38.166153 11837 group.cpp:805] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0126 11:52:38.166162 11837 group.cpp:403] Trying to create path '/mesos' in ZooKeeper
I0126 11:52:38.166968 11839 detector.cpp:156] Detected a new leader: (id='44')
I0126 11:52:38.167044 11842 group.cpp:674] Trying to get '/mesos/json.info_0000000044' in ZooKeeper
I0126 11:52:38.167554 11840 detector.cpp:482] A new leading master (UPID=master@10.1.30.112:5050) is detected
I0126 11:52:38.167595 11837 sched.cpp:264] New master detected at master@10.1.30.112:5050
I0126 11:52:38.167809 11837 sched.cpp:274] No credentials provided. Attempting to register without authentication
I0126 11:52:38.168429 11840 sched.cpp:643] Framework registered with 53fdde51-729b-4aa0-b0b1-2fc93b59de61-0003
16/01/26 11:52:38 INFO mesos.CoarseMesosSchedulerBackend: Registered as framework ID 53fdde51-729b-4aa0-b0b1-2fc93b59de61-0003
16/01/26 11:52:38 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 35540.
16/01/26 11:52:38 INFO netty.NettyBlockTransferService: Server created on 35540
16/01/26 11:52:38 INFO storage.BlockManagerMaster: Trying to register BlockManager
16/01/26 11:52:38 INFO storage.BlockManagerMasterEndpoint: Registering block manager 10.1.30.112:35540 with 511.5 MB RAM, BlockManagerId(driver, 10.1.30.112, 35540)
16/01/26 11:52:38 INFO storage.BlockManagerMaster: Registered BlockManager
I0126 11:52:38.215968 10751 slave.cpp:1294] Got assigned task 0 for framework 53fdde51-729b-4aa0-b0b1-2fc93b59de61-0003
I0126 11:52:38.216191 10751 slave.cpp:1410] Launching task 0 for framework 53fdde51-729b-4aa0-b0b1-2fc93b59de61-0003

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Python version 3.5.1 (default, Dec  8 2015 10:40:49)
SparkContext available as sc, HiveContext available as sqlContext.
>>> file = sc.textFile("hdfs://hadoopcluster1/user/root/readme.txt")
file.count()
16/01/26 11:52:40 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 169.9 KB, free 169.9 KB)
16/01/26 11:52:40 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 15.3 KB, free 185.2 KB)
16/01/26 11:52:40 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.1.30.112:35540 (size: 15.3 KB, free: 511.5 MB)
16/01/26 11:52:40 INFO spark.SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:-2
>>> 16/01/26 11:52:40 INFO mesos.CoarseMesosSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (hadoopnn1.nogle.com:39627) with ID 59f03bb4-1760-412f-a2ee-fb98d21ad6af-S3
16/01/26 11:52:40 INFO storage.BlockManagerMasterEndpoint: Registering block manager hadoopnn1.nogle.com:43934 with 511.5 MB RAM, BlockManagerId(59f03bb4-1760-412f-a2ee-fb98d21ad6af-S3, hadoopnn1.nogle.com, 43934)

16/01/26 11:52:43 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
16/01/26 11:52:43 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev ee66447eca13aaf50524c5266e409796640134a8]
16/01/26 11:52:43 INFO mapred.FileInputFormat: Total input paths to process : 1
..................
..................
16/01/26 11:52:44 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, hadoopnn1.nogle.com): java.lang.IllegalArgumentException: java.net.UnknownHostException: hadoopcluster1
    at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374)
    at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:312)
    at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:178)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:665)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:601)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:148)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
    at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:656)
    at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:436)
    at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:409)
    at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$33.apply(SparkContext.scala:1015)
    at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$33.apply(SparkContext.scala:1015)
    at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
    at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
    at scala.Option.map(Option.scala:145)
    at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:212)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.UnknownHostException: hadoopcluster1
    ... 38 more

>>> file = sc.textFile("hdfs://hadoopcluster1/user/root/readme.txt")
file.count()
16/01/26 11:52:46 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 91.2 KB, free 285.8 KB)
16/01/26 11:52:46 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 21.3 KB, free 307.0 KB)
16/01/26 11:52:46 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 10.1.30.112:35540 (size: 21.3 KB, free: 511.5 MB)
16/01/26 11:52:46 INFO spark.SparkContext: Created broadcast 2 from textFile at NativeMethodAccessorImpl.java:-2
>>> 
16/01/26 11:52:46 INFO mapred.FileInputFormat: Total input paths to process : 1
16/01/26 11:52:46 INFO spark.SparkContext: Starting job: count at <stdin>:1
16/01/26 11:52:46 INFO scheduler.DAGScheduler: Got job 1 (count at <stdin>:1) with 2 output partitions
16/01/26 11:52:46 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (count at <stdin>:1)
16/01/26 11:52:46 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/01/26 11:52:46 INFO scheduler.DAGScheduler: Missing parents: List()
16/01/26 11:52:46 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (PythonRDD[5] at count at <stdin>:1), which has no missing parents
16/01/26 11:52:46 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 5.7 KB, free 312.7 KB)
16/01/26 11:52:46 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 3.7 KB, free 316.4 KB)
16/01/26 11:52:46 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on 10.1.30.112:35540 (size: 3.7 KB, free: 511.5 MB)
16/01/26 11:52:46 INFO spark.SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1006
16/01/26 11:52:46 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (PythonRDD[5] at count at <stdin>:1)
16/01/26 11:52:46 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
16/01/26 11:52:46 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 8, hadoopnn1.nogle.com, partition 0,NODE_LOCAL, 2144 bytes)
16/01/26 11:52:46 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 1.0 (TID 9, hadoopnn1.nogle.com, partition 1,NODE_LOCAL, 2144 bytes)
16/01/26 11:52:46 INFO spark.ContextCleaner: Cleaned accumulator 2
16/01/26 11:52:46 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on 10.1.30.112:35540 in memory (size: 3.7 KB, free: 511.5 MB)
16/01/26 11:52:46 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on hadoopnn1.nogle.com:43934 (size: 3.7 KB, free: 511.5 MB)
16/01/26 11:52:46 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on hadoopnn1.nogle.com:43934 in memory (size: 3.7 KB, free: 511.5 MB)
16/01/26 11:52:46 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on hadoopnn1.nogle.com:43934 (size: 21.3 KB, free: 511.5 MB)
16/01/26 11:52:47 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 8) in 958 ms on hadoopnn1.nogle.com (1/2)
16/01/26 11:52:47 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 1.0 (TID 9) in 958 ms on hadoopnn1.nogle.com (2/2)
16/01/26 11:52:47 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
16/01/26 11:52:47 INFO scheduler.DAGScheduler: ResultStage 1 (count at <stdin>:1) finished in 0.961 s
16/01/26 11:52:47 INFO scheduler.DAGScheduler: Job 1 finished: count at <stdin>:1, took 1.006764 s
132

I tried adding export HADOOP_CONF_DIR to spark-env.sh, but the result is the same:

export HADOOP_CONF_DIR="/etc/hadoop/conf"
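For context, hdfs://hadoopcluster1 is an HDFS HA nameservice, i.e. a logical name defined in hdfs-site.xml rather than a DNS hostname, and a client can only resolve it if those HA properties are on its classpath. Below is a minimal sketch of the kind of properties involved; the nameservice name comes from the logs, but the NameNode hostnames are only guessed from the machine names above, and the NameNode IDs and RPC port 8020 are placeholders, not this cluster's actual configuration:

<!-- hdfs-site.xml (sketch): properties a client needs to resolve the logical name "hadoopcluster1" -->
<!-- NameNode hosts, IDs and port below are assumptions for illustration -->
<property>
  <name>dfs.nameservices</name>
  <value>hadoopcluster1</value>
</property>
<property>
  <name>dfs.ha.namenodes.hadoopcluster1</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.hadoopcluster1.nn1</name>
  <value>hadoopnn1.nogle.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.hadoopcluster1.nn2</name>
  <value>hadoopnn2.nogle.com:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.hadoopcluster1</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>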

Something still seemed to be missing, so I searched Stack Overflow and found a related question (linked below). After adding the following workaround configuration to spark-defaults.conf, it works:

spark.files file:///etc/hadoop/conf/hdfs-site.xml,file:///etc/hadoop/conf/core-site.xml

Related question: UnknownHostException with Mesos + Spark and custom Jar
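
For reference, the same workaround can also be passed per session on the command line instead of spark-defaults.conf; spark.files ships the listed files into each executor's working directory, where the executor-side Hadoop client can pick up the HA nameservice definition. A sketch reusing the master URL from the session above and the config paths shown earlier:

./bin/pyspark \
  --master mesos://zk://hadoopnn1.nogle.com:2181,hadoopnn2.nogle.com:2181,hadoopslave1.nogle.com:2181/mesos \
  --conf "spark.files=file:///etc/hadoop/conf/hdfs-site.xml,file:///etc/hadoop/conf/core-site.xml"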

So for now I am using this workaround, but I would still like to ask: is there a better way to handle this?

0 Answers:

No answers yet