PySpark socket connection

Date: 2017-08-22 19:04:13

Tags: java python sockets apache-spark pyspark

I am installing Spark on a set of VMs. I should also note that I followed the same installation procedure I have used many times in the past, on both physical servers and VMs, and I have never seen this problem before, so I am puzzled as to why I am seeing it now.

However, it seems that pyspark has some trouble initializing the SparkContext:

>pyspark
Python 2.7.12 |Anaconda custom (64-bit)| (default, Jul  2 2016, 17:42:40)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/08/22 13:24:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/08/22 13:24:49 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Traceback (most recent call last):
  File "/home/jon/spark/python/pyspark/shell.py", line 43, in <module>
    spark = SparkSession.builder\
  File "/home/jon/spark/python/pyspark/sql/session.py", line 169, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/home/jon/spark/python/pyspark/context.py", line 310, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/home/jon/spark/python/pyspark/context.py", line 118, in __init__
    conf, jsc, profiler_cls)
  File "/home/jon/spark/python/pyspark/context.py", line 188, in _do_init
    self._accumulatorServer = accumulators._start_update_server()
  File "/home/jon/spark/python/pyspark/accumulators.py", line 259, in _start_update_server
    server = AccumulatorServer(("localhost", 0), _UpdateRequestHandler)
  File "/apps/usr/local64/anaconda/lib/python2.7/SocketServer.py", line 417, in __init__
    self.server_bind()
  File "/apps/usr/local64/anaconda/lib/python2.7/SocketServer.py", line 431, in server_bind
    self.socket.bind(self.server_address)
  File "/apps/usr/local64/anaconda/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
socket.gaierror: [Errno -2] Name or service not known
>>> quit()
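
The failing call in the traceback is the accumulator update server binding a socket to ("localhost", 0) on the Python side. That step can be reproduced without Spark at all, which may help narrow things down; a minimal sketch in plain Python:

import socket

# Bind a TCP server socket to ("localhost", 0), the same call that
# pyspark/accumulators.py makes; this raises
# socket.gaierror: [Errno -2] Name or service not known
# if "localhost" cannot be resolved on this machine.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("localhost", 0))
print(s.getsockname())
s.close()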

Interestingly, spark-shell does not show this problem. My hunch is that Python is having trouble connecting to the server started by the JVM. Does anyone have any suggestions on how to fix or debug this?

>spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/08/22 13:13:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/08/22 13:13:59 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://172.25.5.46:4040
Spark context available as 'sc' (master = local[*], app id = local-1503425633272).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.1
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_25)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

When I try to run a simple program:
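
The script itself is not included in the question; a minimal sketch of what such a test-pyspark.py might look like (the contents here are an assumption, only the SparkContext construction matters for the error):

# test-pyspark.py -- hypothetical sketch; the real script is not shown in the question
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("test-pyspark")
sc = SparkContext(conf=conf)   # the call that fails (line 5 of the real script, per the traceback below)
print(sc.parallelize(range(100)).sum())
sc.stop()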

I see the following error, similar to the one above:

spark-submit test-pyspark.py
17/08/22 13:47:37 INFO SparkContext: Running Spark version 2.1.1
17/08/22 13:47:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/08/22 13:47:37 INFO SecurityManager: Changing view acls to: jon
17/08/22 13:47:37 INFO SecurityManager: Changing modify acls to: jon
17/08/22 13:47:37 INFO SecurityManager: Changing view acls groups to:
17/08/22 13:47:37 INFO SecurityManager: Changing modify acls groups to:
17/08/22 13:47:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(jon); groups with view permissions: Set(); users  with modify permissions: Set(jon); groups with modify permissions: Set()
17/08/22 13:47:38 INFO Utils: Successfully started service 'sparkDriver' on port 51440.
17/08/22 13:47:38 INFO SparkEnv: Registering MapOutputTracker
17/08/22 13:47:38 INFO SparkEnv: Registering BlockManagerMaster
17/08/22 13:47:38 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/08/22 13:47:38 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/08/22 13:47:38 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-c3ad2263-4416-45f2-927b-8517e4f3213f
17/08/22 13:47:38 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
17/08/22 13:47:38 INFO SparkEnv: Registering OutputCommitCoordinator
17/08/22 13:47:38 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/08/22 13:47:38 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://172.25.5.46:4040
17/08/22 13:47:38 INFO SparkContext: Added file file:/home/jon/test-pyspark.py at file:/home/jon/test-pyspark.py with timestamp 1503427658741
17/08/22 13:47:38 INFO Utils: Copying /home/jon/test-pyspark.py to /tmp/spark-71ba944d-e11b-4cd5-bfcc-386f85b28a9a/userFiles-095d828d-24ec-43a2-ac58-4d9eb07177aa/test-pyspark.py
17/08/22 13:47:38 INFO Executor: Starting executor ID driver on host localhost
17/08/22 13:47:38 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 56262.
17/08/22 13:47:38 INFO NettyBlockTransferService: Server created on 172.25.5.46:56262
17/08/22 13:47:38 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/08/22 13:47:38 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 172.25.5.46, 56262, None)
17/08/22 13:47:38 INFO BlockManagerMasterEndpoint: Registering block manager 172.25.5.46:56262 with 366.3 MB RAM, BlockManagerId(driver, 172.25.5.46, 56262, None)
17/08/22 13:47:38 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 172.25.5.46, 56262, None)
17/08/22 13:47:38 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 172.25.5.46, 56262, None)
17/08/22 13:47:39 INFO SparkUI: Stopped Spark web UI at http://172.25.5.46:4040
17/08/22 13:47:39 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/08/22 13:47:39 INFO MemoryStore: MemoryStore cleared
17/08/22 13:47:39 INFO BlockManager: BlockManager stopped
17/08/22 13:47:39 INFO BlockManagerMaster: BlockManagerMaster stopped
**17/08/22 13:47:39 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!**
**17/08/22 13:47:39 INFO SparkContext: Successfully stopped SparkContext**
Traceback (most recent call last):
  File "/home/jon/test-pyspark.py", line 5, in <module>
    sc = SparkContext(conf=conf)
  File "/home/jon/spark/python/lib/pyspark.zip/pyspark/context.py", line 118, in __init__
  File "/home/jon/spark/python/lib/pyspark.zip/pyspark/context.py", line 188, in _do_init
  File "/home/jon/spark/python/lib/pyspark.zip/pyspark/accumulators.py", line 259, in _start_update_server
  File "/apps/usr/local64/anaconda/lib/python2.7/SocketServer.py", line 417, in __init__
    self.server_bind()
  File "/apps/usr/local64/anaconda/lib/python2.7/SocketServer.py", line 431, in server_bind
    self.socket.bind(self.server_address)
  File "/apps/usr/local64/anaconda/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
socket.gaierror: [Errno -2] Name or service not known
17/08/22 13:47:39 INFO ShutdownHookManager: Shutdown hook called
17/08/22 13:47:39 INFO ShutdownHookManager: Deleting directory /tmp/spark-71ba944d-e11b-4cd5-bfcc-386f85b28a9a

1 Answer:

Answer 0 (score: 1)

It looks like PySpark cannot start the TCP server used for accumulator updates. The AccumulatorServer is started on localhost, and the error:

socket.gaierror: [Errno -2] Name or service not known

suggests a problem with address resolution. Please double check your network configuration, starting with /etc/hosts.
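
For reference, the bind target is hard-coded in pyspark/accumulators.py; the AccumulatorServer line below is taken verbatim from the traceback above, while the surrounding function is quoted from memory of Spark 2.1.x and may differ slightly:

def _start_update_server():
    """Start a TCP server to receive accumulator updates in a daemon thread, and returns it"""
    server = AccumulatorServer(("localhost", 0), _UpdateRequestHandler)
    thread = threading.Thread(target=server.serve_forever)
    thread.daemon = True
    thread.start()
    return server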

Based on the error, this looks like a network configuration issue. Could you include your /etc/hosts?

It turned out that the fix was to correct the permissions on /etc/hosts so that the VM has read access.
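
A quick way to confirm both the permission problem and the resulting resolution failure, run as the same account that starts pyspark (a minimal sketch using only the standard library):

import os
import socket

# Read permission on /etc/hosts for the current user; False here is
# consistent with the resolver being unable to look up "localhost".
print(os.access("/etc/hosts", os.R_OK))

# Raises socket.gaierror: [Errno -2] Name or service not known
# if "localhost" cannot be resolved on this machine.
print(socket.gethostbyname("localhost"))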