I have installed a Spark cluster across 4 different machines. Each machine has 7.7GB of RAM and an 8-core i7 processor. I am using PySpark and trying to load 5 numpy arrays (2.9GB each) into the cluster. They are all parts of a larger 14GB numpy array that I generated on a single standalone machine. I tried to run a simple count on the first RDD to make sure my cluster was working properly. When executing it, I get the following warning:
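For context, producing the 5 part files described above from one large array might look like the following sketch. The file names, array size, and 5-way split here are illustrative assumptions, not taken from the question:

```python
import numpy as np

# Hypothetical sketch: split one large array into 5 parts and save each part.
# The shape is a tiny stand-in for the real 14GB array.
big = np.arange(20).reshape(10, 2)

# Split row-wise into 5 roughly equal chunks.
parts = np.array_split(big, 5, axis=0)

for i, part in enumerate(parts, start=1):
    np.save('gen%d.npy' % i, part)  # each part can later be read with np.load
```

Each saved part can then be loaded with `np.load` and passed to `sc.parallelize`, as in the session above.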
>>> import numpy as np
>>> gen1 = sc.parallelize(np.load('/home/hduser/gen1.npy'),512)
>>> gen1.count()
[Stage 0:> (0 + 0) / 512]
17/01/28 13:07:07 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
17/01/28 13:07:22 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
17/01/28 13:07:37 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
[Stage 0:> (0 + 0) / 512]
17/01/28 13:07:52 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
^C
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/spark/python/pyspark/rdd.py", line 1008, in count
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
File "/opt/spark/python/pyspark/rdd.py", line 999, in sum
return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
File "/opt/spark/python/pyspark/rdd.py", line 873, in fold
vals = self.mapPartitions(func).collect()
File "/opt/spark/python/pyspark/rdd.py", line 776, in collect
port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
File "/opt/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 931, in __call__
File "/opt/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 695, in send_command
File "/opt/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 828, in send_command
File "/home/hduser/anaconda2/lib/python2.7/socket.py", line 451, in readline
data = self._sock.recv(self._rbufsize)
File "/opt/spark/python/pyspark/context.py", line 223, in signal_handler
raise KeyboardInterrupt()
KeyboardInterrupt
When I check my cluster UI, it shows 3 workers up and running, but only 1 executor (the driver, associated with my master's IP). I assume this is a configuration problem.
My settings in spark-env.sh (master):
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_MASTER_IP=192.168.1.2
These settings are identical on each worker machine.
My settings in spark-defaults.conf (master):
spark.master spark://lebron:7077
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 5g
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
spark.kryoserializer.buffer.max 128m
Each worker has only the spark.master and spark.serializer configuration options set, with the same values as above.
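If worker resources also need to be declared explicitly, the usual standalone-mode options go in spark-env.sh on each worker. A minimal sketch, where the specific values are assumptions chosen to leave headroom for the OS and are not from the question:

```shell
# spark-env.sh on each worker (values are illustrative assumptions)
export SPARK_WORKER_CORES=8      # cores this worker offers to executors
export SPARK_WORKER_MEMORY=6g    # memory offered, below the 7.7GB physical total
```

Without these, Spark standalone defaults to all cores and (total memory minus 1GB), which is often fine but worth pinning down when debugging resource warnings.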
I also need to figure out how to tune my memory management, because before this issue appeared I was throwing Java heap space OOM exceptions left and right when I should have had plenty of memory. But maybe I'll save that for another question.
Please help!
Answer 0 (score: 0)
If you can see the Spark slaves in the web UI but they are not accepting jobs, the most likely cause is a firewall blocking communication.
You can test it as described in my other answer: Apache Spark on Mesos: Initial job has not accepted any resources
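As a quick sanity check (my own addition, not from the linked answer), you can probe from each worker whether the master's ports are reachable with a small TCP check. The host and port below are assumptions based on the question's spark-defaults.conf; the driver also opens ports that workers must be able to reach back to:

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        s = socket.create_connection((host, port), timeout=timeout)
        s.close()
        return True
    except OSError:
        return False

# Example (values assumed from the question's config): run this on a worker
# to check whether the standalone master port is reachable.
# print(port_open('192.168.1.2', 7077))
```

If the probe fails from a worker but succeeds locally on the master, a firewall rule (or a master bound only to localhost) is the likely culprit.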