Cannot run the Apache Spark Pi example from Python

Asked: 2015-01-21 16:57:43

Tags: python apache-spark ipython-notebook

I've set up my first Spark cluster (1 master, 2 workers) and an IPython notebook server that I use to access the cluster. The workers run Python from Anaconda to make sure the Python setup is correct on each box. The IPython notebook server appears to have everything configured correctly, and I can initialize Spark and submit jobs. However, the jobs fail, and I'm not sure how to troubleshoot. Here's the code:

from pyspark import SparkContext
from numpy import random
CLUSTER_URL = 'spark://192.168.1.20:7077'
sc = SparkContext( CLUSTER_URL, 'pyspark')
def sample(p):
    from numpy import random
    x, y = random(), random()
    return 1 if x*x + y*y < 1 else 0

count = sc.parallelize(xrange(0, 20)).map(sample).reduce(lambda a, b: a + b)
print "Pi is roughly %f" % (4.0 * count / 20)

And here's the error:

  

Py4JJavaError                             Traceback (most recent call last)
<ipython-input-...> in <module>()
      3     return 1 if x*x + y*y < 1 else 0
      4 
----> 5 count = sc.parallelize(xrange(0, 20)).map(sample).reduce(lambda a, b: a + b)
      6 print "Pi is roughly %f" % (4.0 * count / 20)

/opt/spark-1.2.0/python/pyspark/rdd.pyc in reduce(self, f)
    713             yield reduce(f, iterator, initial)
    714 
--> 715         vals = self.mapPartitions(func).collect()
    716         if vals:
    717             return reduce(f, vals)

/opt/spark-1.2.0/python/pyspark/rdd.pyc in collect(self)
    674         """
    675         with SCCallSiteSync(self.context) as css:
--> 676             bytesInJava = self._jrdd.collect().iterator()
    677         return list(self._collect_iterator_through_file(bytesInJava))
    678 

/opt/spark-1.2.0/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
    536         answer = self.gateway_client.send_command(command)
    537         return_value = get_return_value(answer, self.gateway_client,
--> 538                 self.target_id, self.name)
    539 
    540         for temp_arg in temp_args:

/opt/spark-1.2.0/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    298                 raise Py4JJavaError(
    299                     'An error occurred while calling {0}{1}{2}.\n'.
--> 300                     format(target_id, '.', name), value)
    301             else:
    302                 raise Py4JError(

Py4JJavaError: An error occurred while calling o28.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 31 in stage 0.0 failed 4 times, most recent failure: Lost task 31.3 in stage 0.0 (TID 72, 192.168.1.21): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/spark-1.2.0/python/pyspark/worker.py", line 107, in main
    process()
  File "/opt/spark-1.2.0/python/pyspark/worker.py", line 98, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/opt/spark-1.2.0/python/pyspark/serializers.py", line 227, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/opt/spark-1.2.0/python/pyspark/rdd.py", line 710, in func
    initial = next(iterator)
  File "<ipython-input-...>", line 2, in sample
TypeError: 'module' object is not callable

    at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:137)
    at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:174)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.ActorCell.invoke(ActorCell.scala:487)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

I'm not even sure where to start debugging or diagnosing this, so any help would be greatly appreciated. Happy to post other logs if that would help.

1 answer:

Answer 0 (score: 3):

numpy.random is a Python package (a module object), so you can't call it with random().
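
To illustrate the distinction (my own minimal snippet, not from the original post, runnable locally outside Spark):

from numpy import random

print(callable(random))    # False -- `random` here is the numpy.random module itself
print(random.random())     # OK -- random.random() is the function; returns a float in [0, 1)
# random()                 # would raise TypeError: 'module' object is not callable, as in the worker traceback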

I think you want to use random.random() (documentation).
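
As a sketch of the fix, based on the question's code (untested here), the sample function would call the random() function on the module instead of calling the module itself:

def sample(p):
    # import inside the function so the module is available in the worker processes
    from numpy import random
    # call random.random(), the function, not `random`, the module
    x, y = random.random(), random.random()
    return 1 if x*x + y*y < 1 else 0

count = sc.parallelize(xrange(0, 20)).map(sample).reduce(lambda a, b: a + b)
print "Pi is roughly %f" % (4.0 * count / 20)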