Error from python worker: /usr/bin/python: No module named pyspark

Time: 2015-09-16 12:28:36

Tags: python hadoop apache-spark pyspark biginsights

I am trying to run Pyspark on YARN, but whenever I type any command on the console I get the error below.

I can run the Scala shell in Spark in both local and yarn mode. Pyspark runs fine in local mode but does not work in yarn mode. For reference, the two shells are launched as shown below.
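
This is a sketch of the launch commands; the yarn-client master string and the use of $SPARK_HOME are assumptions based on Spark 1.2 conventions, not taken from the original post:

# Scala shell: works in both modes
$SPARK_HOME/bin/spark-shell --master local[2]
$SPARK_HOME/bin/spark-shell --master yarn-client

# Python shell: works locally, fails on YARN with the error below
$SPARK_HOME/bin/pyspark --master local[2]
$SPARK_HOME/bin/pyspark --master yarn-client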

Operating system: RHEL 6.x

Hadoop distribution: IBM BigInsights 4.0

Spark version: 1.2.1


WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, work): org.apache.spark.SparkException:
Error from python worker:
  /usr/bin/python: No module named pyspark
PYTHONPATH was:
  /mnt/sdj1/hadoop/yarn/local/filecache/13/spark-assembly.jar
(my comment: this path does not exist on the linux file system, but on the logical data nodes)
java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
        at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
        at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
        at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:102)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:56)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

I have set SPARK_HOME and PYTHONPATH with export commands, as shown below:

export SPARK_HOME=/path/to/spark
export PYTHONPATH=/path/to/spark/python/:/path/to/spark/lib/spark-assembly.jar
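
Note that these exports only affect the driver process; the executors that YARN launches on the worker nodes do not inherit them. On Spark 1.2 the executor and application-master environments can be set explicitly instead, for example in conf/spark-defaults.conf. This is only a sketch, not a confirmed fix for BigInsights, and the assembly-jar path is an assumption:

# Requires Spark to be installed at the same path on every node
spark.executorEnv.PYTHONPATH        /path/to/spark/python:/path/to/spark/lib/spark-assembly.jar
spark.yarn.appMasterEnv.PYTHONPATH  /path/to/spark/python:/path/to/spark/lib/spark-assembly.jar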

Can someone help me resolve this issue?

Answer:

After some digging, I found that pyspark indeed has some issues in BigInsights 4.0. We were advised to upgrade to BI 4.1.
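
For anyone hitting this on a similar stack: the PYTHONPATH in the error points at YARN's localized copy of spark-assembly.jar, so the executors can only import pyspark if the Python sources were actually packed into that assembly. One way to check the jar shipped with the distribution, using standard JDK tooling (the jar path is an assumption):

jar tf /path/to/spark/lib/spark-assembly.jar | grep 'pyspark/' | head -n 5
# Expect entries such as pyspark/__init__.py and pyspark/rdd.py.
# If nothing is printed, the assembly was built without the Python
# files and pyspark cannot work in yarn mode with this jar.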
