Adding/deploying dependent libraries to a PySpark environment

Date: 2016-05-14 01:58:49

Tags: numpy apache-spark pyspark

I have a PySpark program that relies on the numpy library under the hood. numpy is not installed on the worker nodes, and I do not have permission to install it there. When I run spark-shell, I use '--py-files' to ship the numpy library to the worker nodes at runtime.
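The invocation looks roughly like the sketch below (shown with spark-submit for illustration; the archive and script names are placeholders rather than my exact command):

spark-submit --master yarn --py-files numpy.zip my_job.py

But I get the following error message: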

File "/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 232.0 failed 4 times, most recent failure: Lost task 0.3 in stage 232.0 (TID 60801, anp-r01wn02.c03.hadoop.td.com): org.apache.spark.SparkException:
Error from python worker:
/usr/bin/python: No module named mtrand
PYTHONPATH was:
/usr/lib/spark/lib/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar:/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip:/usr/lib/spark/python/::/usr/lib/spark/python/lib/py4j-0.8.2.1-src.zip:/usr/lib/spark/python/lib/pyspark.zip:/data/10/yarn/nm/usercache/zakerh2/appcache/application_1462889699566_2857/container_e37_1462889699566_2857_01_000332/numpy.zip
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:135)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:73)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

What is the problem? Is it caused by another dependency inside numpy? How can I fix this?

What other options are there for installing or shipping numpy to the worker nodes? I have seen approaches that install Python packages with pip at runtime, but I am not sure how that would work with PySpark. Any ideas or comments on this?

1 answer:

Answer 0 (score: 0)

In the output you pasted, I see the error:

/usr/bin/python: No module named mtrand

In your code, you have...

import mtrand

You need...

import numpy.random.mtrand
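For comparison, here is a minimal sketch of the usual idiom, assuming your code only needs numpy's random number generation (the function choice is illustrative). mtrand is numpy's internal random-number extension module, and its functions are re-exported through numpy.random, so application code normally goes through that namespace rather than importing mtrand by name:

import numpy.random

# numpy.random re-exports the generators implemented in the internal
# mtrand extension module, so there is no need to import mtrand directly.
values = numpy.random.rand(5)  # five uniform samples from [0, 1)
print(values)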