I just don't understand this.
Locally I can run the pipeline without any problems. On the cluster, however, it looks like the pipeline cannot use the code from my local project:
ImportError: No module named 'themodule'
,[Ljava.lang.StackTraceElement;@282acd6d,org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/grid/3/hadoop/yarn/local/usercache/sfalk/appcache/application_1520347847754_0722/container_e48_1520347847754_0722_01_000006/pyspark.zip/pyspark/worker.py", line 159, in main
func, profiler, deserializer, serializer = read_udfs(pickleSer, infile)
File "/grid/3/hadoop/yarn/local/usercache/sfalk/appcache/application_1520347847754_0722/container_e48_1520347847754_0722_01_000006/pyspark.zip/pyspark/worker.py", line 91, in read_udfs
_, udf = read_single_udf(pickleSer, infile)
File "/grid/3/hadoop/yarn/local/usercache/sfalk/appcache/application_1520347847754_0722/container_e48_1520347847754_0722_01_000006/pyspark.zip/pyspark/worker.py", line 78, in read_single_udf
f, return_type = read_command(pickleSer, infile)
File "/grid/3/hadoop/yarn/local/usercache/sfalk/appcache/application_1520347847754_0722/container_e48_1520347847754_0722_01_000006/pyspark.zip/pyspark/worker.py", line 54, in read_command
command = serializer._read_with_length(file)
File "/grid/3/hadoop/yarn/local/usercache/sfalk/appcache/application_1520347847754_0722/container_e48_1520347847754_0722_01_000006/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
return self.loads(obj)
File "/grid/3/hadoop/yarn/local/usercache/sfalk/appcache/application_1520347847754_0722/container_e48_1520347847754_0722_01_000006/pyspark.zip/pyspark/serializers.py", line 419, in loads
return pickle.loads(obj, encoding=encoding)
ImportError: No module named 'themodule'
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:124)
at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:68)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:766)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:766)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
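From the traceback, the ImportError is raised in the executor-side Python worker (pyspark/worker.py → pickle.loads), i.e. when a worker process tries to unpickle a function that references themodule, not in the driver process. A small sketch of how one could check what the worker-side Python actually sees on its sys.path (purely for debugging; it assumes a SparkSession is already available):

import sys
from pyspark.sql import SparkSession

# Hypothetical check: run a tiny job on the executors and collect what
# each worker-side Python interpreter has on its sys.path.
spark = SparkSession.builder.getOrCreate()
paths = spark.sparkContext.parallelize(range(2), 2).map(lambda _: sys.path).collect()
for p in paths:
    print(p)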
So .. themodule is the root of my project, and themodule/otherstuff/ contains the methods and functionality I have factored out.
themodule/cluster/pipeline/run_luigi.py is where my pipeline gets started. It looks like this:
import logging
import luigi

from themodule.cluster.pipeline.task import PreprocessRawData


class MainTask(luigi.Task):
    def requires(self):
        return PreprocessRawData()

    def run(self):
        logging.info('All tasks done.')
        return

    def output(self):
        return self.requires().output()


if __name__ == '__main__':
    luigi.run()
    print('All done.')
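If the problem is simply that the executors never receive the package, a minimal sketch of how one could ship it from the driver (the archive path and the place where the SparkSession is created are assumptions on my part, not what the pipeline currently does):

import shutil
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('themodule-pipeline').getOrCreate()

# Zip the local 'themodule' package and distribute the archive to every
# executor; after this, 'import themodule' should also work inside the
# worker processes that unpickle the UDFs.
archive = shutil.make_archive('/tmp/themodule', 'zip', root_dir='.', base_dir='themodule')
spark.sparkContext.addPyFile(archive)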
Locally -> no problem. On the cluster -> see above.
The worst part is that I don't even see where that import should be failing. There is no obvious reason for it to fail at all.
I know this isn't pretty, but here is the little script I am executing to start the pipeline at the moment:
#!/bin/bash
PYTHONPATH=.:$PYTHONPATH PYTHONHASHSEED=0 SPARK_YARN_USER_ENV="PYTHONHASHSEED=0" PYSPARK_PYTHON=/usr/local/bin/python3.5 SPARK_MAJOR_VERSION=2 LUIGI_CONFIG_PATH=/home/sfalk/workspaces/theproject/python/luigi-cluster.cfg /usr/local/bin/python3.5 themodule/cluster/pipeline/run_luigi.py --local-scheduler MainTask
Let's not forget that this works locally.
It's probably something simple, but I just can't see what I'm missing.