ImportError when running the pipeline (but only on the cluster)

Asked: 2018-05-09 16:43:46

Tags: pyspark luigi

I just don't get this.

Locally I can run the pipeline without any problems. When I run it on the cluster, however, it seems that the code from my local project cannot be imported:

ImportError: No module named 'themodule'
,[Ljava.lang.StackTraceElement;@282acd6d,org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/grid/3/hadoop/yarn/local/usercache/sfalk/appcache/application_1520347847754_0722/container_e48_1520347847754_0722_01_000006/pyspark.zip/pyspark/worker.py", line 159, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile)
  File "/grid/3/hadoop/yarn/local/usercache/sfalk/appcache/application_1520347847754_0722/container_e48_1520347847754_0722_01_000006/pyspark.zip/pyspark/worker.py", line 91, in read_udfs
    _, udf = read_single_udf(pickleSer, infile)
  File "/grid/3/hadoop/yarn/local/usercache/sfalk/appcache/application_1520347847754_0722/container_e48_1520347847754_0722_01_000006/pyspark.zip/pyspark/worker.py", line 78, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
  File "/grid/3/hadoop/yarn/local/usercache/sfalk/appcache/application_1520347847754_0722/container_e48_1520347847754_0722_01_000006/pyspark.zip/pyspark/worker.py", line 54, in read_command
    command = serializer._read_with_length(file)
  File "/grid/3/hadoop/yarn/local/usercache/sfalk/appcache/application_1520347847754_0722/container_e48_1520347847754_0722_01_000006/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
    return self.loads(obj)
  File "/grid/3/hadoop/yarn/local/usercache/sfalk/appcache/application_1520347847754_0722/container_e48_1520347847754_0722_01_000006/pyspark.zip/pyspark/serializers.py", line 419, in loads
    return pickle.loads(obj, encoding=encoding)
ImportError: No module named 'themodule'

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:124)
    at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:68)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:766)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:766)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
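
The traceback shows the ImportError being raised inside the PySpark worker (pyspark/worker.py) while it unpickles the UDF, i.e. on an executor rather than on the driver, and that executor's Python apparently cannot find themodule on its path. For illustration only, here is a minimal sketch of the usual way a local package is shipped to the executors; the archive path and the place where the SparkSession is created are assumptions, not taken from my actual code.

# Sketch only: assumes a SparkSession is created somewhere before the UDFs
# run (e.g. inside PreprocessRawData); "/tmp/themodule" is an illustrative path.
import shutil
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pipeline").getOrCreate()

# Build a zip whose top level is the "themodule" package and ship it to every
# executor, so that pickle.loads() on the workers can import it.
archive = shutil.make_archive("/tmp/themodule", "zip", root_dir=".", base_dir="themodule")
spark.sparkContext.addPyFile(archive)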

So.. themodule is the root package of my project, and themodule/otherstuff contains the methods and functions I have factored out.

themodule/cluster/pipeline/run_luigi.py
themodule/otherstuff/

run_luigi.py is where the pipeline is started. It looks like this:

import logging    
import luigi    
from themodule.cluster.pipeline.task import PreprocessRawData

class MainTask(luigi.Task):

    def requires(self):
        return PreprocessRawData()

    def run(self):
        logging.info('All tasks done.')
        return

    def output(self):
        return self.requires().output()


if __name__ == '__main__':
    luigi.run()
    print('All done.')

Locally -> no problem. On the cluster -> see above.

The worst part is that I can't even see where the import is failing. It shouldn't be failing at all.

I know this isn't pretty, but this is the small script I'm currently executing to launch the pipeline:

#!/bin/bash

PYTHONPATH=.:$PYTHONPATH \
PYTHONHASHSEED=0 \
SPARK_YARN_USER_ENV="PYTHONHASHSEED=0" \
PYSPARK_PYTHON=/usr/local/bin/python3.5 \
SPARK_MAJOR_VERSION=2 \
LUIGI_CONFIG_PATH=/home/sfalk/workspaces/theproject/python/luigi-cluster.cfg \
/usr/local/bin/python3.5 themodule/cluster/pipeline/run_luigi.py --local-scheduler MainTask
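
For comparison, a hedged sketch of how the same archive could be declared where the SparkSession is built instead of in this launch script; spark.submit.pyFiles is a standard Spark property, and the path is again only illustrative:

# Sketch only: point spark.submit.pyFiles at a pre-built themodule.zip so YARN
# distributes it and places it on every executor's PYTHONPATH.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("pipeline")
    .config("spark.submit.pyFiles", "/tmp/themodule.zip")  # illustrative path
    .getOrCreate()
)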

And let's not forget: this works locally.

It's probably something simple, but I just can't see what I'm missing.

0 Answers:

There are no answers yet.