Spark job runs locally but fails on EMR - can't figure out why

Date: 2016-08-27 16:39:25

Tags: apache-spark pyspark amazon-emr

For some reason, the following function in my pipeline throws an error when I run the job on EMR (using emr-5.0.0 and Spark 2.0.0):

def aggregate_user_topic_vectors(self, f=atf):
    """Aggregate each user's (topic, weight) rows into a single row.

    Args:
        f: aggregation function applied to each user's grouped
            (topic_vector, weight) tuples; defaults to atf below.

    Returns:
        self on success, None if the aggregation fails.
    """
    if not self._utv:
        raise FileNotFoundError("Data not loaded.")
    ut = self._utv
    try:
        self._utv = (ut
                     .rdd
                     # key each row by user id
                     .map(lambda x: (x.id, (x.t, x.w)))
                     .groupByKey()
                     # one output row per user: count plus aggregated vector
                     .map(lambda x: Row(id=x[0],
                                        w=len(x[1]),
                                        t=DenseVector(f(x[1]))))
                     .toDF())
        return self
    except AttributeError as e:
        logging.error(e)
    return None

atf is a very simple function:

import statistics as stats  # assumption: `stats` is the stdlib statistics module


def atf(iterable):
    """Element-wise mean of the topic vectors in `iterable`.

    Args:
        iterable: iterable of (topic_vector, weight) tuples.

    Returns:
        A list holding the mean of each vector component; the weights
        are ignored.
    """
    # Unzip into (vectors, weights), keep only the vectors, then
    # average each component across all vectors.
    return [stats.mean(t) for t in zip(*list(zip(*iterable))[0])]
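
To illustrate, a minimal hypothetical call (the input values below are made up):

pairs = [([1.0, 2.0], 5), ([3.0, 4.0], 2), ([5.0, 6.0], 1)]
atf(pairs)  # [3.0, 4.0] -- the element-wise mean of the three vectors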

I get a huge wall of errors, but here is the last part:

        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:280)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:211)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1472313936084_0003/container_1472313936084_0003_01_000002/pyspark.zip/pyspark/worker.py", line 161, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1472313936084_0003/container_1472313936084_0003_01_000002/pyspark.zip/pyspark/worker.py", line 54, in read_command   
    command = serializer._read_with_length(file)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1472313936084_0003/container_1472313936084_0003_01_000002/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
    return self.loads(obj)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1472313936084_0003/container_1472313936084_0003_01_000002/pyspark.zip/pyspark/serializers.py", line 419, in loads
    return pickle.loads(obj, encoding=encoding)
ImportError: No module named 'regression'

        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
        at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
        at org.apache.spark.scheduler.Task.run(Task.scala:85)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        ... 1 more

16/08/27 16:28:43 INFO ShutdownHookManager: Shutdown hook called
16/08/27 16:28:43 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-429a8665-405e-4a8a-9a0c-7f939020a644
16/08/27 16:28:43 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-429a8665-405e-4a8a-9a0c-7f939020a644/pyspark-41867521-9dfd-4d8f-8b13-33272063e0c3

The ImportError: No module named 'regression' message makes no sense to me, because the rest of my script runs functions from that very module, and when I remove the aggregate_user_topic_vectors function the script finishes without errors. Also, as I said before, even with aggregate_user_topic_vectors the script runs without errors on my local machine. I have set PYTHONPATH to point at my project. I really don't know where to start; any input would be greatly appreciated.
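
For reference, a PYTHONPATH set on the driver does not automatically reach the YARN executors; a dependency module is usually shipped to them explicitly. A minimal sketch, assuming sc is the active SparkContext and regression.py stands for the module named in the ImportError:

# Hypothetical: ship the module so that pickled closures referencing
# it can be unpickled on the worker nodes.
sc.addPyFile("regression.py")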

1 answer:

Answer 0 (score: 0)

Well, as I suspected, my problem was solved by moving away from groupByKey (which is apparently evil) to reduceByKey, so it had nothing to do with how I import my modules. The gist of the modified code is sketched below. Hope this helps someone!
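A minimal sketch of that change, assuming the aggregation f is the element-wise mean computed by atf above (atf ignores the weights x.w, so they are dropped here; the actual posted code may have differed):

self._utv = (ut
             .rdd
             # seed each record with a running (vector_sum, count) pair
             .map(lambda x: (x.id, (list(x.t), 1)))
             # merge pairs per id: add the vectors element-wise, add the counts
             .reduceByKey(lambda a, b: ([s + t for s, t in zip(a[0], b[0])],
                                        a[1] + b[1]))
             # divide the summed vector by the count to get the mean
             .map(lambda x: Row(id=x[0],
                                w=x[1][1],
                                t=DenseVector([s / x[1][1] for s in x[1][0]])))
             .toDF())

Unlike groupByKey, reduceByKey combines values map-side before the shuffle, so no per-user list of tuples ever has to be materialized on a single executor.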