PySpark: ModuleNotFoundError: No module named 'app'

Date: 2019-07-05 10:49:25

Tags: apache-spark pyspark

I am saving a dataframe to a CSV file in PySpark with the following statement:

    df_all.repartition(1).write.csv("xyz.csv", header=True, mode='overwrite')

But I get the following error:

    Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
      File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 218, in main
        func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
      File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 138, in read_udfs
        arg_offsets, udf = read_single_udf(pickleSer, infile, eval_type)
      File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 118, in read_single_udf
        f, return_type = read_command(pickleSer, infile)
      File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 58, in read_command
        command = serializer._read_with_length(file)
      File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 170, in _read_with_length
        return self.loads(obj)
      File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 559, in loads
        return pickle.loads(obj, encoding=encoding)
    ModuleNotFoundError: No module named 'app'

I am using PySpark version 2.3.0.

The error occurs when I try to write the file. The relevant code:

    import json, jsonschema
    from pyspark.sql import functions
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType, StringType, FloatType
    from datetime import datetime
    import os

    feb = self.filter_data(self.SRC_DIR + "tl_feb19.csv", 13)
    apr = self.filter_data(self.SRC_DIR + "tl_apr19.csv", 15)

    df_all = feb.union(apr)
    df_all = df_all.dropDuplicates(subset=["PRIMARY_ID"])

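    # create_emi_amount is defined elsewhere in the project, not in this
    # snippet; wrapping it in udf() means Spark must ship it to the executors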
    create_emi_amount_udf = udf(create_emi_amount, FloatType())
    df_all = df_all.withColumn("EMI_Amount", create_emi_amount_udf('Sanction_Amount', 'Loan_Type'))

    df_all.write.csv(self.DST_DIR + "merged_amounts.csv", header=True, mode='overwrite')

1 Answer:

Answer 0 (score: 2):

The error is quite clear: there is no module named 'app'. Your Python code runs on the driver, but the UDF runs in the executors' Python VMs. When you call the udf, Spark serializes create_emi_amount with everything it references and sends it to the executors, where it is unpickled before it can be applied to any rows.
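
As a hypothetical illustration of that failure mode (the 'app' package and the get_interest_rate helper are assumptions about the asker's project layout, not code from the question), a UDF whose body references a driver-only module cannot be unpickled on the executors:

    # Hypothetical sketch: 'app' is a package importable only on the driver.
    from app.rates import get_interest_rate  # assumed helper inside 'app'

    def create_emi_amount(sanction_amount, loan_type):
        # The pickled function keeps a reference to app.rates by module name;
        # an executor whose Python environment cannot import 'app' raises
        # ModuleNotFoundError while deserializing, exactly as in the traceback.
        return float(sanction_amount) * get_interest_rate(loan_type)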

So, somewhere in create_emi_amount you use or import the app module. The way to solve the problem is to use the same Python environment on the driver and the executors: in spark-env.sh, point both PYSPARK_DRIVER_PYTHON=... and PYSPARK_PYTHON=... at the same Python virtualenv.
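
A minimal sketch of that spark-env.sh setting, assuming the shared virtualenv lives at /path/to/venv (the path is a placeholder):

    export PYSPARK_PYTHON=/path/to/venv/bin/python
    export PYSPARK_DRIVER_PYTHON=/path/to/venv/bin/python

Both variables are read at application startup, so restart the application after changing them. Note that the 'app' package itself must also be importable by that interpreter on every executor, for example by installing it into the virtualenv or shipping it with SparkContext.addPyFile.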