I am saving a DataFrame to a CSV file in PySpark with the following statement:
df_all.repartition(1).write.csv("xyz.csv", header=True, mode='overwrite')
but I am getting this error:
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 218, in main
func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 138, in read_udfs
arg_offsets, udf = read_single_udf(pickleSer, infile, eval_type)
File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 118, in read_single_udf
f, return_type = read_command(pickleSer, infile)
File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 58, in read_command
command = serializer._read_with_length(file)
File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 170, in _read_with_length
return self.loads(obj)
File "/opt/spark-2.3.0-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 559, in loads
return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'app'
I am using PySpark version 2.3.0. The error occurs when trying to write the file.
import json, jsonschema
from pyspark.sql import functions
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType, StringType, FloatType
from datetime import datetime
import os
feb = self.filter_data(self.SRC_DIR + "tl_feb19.csv", 13)
apr = self.filter_data(self.SRC_DIR + "tl_apr19.csv", 15)
df_all = feb.union(apr)
df_all = df_all.dropDuplicates(subset=["PRIMARY_ID"])
create_emi_amount_udf = udf(create_emi_amount, FloatType())
df_all = df_all.withColumn("EMI_Amount", create_emi_amount_udf('Sanction_Amount', 'Loan_Type'))
df_all.write.csv(self.DST_DIR + "merged_amounts.csv", header=True, mode='overwrite')
Answer 0 (score: 2)
The error is quite clear: there is no module 'app'. Your Python code runs on the driver, but the udf runs on the executors' Python workers. When you call the udf, Spark serializes create_emi_amount and sends it to the executors. So somewhere inside create_emi_amount you use or import the 'app' module.
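To make the mechanism concrete, here is a hypothetical sketch of the pattern that produces this traceback. The import of the app package and the body of create_emi_amount are assumptions for illustration only; the point is that the pickled function carries a reference to a driver-only module.

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
# 'app' stands in for a package that is installed only on the driver machine
from app.rates import monthly_rate_for

def create_emi_amount(sanction_amount, loan_type):
    # The closure references monthly_rate_for from the 'app' package, so pickling
    # the function ships that reference to the executors, where 'app' does not exist.
    return float(sanction_amount) * monthly_rate_for(loan_type)

create_emi_amount_udf = udf(create_emi_amount, FloatType())
# Evaluating the UDF on the executors then fails with:
# ModuleNotFoundError: No module named 'app'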
The way to solve the problem is to use the same environment on the driver and the executors. In spark-env.sh, point both PYSPARK_DRIVER_PYTHON=... and PYSPARK_PYTHON=... at the same Python virtualenv.
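As a minimal sketch (the virtualenv path is an assumption, not part of the answer), the relevant lines in conf/spark-env.sh would look like:

# spark-env.sh: point the driver and the executors at the same Python environment
# (/opt/myvenv is a hypothetical path; use a virtualenv that has the 'app' package installed)
export PYSPARK_DRIVER_PYTHON=/opt/myvenv/bin/python
export PYSPARK_PYTHON=/opt/myvenv/bin/python

The same environment must exist at that path on every worker node; otherwise the executors still cannot import the module when they unpickle the UDF.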