I am testing existing code with Python 3.6, but some UDFs that used to work with Python 2.7 no longer work as-is, and I cannot figure out where the problem lies. Has anyone faced a similar issue, either locally or in distributed mode? It looks similar to https://github.com/mlflow/mlflow/issues/797
Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 202, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 219, in main
func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 139, in read_udfs
arg_offsets, udf = read_single_udf(pickleSer, infile, eval_type)
File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 119, in read_single_udf
f, return_type = read_command(pickleSer, infile)
File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 59, in read_command
command = serializer._read_with_length(file)
File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/lib/pyspark.zip/pyspark/serializers.py", line 170, in _read_with_length
return self.loads(obj)
File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/lib/pyspark.zip/pyspark/serializers.py", line 559, in loads
return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'project'
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:83)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:66)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1126)
at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1132)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:213)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.writeIteratorToStream(PythonUDFRunner.scala:52)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:215)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1992)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:170)
Driver stacktrace:
Answer 0 (score: 1)
1. My project has nested sub-packages:
pkg
  subpkg1
    subpkg2
      .py
2. From my Main.py I call a UDF, which in turn calls a function in a subpkg2 (.py) file.
3. Because of the deeply nested functions and UDFs that call into many other functions, the Spark job somehow could not find the subpkg2 files.
Solution:
Create an egg file of the pkg and ship it via --py-files (a sketch of that step is shown below).
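A minimal sketch of that packaging step, assuming a setuptools setup.py exists for pkg (the file names and paths below are illustrative, not taken from the answer):

# Build the egg from the project root (requires setuptools):
#   python setup.py bdist_egg        # produces e.g. dist/pkg-0.1-py3.6.egg
#
# Ship it at submit time:
#   spark-submit --py-files dist/pkg-0.1-py3.6.egg Main.py
#
# or, equivalently, add it from inside the driver code:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("egg-packaging-sketch").getOrCreate()
spark.sparkContext.addPyFile("dist/pkg-0.1-py3.6.egg")  # hypothetical egg name

# After this, `from pkg.subpkg1.subpkg2 import file1` should resolve on the
# executors as well as on the driver.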
Answer 1 (score: 1)
I ran into a similar situation, and @Avinash's answer worked for me. If the sub-package is nested under other packages and the sub-package is referenced directly in the code, I had to create a separate zip file for the sub-package module (subpkg2 in this case) and add it to the Spark context using addPyFile.
scripts
|__ analysis.py
pkg
|__ __init__.py
|__ subpkg1
    |__ __init__.py
    |__ subpkg2
        |__ __init__.py
        |__ file1.py
#########################
## scripts/analysis.py ##
#########################
import os
import sys

from pyspark.sql import SparkSession

# Add pkg to path
path = os.path.join(os.path.dirname(__file__), os.pardir)
sys.path.append(path)

# Sub-package referenced directly
from subpkg2 import file1

...
...

spark = (
    SparkSession
    .builder
    .master("local[*]")
    .appName("some app")
    .getOrCreate()
)

# Need to add this, else references to the sub-package do not work inside UDFs
spark.sparkContext.addPyFile("subpkg2.zip")

...
...

# Some code here that uses Pandas UDF with PySpark
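The answer does not show how subpkg2.zip is built. One way to produce it, a sketch assuming the directory layout shown earlier, is with shutil.make_archive; the archive must contain the subpkg2/ directory at its top level so that `from subpkg2 import file1` resolves on the executors:

import shutil

# Creates subpkg2.zip in the current directory, with subpkg2/ as its top-level entry.
shutil.make_archive(
    base_name="subpkg2",      # output file name (".zip" is appended automatically)
    format="zip",
    root_dir="pkg/subpkg1",   # directory that contains subpkg2/
    base_dir="subpkg2",       # include only the subpkg2 package
)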
I also noticed that in Cloudera Data Science Workbench (I am not sure whether this is a general finding or specific to CDSW), if subpkg2 sits at the root level (i.e. it is a top-level package rather than a sub-package, not nested under pkg and subpkg1), then I do not have to zip subpkg2 and the UDF is able to recognize all the custom modules directly. I am not sure why this is the case; I am still looking for an answer to it.
scripts
|__ analysis.py
subpkg2
|__ __init__.py
|__ file1.py
#########################
## scripts/analysis.py ##
#########################

# Everything is the same as in the original example, except that there is
# no need to specify this line. For some reason, UDFs recognize module
# references at the top level but not sub-module references.
# spark.sparkContext.addPyFile("subpkg2.zip")
This led me to a final round of debugging on the original example. If we change the references in the file to start with pkg.subpkg1, then we do not have to pass subpkg2.zip to the Spark context.
#########################
## scripts/analysis.py ##
#########################
import os
import sys

from pyspark.sql import SparkSession

# Add pkg to path
path = os.path.join(os.path.dirname(__file__), os.pardir)
sys.path.append(path)

# Specify the full path here
from pkg.subpkg1.subpkg2 import file1

...
...

spark = (
    SparkSession
    .builder
    .master("local[*]")
    .appName("some app")
    .getOrCreate()
)

# No need to add the zip file anymore, since the imports now use the full package path
# spark.sparkContext.addPyFile("subpkg2.zip")

...
...

# Some code here that uses Pandas UDF with PySpark
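For completeness, here is a minimal sketch of what the "some code here" part could look like: a scalar Pandas UDF that calls into the nested package. file1.transform_value is a hypothetical helper function, and pyarrow must be installed for Pandas UDFs to work.

from pyspark.sql.functions import pandas_udf, PandasUDFType

from pkg.subpkg1.subpkg2 import file1

# Scalar Pandas UDF that applies the (hypothetical) helper to each value in the series.
@pandas_udf("double", PandasUDFType.SCALAR)
def transform_udf(values):
    return values.apply(file1.transform_value)

df = spark.createDataFrame([(1.0,), (2.0,)], ["value"])
df.withColumn("transformed", transform_udf("value")).show()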