PySpark custom UDF ModuleNotFoundError: No module named

Time: 2020-01-14 21:09:25

Tags: python-3.x apache-spark pyspark

I am testing existing code with Python 3.6, but some UDFs that used to work with Python 2.7 no longer work as-is, and I cannot figure out where the problem lies. Has anyone faced a similar issue, either locally or in distributed mode? It looks similar to https://github.com/mlflow/mlflow/issues/797

Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 202, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 219, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, eval_type)
  File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 139, in read_udfs
    arg_offsets, udf = read_single_udf(pickleSer, infile, eval_type)
  File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 119, in read_single_udf
    f, return_type = read_command(pickleSer, infile)
  File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 59, in read_command
    command = serializer._read_with_length(file)
  File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/lib/pyspark.zip/pyspark/serializers.py", line 170, in _read_with_length
    return self.loads(obj)
  File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809/lib/spark2/python/lib/pyspark.zip/pyspark/serializers.py", line 559, in loads
    return pickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'project'

    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
    at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:83)
    at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:66)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1126)
    at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1132)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:213)
    at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.writeIteratorToStream(PythonUDFRunner.scala:52)
    at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:215)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1992)
    at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:170)

Driver stacktrace:

2 Answers:

Answer 0 (Score: 1):

1. My project has nested sub-packages:

pkg
   subpkg1
          subpkg2
                .py

2. From my Main.py I call a UDF, which in turn calls a function in a subpkg2 (.py) file.
3. Because of the deep nesting and the UDF calling into a lot of other functions, the Spark job somehow could not find the subpkg2 files.

Solution:
Create an egg file of the pkg and ship it via --py-files.
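
A minimal sketch of that approach (the setup.py below is hypothetical and assumes each package directory contains an __init__.py; names and versions are illustrative):

# setup.py (hypothetical, placed next to pkg/)
from setuptools import setup, find_packages

setup(
    name="pkg",
    version="0.1",
    packages=find_packages(),  # picks up pkg, pkg.subpkg1, pkg.subpkg1.subpkg2
)

# Build the egg and ship it to the executors, e.g.:
#   python setup.py bdist_egg
#   spark-submit --py-files dist/pkg-0.1-py3.6.egg Main.py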

Answer 1 (Score: 1):

I ran into a similar situation, and @Avinash's answer worked for me. Because the sub-package is nested under other packages and is referenced directly in the code, I had to create a separate zip file for the sub-package module (subpkg2 in this case) and add it to the Spark context with addPyFile.

scripts
|__ analysis.py
pkg
|__ __init__.py
|__ subpkg1
    |__ __init__.py
    |__ subpkg2
        |__ __init__.py
        |__ file1.py
#########################
## scripts/analysis.py ##
#########################

import os
import sys

from pyspark.sql import SparkSession

# Add pkg to path
path = os.path.join(os.path.dirname(__file__), os.pardir)
sys.path.append(path)

# Sub package referenced directly
from subpkg2 import file1

...
...

spark = (
    SparkSession
    .builder
    .master("local[*]")
    .appName("some app")
    .getOrCreate()
)
# Need to add this, otherwise references to the sub-package inside the UDF do not work
spark.sparkContext.addPyFile("subpkg2.zip")

...
...

# Some code here that uses Pandas UDF with PySpark
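
The snippet above assumes a subpkg2.zip archive already exists where addPyFile can find it. A minimal sketch of building that archive from the layout shown above (the helper script name is hypothetical):

# build_zip.py (hypothetical helper, run from the project root)
import zipfile

# Zip subpkg2 so it unpacks as a top-level "subpkg2" package,
# matching the direct "from subpkg2 import file1" import used above.
with zipfile.ZipFile("subpkg2.zip", "w") as zf:
    zf.write("pkg/subpkg1/subpkg2/__init__.py", "subpkg2/__init__.py")
    zf.write("pkg/subpkg1/subpkg2/file1.py", "subpkg2/file1.py")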


I also noticed that in Cloudera Data Science Workbench (I am not sure whether this is a general finding or specific to CDSW), if subpkg2 sits at the root level (i.e. it is a top-level package rather than a sub-package, not nested inside pkg/subpkg1), then I do not have to zip subpkg2 at all and the UDF can resolve all the custom modules directly. I am not sure why that is, and I am still looking for an answer to this.

scripts
|__ analysis.py
subpkg2
|__ __init__.py
|__ file1.py
#########################
## scripts/analysis.py ##
#########################

# Everything is the same as in the original example, except that this
# line is no longer needed. For some reason, UDFs recognize module
# references at the top level but not sub-module references.

# spark.sparkContext.addPyFile("subpkg2.zip")

This led me to one final round of debugging on the original example. If we change the references in the file to start with pkg.subpkg1, then we do not have to pass subpkg2.zip to the Spark context at all.

#########################
## scripts/analysis.py ##
#########################

import os
import sys

from pyspark.sql import SparkSession

# Add pkg to path
path = os.path.join(os.path.dirname(__file__), os.pardir)
sys.path.append(path)

# Specify full path here
from pkg.subpkg1.subpkg2 import file1

...
...

spark = (
    SparkSession
    .builder
    .master("local[*]")
    .appName("some app")
    .getOrCreate()
)
# No need to add the zip file anymore, since the imports now use the full path
# spark.sparkContext.addPyFile("subpkg2.zip")

...
...

# Some code here that uses Pandas UDF with PySpark
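
For completeness, a minimal sketch of the kind of Pandas UDF the placeholder comments refer to. It reuses the spark session and the file1 import from the snippet above; file1.transform is a hypothetical function, the point being that the UDF closure references the sub-package, so the executors must be able to import it as well:

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

@pandas_udf(DoubleType())
def my_udf(values: pd.Series) -> pd.Series:
    # file1.transform is a placeholder for whatever function lives in the sub-package
    return values.apply(file1.transform)

df = spark.range(10).withColumn("out", my_udf("id"))
df.show()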