Question

I get the following error when submitting this Spark job on EMR:

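_pickle.PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.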

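I have already copied all the class attributes into local variables at run time, but I still can't find a solution.

I can't change the library function that calls this code so that it passes the Spark context as an input or uses a separate Spark session, so my inputs are constrained.

Can anyone help?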

from pyspark.sql import SparkSession

from Somewhere import (
    TEMPLATE_SCHEMA,
    to_template_row,
)

class SomeThing:
    def __init__(
        self,
        spark: SparkSession,
        train_path: str,
        model_type: str,
        o_path: str,
    ):
        self.spark = spark
        self.train_path = train_path
        self.train_rdd = spark.read.option("sep", "\t").csv(train_path).rdd
        self.model_type = model_type
        self.o_path = o_path

    def run(self):
        l_spark = self.spark
        l_model_type = self.model_type
        l_train_rdd = self.train_rdd
        l_o_path = self.o_path

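        # Build the RDD of template rows: keep rows of the requested model
        # type, take column 3, de-duplicate, then convert each value.
        rows = (
            l_train_rdd
            .filter(lambda l: l[2] == str(l_model_type))
            .map(lambda l: l[3])
            .distinct()
            .map(to_template_row)
        )

        # Convert to a DataFrame on the driver and write it out as Parquet.
        l_spark.createDataFrame(data=rows, schema=TEMPLATE_SCHEMA).write.parquet(l_o_path)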
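
One way to narrow this down might be to inspect what each function passed to the RDD actually captures. The sketch below uses only the standard library (`inspect.getclosurevars`) and does not need Spark. `FakeSession`, `Job`, `captured`, `GLOBAL_SESSION` and `to_row_with_global` are hypothetical names made up for illustration; they are not part of the code above or of PySpark. It shows that a lambda reading `self.model_type` captures the whole `self`, a lambda reading a local copy captures only the string, and a function can also reach a session-like object through a module-level global. Whether something like that is what happens inside `to_template_row` is only a guess.

```python
import inspect


class FakeSession:
    """Hypothetical stand-in for SparkSession; not a real Spark object."""


class Job:
    def __init__(self, session, model_type):
        self.session = session
        self.model_type = model_type

    def filter_using_self(self):
        # The lambda reads self.model_type, so `self` (and with it
        # self.session) becomes part of the closure.
        return lambda row: row[2] == str(self.model_type)

    def filter_using_local(self):
        # Copying the attribute first means only the string is captured.
        model_type = self.model_type
        return lambda row: row[2] == str(model_type)


def captured(fn):
    """Return the enclosing-scope and module-level names that fn refers to."""
    cv = inspect.getclosurevars(fn)
    return {"nonlocals": dict(cv.nonlocals), "globals": dict(cv.globals)}


GLOBAL_SESSION = FakeSession()  # hypothetical module-level handle


def to_row_with_global(value):
    # Refers to a module-level object, which shows up under "globals".
    return (value, GLOBAL_SESSION is not None)


job = Job(FakeSession(), "a")
print(sorted(captured(job.filter_using_self())["nonlocals"]))   # ['self']
print(sorted(captured(job.filter_using_local())["nonlocals"]))  # ['model_type']
print("GLOBAL_SESSION" in captured(to_row_with_global)["globals"])  # True
```

Running `captured(...)` on `to_template_row` and on each lambda in `run` might show whether any of them still reaches a `SparkSession`, `SparkContext`, DataFrame or RDD, either through a closure or through a module-level name.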

Spark: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation

0 answers: