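I am running the following PySpark code on a GCP Dataproc cluster.

    from pyspark import SparkContext

    class RewriteData(object):
        def __init__(self):
            # self.conf is assumed to be set up elsewhere (not shown in this snippet)
            self.sc = SparkContext(conf=self.conf)

        def read_data(self):
            data = self.sc.textFile("gcs://test-bucket/input_data/*")
            return data

        def run(self):
            data = self.read_data()

    if __name__ == "__main__":
        obj = RewriteData()
        obj.run()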
But I get the following error:
"It appears that you are attempting to reference SparkContext from a broadcast "
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
Handling run-time error: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
Strangely, when I move the SparkContext initialization into the run method, it works. I don't understand why. Any help is appreciated.
Thanks, Manish
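
A likely cause, sketched without Spark so that it runs anywhere: when a method of the class (or a lambda that uses self) is passed to a transformation, the whole instance has to be serialized along with it, including self.sc. The sketch assumes the part of the job not shown above does exactly that. FakeContext, ContextOnSelf, ContextInRun and can_ship are hypothetical names for illustration, and the standard pickle module stands in for the serializer PySpark uses.

```python
import pickle


class FakeContext:
    """Hypothetical stand-in for SparkContext: refuses to be pickled."""

    def __getnewargs__(self):
        # SparkContext raises a similar error from this same hook.
        raise RuntimeError("SparkContext can only be used on the driver")


def can_ship(func):
    """Return True if func, and everything it drags along, can be pickled."""
    try:
        pickle.dumps(func)
        return True
    except Exception:
        return False


class ContextOnSelf:
    """Mirrors the failing layout: the context is stored on the instance."""

    def __init__(self):
        self.sc = FakeContext()

    def parse(self, line):
        return line.upper()


class ContextInRun:
    """Mirrors the working layout: the context stays a local variable."""

    def parse(self, line):
        return line.upper()

    def run(self):
        sc = FakeContext()  # local only, never becomes part of self
        return can_ship(self.parse)


# A bound method is pickled together with its instance, so self.sc comes along.
print(can_ship(ContextOnSelf().parse))
# Here the instance holds nothing unpicklable, so the method can be shipped.
print(ContextInRun().run())
```

If this is what happens in the real job, keeping the context in a local variable or passing plain module-level functions to the transformations should both avoid the error.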