PySpark: attempting to reference SparkContext from a broadcast variable

Asked: 2020-04-10 21:25:26

Tags: apache-spark pyspark

I am trying to use a pandas UDF to score data (predict probabilities) with a pickled XGBoost model. My code is wrapped inside a class. I get the error message below and I do not know how to resolve it. Without the class there is no error.

Error:

"It appears that you are attempting to reference SparkContext from a broadcast "
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
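
For reference, SPARK-5063 is raised whenever a function that Spark ships to the executors closes over the SparkContext itself, or over an object that holds a reference to it (an RDD, a DataFrame, or, as seems to be the case here, a class instance whose attribute self.df is a DataFrame). A minimal, hypothetical illustration of the same error (not taken from the code below):

# The lambda closes over `sc`, so serializing it for the workers raises
# the same SPARK-5063 exception.
rdd = sc.parallelize([1, 2, 3])
broken = rdd.map(lambda x: sc.parallelize([x])).collect()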

Code:

import findspark
from pyspark import SparkContext, SparkConf 
from pyspark.sql import HiveContext, SQLContext
import os

findspark.init('/opt/cloudera/parcels/SPARK2/lib/spark2')
os.environ['JAVA_HOME'] = "/usr/java/jdk1.8.0_151-cloudera"
os.environ['PYSPARK_PYTHON'] = "/opt/anaconda3/bin/python"
conf = SparkConf()
conf.set("spark.sql.execution.arrow.enabled", "true")  # enable Arrow for pandas UDFs
sc = SparkContext(conf=conf)

sqlContext = SQLContext(sc)
hc = HiveContext(sc)

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.ml.feature import *

import pandas as pd 
import numpy as np
import datetime
import pickle
import warnings
warnings.filterwarnings('ignore')

class Score:

    def __init__(self, df, model, feat):
        self.df = df
        self.model = model
        self.feat = feat

    def calculateProbability(self):
        self.df.registerTempTable('df')
        self.df = hc.sql("""
        SELECT *, row_id % 10 AS partition_id
        FROM (SELECT *, row_number() over (ORDER BY rand()) AS row_id FROM df)
        """)

        schema = StructType([StructField('X', IntegerType(), True),
                     StructField('Y', IntegerType(), True),
                     StructField('Z', IntegerType(), True),
                     StructField('SCORE', DoubleType(), True)])

        @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
        def scoreData(df):
            prediction = self.model.predict_proba(df[self.feat], ntree_limit=self.model.best_ntree_limit)[:, 1]
            col1 = df['X']
            col2 = df['Y']
            col3 = df['Z']
            col4 = prediction
            return pd.DataFrame({'X': col1, 'Y': col2, 'Z': col3, 'SCORE': col4})

        self.df_scored = self.df.groupby('partition_id').apply(scoreData)

model is a pickle file and feat is the list of model features.
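
For context, the class is driven roughly like this (a hypothetical sketch; the pickle path, table name, and feature names below are placeholders, not from the original code):

# Hypothetical driver-side usage; the names are made up for illustration.
with open('/path/to/xgb_model.pkl', 'rb') as f:
    model = pickle.load(f)

feat = ['F1', 'F2', 'F3']                    # model feature columns
df = hc.table('some_database.some_table')    # input data containing those columns

scorer = Score(df, model, feat)
scorer.calculateProbability()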

Please point out where I am going wrong and how to fix it. Thank you.
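
One workaround often suggested for SPARK-5063 (shown here as an untested sketch that reuses the names from the code above) is to copy self.model and self.feat into plain local variables before defining the UDF, so that the closure no longer captures self, and with it self.df, which holds a reference to the SparkContext:

    def calculateProbability(self):
        self.df.registerTempTable('df')
        self.df = hc.sql("""
        SELECT *, row_id % 10 AS partition_id
        FROM (SELECT *, row_number() over (ORDER BY rand()) AS row_id FROM df)
        """)

        # Plain Python objects are safe to ship to the executors.
        model = self.model
        feat = self.feat

        schema = StructType([StructField('X', IntegerType(), True),
                             StructField('Y', IntegerType(), True),
                             StructField('Z', IntegerType(), True),
                             StructField('SCORE', DoubleType(), True)])

        @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
        def scoreData(pdf):
            prediction = model.predict_proba(pdf[feat], ntree_limit=model.best_ntree_limit)[:, 1]
            return pd.DataFrame({'X': pdf['X'], 'Y': pdf['Y'], 'Z': pdf['Z'], 'SCORE': prediction})

        self.df_scored = self.df.groupby('partition_id').apply(scoreData)

If the model is large, it could also be broadcast once from the driver (for example bcast = sc.broadcast(model)) and referenced inside the UDF as bcast.value; the key point either way is that the UDF closure must not reference self or any Spark object such as a DataFrame or the SparkContext.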

0 Answers:

There are no answers yet.