How can I create a user-defined function for this pyspark code?

Date: 2019-05-16 10:35:58

Tags: pyspark

I want to implement this model in my framework, so could you please tell me how to write a user-defined function for this code?

    from pyspark.sql import SparkSession

    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName('titanic_logreg').getOrCreate()
    df = spark.read.csv('titanic.csv', inferSchema = True, header = True)

    df.show(3)

    df.printSchema()

    df.columns

    my_col = df.select([ 'Survived',
     'Pclass',
     'Sex',
     'Age',
     'SibSp',
     'Parch',
     'Fare',
     'Embarked'])

    final_data = my_col.na.drop()

    final_data.show(3)

    from pyspark.ml.feature import (VectorAssembler, StringIndexer, VectorIndexer, OneHotEncoder)

    gender_indexer = StringIndexer(inputCol = 'Sex', outputCol = 'SexIndex')

    gender_encoder = OneHotEncoder(inputCol = 'SexIndex', outputCol = 'SexVec')

    embark_indexer = StringIndexer(inputCol = 'Embarked', outputCol = 'EmbarkIndex')

    embark_encoder = OneHotEncoder(inputCol = 'EmbarkIndex', outputCol = 'EmbarkVec')

    assembler = VectorAssembler(inputCols = ['Pclass', 'SexVec', 'Age', 'SibSp', 'Fare', 'EmbarkVec'], outputCol = 'features')

    from pyspark.ml import Pipeline

    log_reg = LogisticRegression(featuresCol='features', labelCol = 'Survived')

    pipeline= Pipeline(stages= [gender_indexer, embark_indexer,
                               gender_encoder, embark_encoder,
                               assembler, log_reg])


    train, test = final_data.randomSplit([0.7, 0.3])
    fit_model = pipeline.fit(train)
    results = fit_model.transform(test)

    results.select('prediction', 'Survived').show(3)

    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    # 'eval' shadows the Python built-in, so use a descriptive name
    evaluator = BinaryClassificationEvaluator(rawPredictionCol = 'rawPrediction', labelCol = 'Survived')

    AUC = evaluator.evaluate(results)

    AUC

I would like to know whether it is possible to create a UDF for a pyspark program.

0 Answers:

No answers yet