PySpark: define a custom Transformer that takes the fitted model from the previous step as input

Asked: 2018-11-09 15:12:27

Tags: python apache-spark machine-learning pyspark

I am currently working with the following data:

https://github.com/apache/spark/blob/master/data/mllib/iris_libsvm.txt

What I want is to modify the OneVsRest model so that it predicts every class whose probability exceeds some threshold (multi-label prediction), instead of predicting only the single class with the highest probability.

So far I have been able to implement this, but not inside a Pipeline.

from pyspark.ml.classification import RandomForestClassifier, OneVsRest
from pyspark.mllib.evaluation import MultilabelMetrics

from pyspark.sql.types import DoubleType, IntegerType, ArrayType
from pyspark.sql.functions import lit, udf, row_number, col, array
from pyspark.sql.window import Window

# load the data file (`spark` is the SparkSession, e.g. from the pyspark shell).
inputData = spark.read.format("libsvm") \
    .load("iris_libsvm.txt")

# generate the train/test split.
(train, test) = inputData.randomSplit([0.8, 0.2])

# instantiate the base classifier.
rf = RandomForestClassifier()

# instantiate the One Vs Rest Classifier.
ovr = OneVsRest(classifier=rf)

# train the multiclass model.
ovrModel = ovr.fit(train)

# score the model on test data. (default scoring)
predictions = ovrModel.transform(test)

Once the whole model is fit, I can access the per-class binary models through ovrModel.models, loop over them to compute, for every observation in the test set, the probability of each class, and then select the classes whose probability is greater than a threshold pr.

# Extract element i of an ML vector as a float (None when the index is invalid).
# Out-of-range indices raise IndexError, so catch that alongside ValueError.
def ith_(v, i):
    try:
        return float(v[i])
    except (ValueError, IndexError):
        return None

ith = udf(ith_, DoubleType())
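Since ML vectors index like ordinary Python sequences, the extraction logic behind the UDF can be checked in plain Python (this standalone copy also catches IndexError for out-of-range positions, a slight broadening of the original):

```python
# Plain-Python check of the element-extraction logic wrapped by the ith UDF:
# return position i as a float, or None when the index is invalid.
def ith_(v, i):
    try:
        return float(v[i])
    except (ValueError, IndexError):
        return None

ith_([0.3, 0.7], 1)   # -> 0.7
ith_([0.3, 0.7], 5)   # -> None (index out of range)
```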

# Loop over the binary models, scoring the test set to get per-class probabilities.
# pr0 is the column holding the probability of class 0,
# pr1 the probability of class 1, and so on.


for i in range(len(ovrModel.models)):
    pr = ovrModel.models[i].transform(test)

    # Align the two dataframes on a synthetic row number so they can be joined.
    # NB: a Window with no partitionBy pulls all rows into one partition, and
    # ordering by "features" is only reliable if feature vectors are unique.
    w = Window().orderBy("features")
    pr = pr.withColumn("row_number", row_number().over(w))
    predictions = predictions.withColumn("row_number", row_number().over(w))

    # probability[1] is the positive-class probability of binary model i.
    pr = pr.select("row_number", ith("probability", lit(1)).alias("pr" + str(i)))

    predictions = predictions.join(pr, "row_number")
    predictions = predictions.drop("row_number")

# Collect the probability columns pr0, pr1, ... (dropping the "prediction"
# column, which also starts with "pr").
cols = [c for c in predictions.columns if c.startswith("pr")]
cols.remove("prediction")

# Once probabilities are calculated, I use a 0.5 threshold to determine which class(es) to predict. 
# I store result in a column preds of type ArrayType(DoubleType())

threshold = 0.5

for c in cols:
    predictions = predictions.withColumn(c, (col(c) >= threshold).cast('int'))

def index_(v):
    # Return the positions of the 1s, i.e. the classes that cleared the threshold.
    return [i for i, j in enumerate(v) if j == 1]

index = udf(index_, ArrayType(IntegerType()))

# Gather the binarized columns into a single array column and extract the
# predicted classes from it.
predictions = predictions.withColumn('preds', index(array(*cols)))
predictions = predictions.withColumn("preds", predictions.preds.cast("array<double>"))

# Transform the type of label column to ArrayType(DoubleType()) as well
predictions = predictions.withColumn("label", array("label"))
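Taken together, the thresholding and index-extraction steps above amount to the following per-row logic (a plain-Python sketch; `preds_for_row` is just an illustrative name):

```python
# Plain-Python sketch of the per-row post-processing: binarize each class
# probability at the threshold, then keep the indices of the 1s as doubles.
def preds_for_row(class_probs, threshold=0.5):
    flags = [int(p >= threshold) for p in class_probs]    # the cast('int') step
    return [float(i) for i, f in enumerate(flags) if f]   # the index_ step

# e.g. probabilities for classes 0, 1, 2:
preds_for_row([0.9, 0.2, 0.7])  # -> [0.0, 2.0]
preds_for_row([0.1, 0.2])       # -> [] (no class clears the threshold)
```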

The final dataframe should look like this:

[screenshot of the resulting dataframe]

Once that is done, I can compute metrics such as accuracy using Spark's predefined multi-label metrics.

# MultilabelMetrics expects an RDD of (predictions, labels) pairs, in that order.
scoresAndLabels = predictions.select("preds", "label").rdd
metrics = MultilabelMetrics(scoresAndLabels)

# Summary stats
print("Recall = %s" % metrics.recall())
print("Precision = %s" % metrics.precision())
print("F1 measure = %s" % metrics.f1Measure())
print("Accuracy = %s" % metrics.accuracy)
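For reference, these no-argument MultilabelMetrics scores are document-based: each row gets its own score and the scores are averaged over all rows. Assuming those definitions, the numbers can be reproduced in plain Python (`multilabel_metrics` is an illustrative helper, not a Spark API):

```python
# Plain-Python reproduction of document-averaged multi-label metrics:
# per row, score the overlap of predicted and true label sets, then average.
def multilabel_metrics(preds_and_labels):
    n = len(preds_and_labels)
    precision = recall = f1 = accuracy = 0.0
    for preds, labels in preds_and_labels:
        p, l = set(preds), set(labels)
        inter = len(p & l)
        precision += inter / len(p) if p else 0.0
        recall += inter / len(l) if l else 0.0
        f1 += 2.0 * inter / (len(p) + len(l)) if (p or l) else 0.0
        accuracy += inter / len(p | l) if (p | l) else 0.0
    return {"precision": precision / n, "recall": recall / n,
            "f1": f1 / n, "accuracy": accuracy / n}

m = multilabel_metrics([([0.0, 1.0], [0.0, 2.0]),  # one of two predictions correct
                        ([0.0], [0.0])])           # exact match
# m["precision"] -> 0.75, m["accuracy"] -> 2/3
```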

My question is: how can I wrap all the steps that come after ovrModel = ovr.fit(train) in a custom Transformer, so that I can use it in a Pipeline and later cross-validate it?

Thanks in advance.

0 Answers:

No answers yet.