I am currently working with the following data:
https://github.com/apache/spark/blob/master/data/mllib/iris_libsvm.txt
What I want is to change the OneVsRest model so that it predicts every class whose probability is greater than some threshold (multi-label prediction), rather than only the class with the highest probability.
So far I have managed to implement this, but not inside a Pipeline.
from pyspark.ml.classification import RandomForestClassifier, OneVsRest
from pyspark.mllib.evaluation import MultilabelMetrics
from pyspark.sql.types import DoubleType, IntegerType, ArrayType
from pyspark.sql.functions import lit, udf, row_number, col, array
from pyspark.sql.window import Window
# load data file.
inputData = spark.read.format("libsvm") \
    .load("iris_libsvm.txt")
# generate the train/test split.
(train, test) = inputData.randomSplit([0.8, 0.2])
# instantiate the base classifier.
rf = RandomForestClassifier()
# instantiate the One Vs Rest Classifier.
ovr = OneVsRest(classifier=rf)
# train the multiclass model.
ovrModel = ovr.fit(train)
# score the model on test data. (default scoring)
predictions = ovrModel.transform(test)
Once the full model is fitted, I can access all of the per-class models through ovrModel.models, loop over them to compute, for every observation in the test set, the probability of each class, and then select the classes whose probability exceeds the threshold.
# UDF that extracts the i-th element of a probability Vector as a double.
def ith_(v, i):
    try:
        return float(v[i])
    except ValueError:
        return None

ith = udf(ith_, DoubleType())
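As an aside, on Spark 3.0 or later this UDF can be replaced by the built-in pyspark.ml.functions.vector_to_array (df and p1 below are just placeholder names):

from pyspark.ml.functions import vector_to_array

# Given any dataframe df with a Vector column "probability", element 1 can
# be read directly, without a Python UDF:
df = df.withColumn("p1", vector_to_array("probability")[1])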
# Looping over each model over the test set and calculating the probability of each class.
# pr0 is the column that represents the probability of class 0
# pr1 is the column that represents the probability of class 1
# ...
for i in range(len(ovrModel.models)):
    pr = ovrModel.models[i].transform(test)
    # Attach a deterministic row number to both dataframes so they can be
    # joined back together. Note: an unpartitioned Window moves all rows
    # into a single partition.
    w = Window.orderBy("features")
    pr = pr.withColumn("row_number", row_number().over(w))
    predictions = predictions.withColumn("row_number", row_number().over(w))
    # Keep only the positive-class probability (index 1) of model i.
    pr = pr.select("row_number", ith("probability", lit(1)).alias("pr" + str(i)))
    predictions = predictions.join(pr, "row_number")
predictions = predictions.drop("row_number")
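Because the Window above has no partitionBy, this also relies on "features" being a sortable, unique key. A more robust way to get a stable join key might be RDD.zipWithIndex; here is a hypothetical helper (the name with_row_number is mine, it uses the existing spark session, and it assumes both dataframes are derived deterministically from the same test data):

from pyspark.sql.types import LongType, StructField, StructType

def with_row_number(df):
    # Append a 0-based "row_number" column via zipWithIndex, which needs
    # neither sorting nor a single-partition shuffle.
    schema = StructType(df.schema.fields + [StructField("row_number", LongType(), False)])
    indexed = df.rdd.zipWithIndex().map(lambda pair: tuple(pair[0]) + (pair[1],))
    return spark.createDataFrame(indexed, schema)

# e.g. pr = with_row_number(ovrModel.models[i].transform(test))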
# Collect the per-class probability columns pr0, pr1, ...
# ("prediction" also starts with "pr", so drop it from the list).
cols = predictions.columns
cols = [elt for elt in cols if elt.startswith("pr")]
cols.remove("prediction")
# Once probabilities are calculated, I use a 0.5 threshold to determine which class(es) to predict.
# I store result in a column preds of type ArrayType(DoubleType())
threshold = 0.5
for c in cols:
    predictions = predictions.withColumn(c, (col(c) >= threshold).cast('int'))
# UDF that turns a 0/1 indicator array into the list of class indices
# that were predicted.
def index_(v):
    l = []
    for i, j in enumerate(v):
        if j == 1:
            l.append(i)
    return l

index = udf(index_, ArrayType(IntegerType()))
predictions = predictions.withColumn('preds', index(array(*cols)))
predictions = predictions.withColumn("preds", predictions.preds.cast("array<double>"))
# Transform the type of label column to ArrayType(DoubleType()) as well
predictions = predictions.withColumn("label", array("label"))
The final dataframe then has both label and preds as array&lt;double&gt; columns.
Once this is done, I can compute metrics such as accuracy using Spark's predefined multilabel metrics:
# MultilabelMetrics expects an RDD of (predictions, labels) pairs.
predictionAndLabels = predictions.select("preds", "label").rdd
metrics = MultilabelMetrics(predictionAndLabels)
# Summary stats
print("Recall = %s" % metrics.recall())
print("Precision = %s" % metrics.precision())
print("F1 measure = %s" % metrics.f1Measure())
print("Accuracy = %s" % metrics.accuracy)
My question is: how can I implement all the steps that come after ovrModel = ovr.fit(train) in a custom transformer, so that I can later use it inside a Pipeline and cross-validate it?
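To make the question concrete, here is a minimal sketch of what I imagine: an Estimator/Model pair that wraps OneVsRest and moves the thresholding into _transform. The class names are hypothetical, plain attributes stand in for proper Params, and persistence / param grids would need extra work:

from pyspark.ml import Estimator, Model
from pyspark.ml.classification import OneVsRest
from pyspark.sql.functions import array, col, lit, udf
from pyspark.sql.types import ArrayType, DoubleType

class ThresholdedOneVsRest(Estimator):
    # Hypothetical estimator: fits OneVsRest, returns a thresholding model.
    def __init__(self, classifier=None, threshold=0.5):
        super(ThresholdedOneVsRest, self).__init__()
        self.classifier = classifier
        self.threshold = threshold

    def _fit(self, dataset):
        ovrModel = OneVsRest(classifier=self.classifier).fit(dataset)
        return ThresholdedOneVsRestModel(ovrModel, self.threshold)

class ThresholdedOneVsRestModel(Model):
    # Hypothetical model: predicts every class with probability >= threshold.
    def __init__(self, ovrModel=None, threshold=0.5):
        super(ThresholdedOneVsRestModel, self).__init__()
        self.ovrModel = ovrModel
        self.threshold = threshold

    def _transform(self, dataset):
        ith = udf(lambda v, i: float(v[i]), DoubleType())
        out, probCols = dataset, []
        # Chain the binary sub-models on the same dataframe, keeping only
        # each one's positive-class probability; this sidesteps the
        # row_number join used above.
        for i, m in enumerate(self.ovrModel.models):
            c = "pr" + str(i)
            out = (m.transform(out)
                   .withColumn(c, ith(col("probability"), lit(1)))
                   .drop("rawPrediction", "probability", "prediction"))
            probCols.append(c)
        t = self.threshold  # bind locally so the UDF does not capture self
        to_preds = udf(lambda ps: [float(i) for i, p in enumerate(ps) if p >= t],
                       ArrayType(DoubleType()))
        return out.withColumn("preds", to_preds(array(*probCols))).drop(*probCols)

With something like this, ThresholdedOneVsRest(classifier=rf).fit(train).transform(test) should produce the preds column directly, and the estimator could slot into a Pipeline. For CrossValidator I would presumably also need an Evaluator; the RDD-based MultilabelMetrics cannot be plugged in directly, but Spark 3.0+ has pyspark.ml.evaluation.MultilabelClassificationEvaluator.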
Thanks in advance.