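This question is about implementing a concept similar to the one in this post. Essentially, I want to create a boosted decision tree regressor by building a decision tree regressor (DTR) and then using its leaf nodes as input to a linear regression model. Using a pipeline and cross-validation, I can create and train the decision tree regressor, but I am not sure how to use the DTR leaf nodes as input to the linear regression model, as shown below. Any help would be greatly appreciated.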
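from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import DecisionTreeRegressor, LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Let's split our data into training data and testing data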
trainTest = data2.randomSplit([0.8, 0.2])
trainingDF = trainTest[0]
testDF = trainTest[1]
assembler = VectorAssembler(
    inputCols=['passenger_count', 'store_and_fwd_flag', 'pickup_day', 'dropoff_day',
               'pickup_month', 'dropoff_month', 'pickup_hour', 'dropoff_hour', 'distance'],
    outputCol='features'
)
# Now create our decision tree and linear regression models
dtr = DecisionTreeRegressor(featuresCol="features", labelCol="trip_duration", predictionCol="prediction1")
lir = LinearRegression(featuresCol='features', labelCol='prediction1', predictionCol='prediction2')
# Create Evaluator
prmse = RegressionEvaluator(labelCol="trip_duration", predictionCol="prediction2", metricName="rmse")
pipeline = Pipeline(stages=[assembler, dtr, lir])
paramGrid = ParamGridBuilder() \
    .addGrid(dtr.maxDepth, [10]) \
    .addGrid(lir.maxIter, [10]) \
    .addGrid(lir.regParam, [.01, .1, 1]) \
    .addGrid(lir.elasticNetParam, [.5, .75, 1]) \
    .build()
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=prmse, numFolds=10)
# Apply cross validation to the training data and generate a model
model = cv.fit(trainingDF)
predictions = model.transform(testDF).cache()