I am trying to implement the decision tree regression algorithm on some training data, but I get an error when I call fit().
The error is produced by the following code:
(trainingData, testData) = data.randomSplit([0.7, 0.3])
vecAssembler = VectorAssembler(inputCols=["_1", "_2", "_3", "_4", "_5", "_6", "_7", "_8", "_9", "_10"], outputCol="features")
dt = DecisionTreeRegressor(featuresCol="features", labelCol="_11")
dt_model = dt.fit(trainingData)
However, the data structure is exactly the same.
Answer 0 (score: 0)
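You are missing two steps: 1. the transform step, and 2. selecting the features and label from the transformed data. I am assuming the data contains only numeric columns, i.e. no categorical data. Below is the general flow for training a model with pyspark.ml.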
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import DecisionTreeRegressor
# data processing part
# '...' stands for the remaining column names; list all ten explicitly
vecAssembler = VectorAssembler(inputCols=['col_1','col_2',...,'col_10'], outputCol='features')
# you missed these two steps
trans_data = vecAssembler.transform(data)
final_data = trans_data.select('features', 'col_11')  # your label column name is col_11
train_data, test_data = final_data.randomSplit([0.7, 0.3])
# ml part (a regressor, since the question is about regression)
dt = DecisionTreeRegressor(featuresCol='features', labelCol='col_11')
dt_model = dt.fit(train_data)
dt_predictions = dt_model.transform(test_data)
# proceed with the model evaluation part after this