DataFrame error when using GradientBoostingClassifier

Asked: 2020-07-15 16:36:26

Tags: pyspark model databricks

I get this error when running my code: TypeError: Cannot recognize a pipeline stage of type. It happens specifically when I fit the pipeline to the data. I think I may not be loading the CSV correctly, but I'm not sure. Here is my code:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

import pyspark.sql.functions as F
import numpy as np
from pyspark.ml import Pipeline,PipelineModel
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StringIndexer,OneHotEncoderEstimator

from pyspark.sql.types import DoubleType

df = spark.read.format("csv").option("header", "true").load("FileStore/tables/data.csv")


str_indxr = StringIndexer(inputCol="PointDiff", outputCol="label")
str_indxr = str_indxr.fit(df).transform(df)

str_indxr.columns
vec_assmblr = VectorAssembler(inputCols=['label','col1', 'col2', 'col3'], outputCol='features_norm')
splits =df.randomSplit([0.8, 0.2])
df_train = splits[0]
df_test = splits[1]
gbt = GBTClassifier(labelCol="label", featuresCol="features_norm", maxIter=10)
pip_line = Pipeline(stages=[str_indxr,vec_assmblr,gbt])
pip_line_fit = pip_line.fit(df_train)

df_tran = pip_line_fit.transform(df_test)

2 Answers:

Answer 0 (score: 0)

In the line str_indxr = str_indxr.fit(df).transform(df), you have already turned the string-indexer stage into a DataFrame. So by the time you use it in pip_line = Pipeline(stages=[str_indxr,vec_assmblr,gbt]), str_indxr is a DataFrame, not a StringIndexer stage — which is exactly what the TypeError is complaining about.

So either skip that step entirely (the Pipeline will fit the indexer for you), or assign the output of str_indxr.fit(df).transform(df) to a different name.

One more tip — you are also passing the target label into the VectorAssembler inputs. That is incorrect: the label must not be part of the feature vector.

Answer 1 (score: 0)

I think your setup should look something like this:

from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Load and parse the data file, converting it to a DataFrame.
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a GBT model.
gbt = GBTClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", maxIter=10)

# Chain indexers and GBT in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, gbt])

# Train model.  This also runs the indexers.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))

gbtModel = model.stages[2]
print(gbtModel)  # summary only

Documentation here:

https://spark.apache.org/docs/latest/ml-classification-regression.html#gradient-boosted-tree-classifier