Getting TypeError: 'DecisionTreeClassifier' object is not iterable in Spark ML

Date: 2018-02-10 11:04:46

Tags: apache-spark pyspark apache-spark-mllib

I am trying to implement a decision tree in Spark MLlib, following the Coursera course "Machine Learning With Big Data". I get the following error:

<class 'pyspark.ml.classification.DecisionTreeClassifier'>
Traceback (most recent call last):
  File "C:/sparkcourse/Pycharmproject/Decisiontree.py", line 65, in <module>
    model=modelpipeline.fit(traindata)
  File "C:\spark\python\lib\pyspark.zip\pyspark\ml\base.py", line 64, in fit
  File "C:\spark\python\lib\pyspark.zip\pyspark\ml\pipeline.py", line 93, in _fit
TypeError: 'DecisionTreeClassifier' object is not iterable

Here is the code:

from pyspark.sql import SparkSession
from pyspark.sql import DataFrameNaFunctions
#pipeline is estimator or transformer
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import Binarizer
from pyspark.ml.feature import VectorAssembler,VectorIndexer,StringIndexer

spark=SparkSession.builder.config("spark.sql.warehouse.dir", "file:///C:/temp").enableHiveSupport().getOrCreate()

weatherdata=spark.read.csv("file:///SparkCourse/daily_weather.csv",header="true",inferSchema="true")
#print(weatherdata.columns)


#for input features we explicitly take the columns

featurescolumn=['air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am', 'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am', 'rain_accumulation_9am', 'rain_duration_9am']
#print(featurescolumn)

weatherdata=weatherdata.drop("number")
#print(weatherdata.columns)

#missing value dealing
weatherdata=weatherdata.na.drop()
#print(weatherdata.count(),len(weatherdata.columns))

#create a categorical variable to denote whether humidity is not low (using the relative_humidity_3pm column): if the value is
#less than 25% the categorical value is 0, otherwise it is 1. Binarizer handles this

binarizer=Binarizer(threshold=24.99999,inputCol='relative_humidity_3pm',outputCol='low_humid')
#we transform whole weatherdata into Binarizer categorical value
binarizerDf=binarizer.transform(weatherdata)

#binarizerDf.select("relative_humidity_3pm",'low_humid').show(4)

#aggregating the features that will be used to make predictions into a single column
#The inputCols argument specifies our list of column names we defined earlier, and outputCol is the name of the new column. The second line creates a new DataFrame with the aggregated features in a column.

assembler=VectorAssembler(inputCols=featurescolumn,outputCol="features")
assembled=assembler.transform(binarizerDf)

#assembled.select("features").show(1)

#splitting into train and test data by calling randomSplit

(traindata, testdata)=assembled.randomSplit([0.80,0.20],seed=1234)
#data counting

print(traindata.count(),testdata.count())


#create decision trees  Model
#----------------------------------


#The labelCol argument is the column we are trying to predict, featuresCol specifies the aggregated features column, maxDepth is stopping criterion for tree induction based on maximum depth of tree
#minInstancesPerNode is stopping criterion for tree induction based on minimum number of samples in a node
#impurity is the impurity measure used to split nodes.

decisiontree=DecisionTreeClassifier(labelCol="label",featuresCol="features",maxDepth=5,minInstancesPerNode=20,impurity="gini")
print(type(decisiontree))

#create the model by training the decision tree; a Pipeline handles this
modelpipeline=Pipeline(stages=decisiontree)
model=modelpipeline.fit(traindata)


#predicting test data

predictions=model.transform(testdata)

#showing predictedvalue
prediction=predictions.select('prediction','label').show(5)

The course uses Spark 1.6 in a Cloudera VM, but I have integrated Spark 2.1.0 with PyCharm.

1 answer:

Answer 0 (score: 2)

stages should be a sequence of PipelineStages (Transformers or Estimators), not a single Estimator. Replace:

Pipeline(stages=decisiontree)

with:

Pipeline(stages=[decisiontree])
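
The traceback in the question ends inside Pipeline._fit, which suggests that the pipeline loops over whatever was passed as stages. The following is a minimal pure-Python sketch of that behaviour, not pyspark code: FakeEstimator and fit_stages are hypothetical stand-ins invented for illustration, and the real Pipeline does much more than collect names.

```python
class FakeEstimator:
    """Hypothetical stand-in for an estimator such as DecisionTreeClassifier."""

    def __init__(self, name):
        self.name = name


def fit_stages(stages):
    """Hypothetical stand-in for a pipeline that loops over its stages."""
    return [stage.name for stage in stages]


tree = FakeEstimator("decisiontree")

# Passing the bare object fails: a plain object cannot be looped over.
try:
    fit_stages(tree)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Wrapping the object in a list works, even when there is only one stage.
fitted = fit_stages([tree])

print(error_message)
print(fitted)
```

The same reasoning applies to a one-stage pipeline in pyspark: the single estimator still has to be wrapped in a list.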