How do I use a decision tree with a dataset from a CSV file?

Asked: 2017-05-22 18:02:19

Tags: scala apache-spark apache-spark-sql apache-spark-mllib decision-tree

I want to use Spark MLlib's org.apache.spark.mllib.tree.DecisionTree as shown in the code below, but compilation fails.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.feature.{StringIndexer, IndexToString, VectorIndexer}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.SparkSession

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val data = sqlContext.read.format("csv").load("C:/spark/spark-2.1.0-bin-hadoop2.7/data/mllib/airlines.txt")
val df = sqlContext.read.csv("C:/spark/spark-2.1.0-bin-hadoop2.7/data/mllib/airlines.txt")
val dataframe = sqlContext.createDataFrame(df).toDF("label");
val splits = data.randomSplit(Array(0.7, 0.3))

val (trainingData, testData) = (splits(0), splits(1))

val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "gini"
val maxDepth = 5
val maxBins = 32
val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,impurity, maxDepth, maxBins)

Compilation fails with the following error message:

  

<console>:44: error: overloaded method value trainClassifier with alternatives:
  (input: org.apache.spark.api.java.JavaRDD[org.apache.spark.mllib.regression.LabeledPoint],numClasses: Int,categoricalFeaturesInfo: java.util.Map[Integer,Integer],impurity: String,maxDepth: Int,maxBins: Int)org.apache.spark.mllib.tree.model.DecisionTreeModel <and>
  (input: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint],numClasses: Int,categoricalFeaturesInfo: scala.collection.immutable.Map[Int,Int],impurity: String,maxDepth: Int,maxBins: Int)org.apache.spark.mllib.tree.model.DecisionTreeModel
 cannot be applied to (org.apache.spark.sql.Dataset[org.apache.spark.sql.Row], Int, scala.collection.immutable.Map[Int,Int], String, Int, Int)
       val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,impurity, maxDepth, maxBins)

1 answer:

Answer 0 (score: 1)

You are using the old RDD-based DecisionTree together with Spark SQL's new Dataset API, hence the compilation error:

  

cannot be applied to (org.apache.spark.sql.Dataset[org.apache.spark.sql.Row], Int, scala.collection.immutable.Map[Int,Int], String, Int, Int)
       val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,impurity, maxDepth, maxBins)

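Note that the first input argument is of type org.apache.spark.sql.Dataset[org.apache.spark.sql.Row], but DecisionTree.trainClassifier requires org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint].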
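If you need to stay on the RDD-based API for now, the mismatch could be bridged by mapping each Row to a LabeledPoint. The following is only a sketch for spark-shell: it assumes every CSV column is numeric and non-null and that the first column holds the class label (0 or 1), which the question does not confirm. The helper name toLabeledPoint is hypothetical.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical helper: first value is the label, the rest are features.
def toLabeledPoint(values: Seq[Double]): LabeledPoint =
  LabeledPoint(values.head, Vectors.dense(values.tail.toArray))

val spark = SparkSession.builder().getOrCreate()

// ASSUMPTION: all columns are numeric and non-null; column 0 is the label.
val df = spark.read
  .option("inferSchema", "true")
  .csv("C:/spark/spark-2.1.0-bin-hadoop2.7/data/mllib/airlines.txt")

// Dataset[Row] -> RDD[LabeledPoint], the type trainClassifier expects.
val labeled: RDD[LabeledPoint] =
  df.rdd.map(row => toLabeledPoint(row.toSeq.map(_.toString.toDouble)))

val Array(trainingData, testData) = labeled.randomSplit(Array(0.7, 0.3))

val model = DecisionTree.trainClassifier(
  trainingData, 2, Map[Int, Int](), "gini", 5, 32)
```

Categorical or string columns would need to be indexed first; this sketch does not handle them.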

Quoting Announcement: DataFrame-based API is primary API:

  

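As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package.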

Please change your code as described in Decision trees:

  

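The spark.ml implementation supports decision trees for binary and multiclass classification and for regression, using both continuous and categorical features. The implementation partitions data by rows, allowing distributed training with millions or even billions of instances.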
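As a rough illustration of that change, here is a minimal spark.ml sketch for spark-shell that reuses the parameters from the question (gini impurity, maxDepth 5, maxBins 32). It assumes the CSV has no header row, that all columns are numeric, and that the first column (_c0) is the label; none of this is stated in the question, so adjust it to the real layout of the file.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().getOrCreate()

// ASSUMPTION: no header, all columns numeric, first column (_c0) is the label.
val raw = spark.read
  .option("inferSchema", "true")
  .csv("C:/spark/spark-2.1.0-bin-hadoop2.7/data/mllib/airlines.txt")

val data = raw.withColumn("label", col("_c0").cast("double")).drop("_c0")
val featureCols = data.columns.filter(_ != "label")

// spark.ml estimators expect the features in a single vector column.
val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")

val dt = new DecisionTreeClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setImpurity("gini")
  .setMaxDepth(5)
  .setMaxBins(32)

val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

val model = new Pipeline().setStages(Array(assembler, dt)).fit(trainingData)
val predictions = model.transform(testData)

val accuracy = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
  .evaluate(predictions)
println(s"Test accuracy = $accuracy")
```

If the file contains string columns, they would first need a StringIndexer stage (already imported in the question's code) before being assembled into the feature vector.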