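Please bear in mind that I am new to Scala.
This is the example I am trying to follow: https://spark.apache.org/docs/1.5.1/ml-ann.html
It uses this dataset: https://github.com/apache/spark/blob/master/data/mllib/sample_multiclass_classification_data.txt
I loaded my .csv and used the code below to build a DataFrame for classification in Scala.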
//imports for ML
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.Row
//imports for transformation
import sqlContext.implicits._
import com.databricks.spark.csv._
import org.apache.spark.mllib.linalg.{Vector, Vectors}
//load data
val data2 = sqlContext.csvFile("/Users/administrator/Downloads/ds_15k_10-2.csv")
//Rename any one column to features
//val df2 = data.withColumnRenamed("ip_crowding", "features")
val DF2 = data2.select("gst_id_matched","ip_crowding","lat_long_dist");
scala> DF2.take(2)
res6: Array[org.apache.spark.sql.Row] = Array([0,0,0], [0,0,1628859.542])
//define a UDF that parses a String as a Double
val toDouble = udf[Double, String]( _.toDouble)
//Convert all to double
val featureDf = DF2
  .withColumn("gst_id_matched", toDouble(DF2("gst_id_matched")))
  .withColumn("ip_crowding", toDouble(DF2("ip_crowding")))
  .withColumn("lat_long_dist", toDouble(DF2("lat_long_dist")))
  .select("gst_id_matched", "ip_crowding", "lat_long_dist")
//Define the format
val toVec4 = udf[Vector, Double,Double] { (v1,v2) => Vectors.dense(v1,v2) }
//Encode the label, which is gst_id_matched
val encodeLabel = udf[Double, String] {
  case "0.0" => 0.0
  case "1.0" => 1.0
}
//Transformed dataset
val df = featureDf
  .withColumn("features", toVec4(featureDf("ip_crowding"), featureDf("lat_long_dist")))
  .withColumn("label", encodeLabel(featureDf("gst_id_matched")))
  .select("label", "features")
val splits = df.randomSplit(Array(0.6, 0.4), seed = 1234L)
val train = splits(0)
val test = splits(1)
// specify layers for the neural network:
// input layer of size 4 (features), two intermediate of size 5 and 4 and output of size 3 (classes)
val layers = Array[Int](0, 0, 0, 0)
// create the trainer and set its parameter
val trainer = new MultilayerPerceptronClassifier().setLayers(layers).setBlockSize(12).setSeed(1234L).setMaxIter(10)
// train the model
val model = trainer.fit(train)
The last line produces this error:
15/11/21 22:46:23 ERROR Executor: Exception in task 1.0 in stage 11.0 (TID 15)
java.lang.ArrayIndexOutOfBoundsException: 0
My suspicion:
When I inspect the dataset, it looks suitable for classification:
scala> df.take(2)
res3: Array[org.apache.spark.sql.Row] = Array([0.0,[0.0,0.0]], [0.0,[0.0,1628859.542]])
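However, the Apache example dataset looks different, and my transformation is not giving me what I need. Could someone help me with the dataset transformation or explain the root cause of the problem?
This is what the Apache dataset looks like: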
scala> data.take(1)
res8: Array[org.apache.spark.sql.Row] = Array([1.0,(4,[0,1,2,3],[-0.222222,0.5,-0.762712,-0.833333])])
Answer 0 (score: 6)
The root cause of the problem is an incorrect definition of the layers. When you use
val layers = Array[Int](0, 0, 0, 0)
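it means you want a network with zero nodes in every layer, which simply makes no sense. In general, the number of neurons in the input layer should equal the number of features, and each hidden layer should contain at least one neuron.
Let's rework your code along the way. First, recreate a small sample of your data: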
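The sizing rule can be sketched in plain Scala. `buildLayers` is a hypothetical helper, not part of Spark; it only illustrates the constraints. It also assumes that the output layer gets one neuron per class, which is how the Spark multilayer perceptron documentation describes the last entry of `layers`. The hidden sizes are arbitrary.

```scala
// Hypothetical helper (not a Spark API): builds a layer-size array and
// rejects the degenerate shapes that an all-zero definition would produce.
def buildLayers(numFeatures: Int, hidden: Seq[Int], numClasses: Int): Array[Int] = {
  require(numFeatures > 0, "input layer needs one neuron per feature")
  require(hidden.forall(_ > 0), "every hidden layer needs at least one neuron")
  require(numClasses > 1, "output layer needs one neuron per class")
  (Seq(numFeatures) ++ hidden ++ Seq(numClasses)).toArray
}

// Two features (ip_crowding, lat_long_dist) and two classes (0.0 and 1.0);
// the hidden sizes 5 and 4 are illustrative assumptions, not recommendations.
val exampleLayers = buildLayers(numFeatures = 2, hidden = Seq(5, 4), numClasses = 2)
println(exampleLayers.mkString(", "))  // prints "2, 5, 4, 2"
```

Calling `buildLayers(0, Seq(0, 0), 0)` fails immediately with a readable message instead of an `ArrayIndexOutOfBoundsException` deep inside a training task.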
import org.apache.spark.sql.functions.col
val df = sc.parallelize(Seq(
("0", "0", "0"), ("0", "0", "1628859.542")
)).toDF("gst_id_matched", "ip_crowding", "lat_long_dist")
Convert all columns to doubles:
val numeric = df
  .select(df.columns.map(c => col(c).cast("double").alias(c)): _*)
  .withColumnRenamed("gst_id_matched", "label")
Assemble the features:
import org.apache.spark.ml.feature.VectorAssembler
val assembler = new VectorAssembler()
  .setInputCols(Array("ip_crowding", "lat_long_dist"))
  .setOutputCol("features")
val data = assembler.transform(numeric)
data.show
// +-----+-----------+-------------+-----------------+
// |label|ip_crowding|lat_long_dist|         features|
// +-----+-----------+-------------+-----------------+
// |  0.0|        0.0|          0.0|        (2,[],[])|
// |  0.0|        0.0|  1628859.542|[0.0,1628859.542]|
// +-----+-----------+-------------+-----------------+
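The `features` column mixes two notations, which may be what made the Apache sample look different from your data. `(2,[],[])` is a sparse vector (size, indices, values) and `[0.0,1628859.542]` is a dense one. A minimal sketch, assuming the Spark 1.5.x `org.apache.spark.mllib.linalg` package used in the question, showing that both notations describe the same kind of data:

```scala
import org.apache.spark.mllib.linalg.Vectors

// (2,[],[]) is a sparse vector: size 2, no stored indices, no stored values.
val sparseZero = Vectors.sparse(2, Array.empty[Int], Array.empty[Double])
val denseZero = Vectors.dense(0.0, 0.0)

// Both expand to the same array of values.
println(sparseZero.toArray.mkString("[", ",", "]"))          // prints "[0.0,0.0]"
println(sparseZero.toArray.sameElements(denseZero.toArray))  // prints "true"

// The row from the Apache sample dataset is also just a sparse vector
// in which all four positions happen to be stored explicitly.
val sample = Vectors.sparse(4, Array(0, 1, 2, 3),
  Array(-0.222222, 0.5, -0.762712, -0.833333))
println(sample.toArray.mkString("[", ",", "]"))
```

So the shape of the transformed data was not the problem; only the layer definition was.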
Train and test the network:
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
val layers = Array[Int](2, 3, 5, 3) // Note 2 neurons in the input layer
val trainer = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)
val model = trainer.fit(data)
model.transform(data).show
// +-----+-----------+-------------+-----------------+----------+
// |label|ip_crowding|lat_long_dist|         features|prediction|
// +-----+-----------+-------------+-----------------+----------+
// |  0.0|        0.0|          0.0|        (2,[],[])|       0.0|
// |  0.0|        0.0|  1628859.542|[0.0,1628859.542]|       0.0|
// +-----+-----------+-------------+-----------------+----------+
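The snippet above predicts on the same two rows it was trained on, so it says nothing about model quality. A rough sketch of a held-out evaluation follows. It assumes a Spark 1.5.x `spark-shell`, where `sc` and `sqlContext` are predefined. The `toy` DataFrame is synthetic, made-up data that stands in for the real CSV, and the layer sizes are arbitrary.

```scala
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.linalg.Vectors
import sqlContext.implicits._

// Synthetic stand-in for the real data: 200 rows of (label, features),
// with two features per row and labels 0.0 / 1.0. Values are invented.
val toy = sc.parallelize(0 until 200).map { i =>
  val label = (i % 2).toDouble
  (label, Vectors.dense(label, (i % 10).toDouble))
}.toDF("label", "features")

// Hold out 40% of the rows for testing.
val Array(train, test) = toy.randomSplit(Array(0.6, 0.4), seed = 1234L)

// 2 input neurons (features), one hidden layer of 3, 2 output neurons (classes).
val toyTrainer = new MultilayerPerceptronClassifier()
  .setLayers(Array[Int](2, 3, 2))
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)

val toyModel = toyTrainer.fit(train)
val predictions = toyModel.transform(test)

// "f1" is the evaluator's default metric; it is set explicitly for clarity.
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("f1")
println("F1 on the held-out split: " + evaluator.evaluate(predictions))
```

With the real data, replace `toy` with the assembled `data` DataFrame and keep the first entry of `layers` equal to the number of assembled features.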