Preparing data for MultilayerPerceptronClassifier in Scala

Asked: 2015-11-21 14:41:22

Tags: scala apache-spark transformation

Please keep in mind that I am new to Scala.

Here is the example I am trying to follow: https://spark.apache.org/docs/1.5.1/ml-ann.html

It uses this dataset: https://github.com/apache/spark/blob/master/data/mllib/sample_multiclass_classification_data.txt
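For reference, the docs example loads that file, which is in LIBSVM format, roughly like this (path as in the Spark repo):

import org.apache.spark.mllib.util.MLUtils

// Load the LIBSVM-formatted sample data and convert it to a DataFrame
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_multiclass_classification_data.txt").toDF()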

I have coded my .csv with the code below to get a classification DataFrame in Scala.

//imports for ML
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.Row

//imports for transformation
import sqlContext.implicits._
import com.databricks.spark.csv._
import org.apache.spark.mllib.linalg.{Vector, Vectors}

//load data
val data2 = sqlContext.csvFile("/Users/administrator/Downloads/ds_15k_10-2.csv")

//Rename any one column to features
//val df2 = data.withColumnRenamed("ip_crowding", "features")
val DF2 = data2.select("gst_id_matched","ip_crowding","lat_long_dist");

scala> DF2.take(2)
res6: Array[org.apache.spark.sql.Row] = Array([0,0,0], [0,0,1628859.542])

//define a String-to-Double udf
val toDouble = udf[Double, String]( _.toDouble)

//Convert all to double
val featureDf = DF2
  .withColumn("gst_id_matched", toDouble(DF2("gst_id_matched")))
  .withColumn("ip_crowding", toDouble(DF2("ip_crowding")))
  .withColumn("lat_long_dist", toDouble(DF2("lat_long_dist")))
  .select("gst_id_matched", "ip_crowding", "lat_long_dist")


//Assemble the two feature columns into a single vector
val toVec4 = udf[Vector, Double, Double] { (v1, v2) => Vectors.dense(v1, v2) }

//Encode the label column (gst_id_matched) as a double
val encodeLabel = udf[Double, String] {
  case "0.0" => 0.0
  case "1.0" => 1.0
}

//Transformed dataset
val df = featureDf
  .withColumn("features", toVec4(featureDf("ip_crowding"), featureDf("lat_long_dist")))
  .withColumn("label", encodeLabel(featureDf("gst_id_matched")))
  .select("label", "features")

val splits = df.randomSplit(Array(0.6, 0.4), seed = 1234L)
val train = splits(0)
val test = splits(1)
// specify layers for the neural network: 
// input layer of size 4 (features), two intermediate of size 5 and 4 and output of size 3 (classes)
val layers = Array[Int](0, 0, 0, 0)
// create the trainer and set its parameter


val trainer = new MultilayerPerceptronClassifier().setLayers(layers).setBlockSize(12).setSeed(1234L).setMaxIter(10)
// train the model
val model = trainer.fit(train)

The last line produces this error:

15/11/21 22:46:23 ERROR Executor: Exception in task 1.0 in stage 11.0 (TID 15)
java.lang.ArrayIndexOutOfBoundsException: 0

My suspicion:

When I inspect the dataset, it looks suitable for classification:

scala> df.take(2)
res3: Array[org.apache.spark.sql.Row] = Array([0.0,[0.0,0.0]], [0.0,[0.0,1628859.542]])

But the Apache sample dataset is different, and my transformation is not giving me what I need. Could someone please help me with the dataset transformation, or help me understand the root cause of the problem?

This is what the Apache dataset looks like:

scala> data.take(1)
res8: Array[org.apache.spark.sql.Row] = Array([1.0,(4,[0,1,2,3],[-0.222222,0.5,-0.762712,-0.833333])])

1 Answer:

Answer 0 (score: 6)

The root of the problem is an incorrect definition of the layers. When you use

val layers = Array[Int](0, 0, 0, 0)

it means you want a network with zero nodes in each layer, which simply makes no sense. Generally speaking, the number of neurons in the input layer should equal the number of features, and each hidden layer should contain at least one neuron.
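As a rough sketch, you can derive the input and output layer sizes from the data instead of hard-coding them (the names here are illustrative and assume data is a DataFrame with label and features columns, like the one assembled below):

import org.apache.spark.mllib.linalg.Vector

// Input layer: one neuron per feature in the assembled vector
val numFeatures = data.first().getAs[Vector]("features").size
// Output layer: one neuron per class
val numClasses = data.select("label").distinct().count().toInt
// Hidden layer sizes are a free choice, but each needs at least one neuron
val layers = Array(numFeatures, 5, 4, numClasses)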

Let's recreate your code step by step:

import org.apache.spark.sql.functions.col

val df = sc.parallelize(Seq(
  ("0", "0", "0"), ("0", "0", "1628859.542")
)).toDF("gst_id_matched", "ip_crowding", "lat_long_dist")

Cast all the columns to doubles (mapping over df.columns and expanding the result with : _* casts every column in a single select):

val numeric = df
  .select(df.columns.map(c => col(c).cast("double").alias(c)): _*)
  .withColumnRenamed("gst_id_matched", "label")

Assemble the features:

import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
  .setInputCols(Array("ip_crowding","lat_long_dist"))
  .setOutputCol("features")

val data = assembler.transform(numeric)
data.show

// +-----+-----------+-------------+-----------------+
// |label|ip_crowding|lat_long_dist|         features|
// +-----+-----------+-------------+-----------------+
// |  0.0|        0.0|          0.0|        (2,[],[])|
// |  0.0|        0.0|  1628859.542|[0.0,1628859.542]|
// +-----+-----------+-------------+-----------------+
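Note that (2,[],[]) in the first row is just the sparse representation of a vector of two zeros: size 2, an empty index array, and an empty value array. It compares equal to the dense form, as this quick check (a standalone snippet, not part of the pipeline) shows:

import org.apache.spark.mllib.linalg.Vectors

// A size-2 sparse vector with no non-zero entries equals the dense [0.0, 0.0]
Vectors.sparse(2, Array.empty[Int], Array.empty[Double]) == Vectors.dense(0.0, 0.0) // true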

Train and test the network:

import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

val layers = Array[Int](2, 3, 5, 3) // Note 2 neurons in the input layer
val trainer = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)

val model = trainer.fit(data)
model.transform(data).show

// +-----+-----------+-------------+-----------------+----------+
// |label|ip_crowding|lat_long_dist|         features|prediction|
// +-----+-----------+-------------+-----------------+----------+
// |  0.0|        0.0|          0.0|        (2,[],[])|       0.0|
// |  0.0|        0.0|  1628859.542|[0.0,1628859.542]|       0.0|
// +-----+-----------+-------------+-----------------+----------+
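If you also want to score the model the way the docs example does, MulticlassClassificationEvaluator (already imported in your code) can compute precision over the predictions; in the 1.5.x API the metric is selected by name:

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Compare predicted labels against the true labels
val predictionAndLabels = model.transform(data).select("prediction", "label")
val evaluator = new MulticlassClassificationEvaluator().setMetricName("precision")
println("Precision: " + evaluator.evaluate(predictionAndLabels))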