Spark ML VectorAssembler() processing a DataFrame

Date: 2016-04-11 18:24:03

Tags: scala apache-spark classification pipeline

I am using a Spark ML pipeline to set up a classification model on a really wide table. This means I have to generate the code that processes the columns automatically, instead of typing out each column by hand. I am pretty much a beginner in Scala and Spark. I got stuck at the VectorAssembler() part when I tried the following:

val featureHeaders = featureHeader.collect.mkString(" ")
// convert the header RDD into a string
val featureArray = featureHeaders.split(",").toArray
val quote = "\""
val featureSIArray = featureArray.map(x => s"$quote$x$quote")
// count the elements in the header
val featureHeader_cnt = featureHeaders.split(",").toList.length


// Fit on whole dataset to include all labels in index.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

val labelIndexer = new StringIndexer().
  setInputCol("target").
  setOutputCol("indexedLabel")

val featureAssembler = new VectorAssembler().
  setInputCols(featureSIArray).
  setOutputCol("features")

val convpipeline = new Pipeline().
  setStages(Array(labelIndexer, featureAssembler))

val myFeatureTransfer = convpipeline.fit(df)

Apparently it doesn't work. I don't know what I should do to make the whole thing more automatic, or whether ML pipelines simply can't take that many columns at this point (which I doubt)?
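Since the header is already an array of names, the list that setInputCols needs can be built programmatically by simply dropping the label column. A minimal sketch in plain Scala (the header string and the label name `target` are hypothetical; no Spark is needed for this step):

```scala
object BuildFeatureCols extends App {
  // Hypothetical header line, as collected from the header RDD
  val featureHeaders = "target,age,income,score"

  // Split into column names; no quoting is applied
  val featureArray = featureHeaders.split(",")

  // Drop the label column so only feature columns are assembled
  val featureCols = featureArray.filter(_ != "target")

  println(featureCols.mkString(","))  // age,income,score
}
```

In Spark itself, `df.columns` returns the same kind of `Array[String]`, so the header-RDD round trip can usually be skipped entirely.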

3 Answers:

Answer 0: (score: 0)

You shouldn't use quotes unless the column names themselves contain quotes. Try dropping the quoting step, i.e. remove

s"$quote$x$quote"

and pass featureArray to setInputCols instead of featureSIArray.
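The reason is that interpolating quote characters produces names containing literal quotes, which will not match any column in the DataFrame. A small sketch of what the quoting step actually produces (hypothetical column names):

```scala
object QuoteDemo extends App {
  val quote = "\""
  val featureArray = Array("col_a", "col_b")
  val featureSIArray = featureArray.map(x => s"$quote$x$quote")

  // Each element now carries embedded quote characters,
  // so it no longer equals the plain column name
  println(featureSIArray.mkString(" "))  // "col_a" "col_b"
  println(featureSIArray(0) == "col_a")  // false
}
```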

Answer 1: (score: 0)

I finally figured out a way, which is not very pretty. It is to create a Vectors.dense for the features, and then create a data frame out of it.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val myDataRDDLP = inputData.map { line =>
  val indexed = line.split('\t').zipWithIndex
  // columns after index 1770 hold the feature values
  val myValues = indexed.filter(x => x._2 > 1770).map(x => x._1).map(_.toDouble)
  // the column at index 3 holds the label (shifted from 1-based to 0-based)
  val mykey = indexed.filter(x => x._2 == 3).map(x => x._1.toDouble - 1).mkString.toDouble
  LabeledPoint(mykey, Vectors.dense(myValues))
}
val training = sqlContext.createDataFrame(myDataRDDLP).toDF("label", "features")
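The per-line parsing in this answer can be exercised in plain Scala. A minimal sketch with a short hypothetical record (the answer's cutoff of 1770 is replaced by 3 here so the line stays readable):

```scala
object LineParseDemo extends App {
  // Hypothetical tab-separated record: index 3 holds a 1-based label,
  // indices above the cutoff hold the feature values
  val line = "a\tb\tc\t2.0\t1.5\t3.5"
  val cutoff = 3

  val indexed = line.split('\t').zipWithIndex
  val myValues = indexed.filter(x => x._2 > cutoff).map(x => x._1.toDouble)
  val mykey = indexed.filter(x => x._2 == 3).map(x => x._1.toDouble - 1).mkString.toDouble

  println(mykey)                   // 1.0
  println(myValues.mkString(","))  // 1.5,3.5
}
```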

Answer 2: (score: 0)

For pyspark, you can first create a list of the column names:

df_colnames = df.columns

and then use it in the VectorAssembler:

from pyspark.ml.feature import VectorAssembler

assemble = VectorAssembler(inputCols=df_colnames, outputCol='features')
df_vectorized = assemble.transform(df)