How to use Spark QuantileDiscretizer on multiple columns

Asked: 2017-04-26 16:01:28

Tags: scala dictionary apache-spark pipeline quantile

All,

I have an ML pipeline set up as follows:

import org.apache.spark.ml.feature.QuantileDiscretizer
import org.apache.spark.sql.types.{StructType,StructField,DoubleType}    
import org.apache.spark.ml.Pipeline
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import scala.util.Random

val nRows = 10000
val nCols = 1000
val data = sc.parallelize(0 to nRows-1).map { _ => Row.fromSeq(Seq.fill(nCols)(Random.nextDouble)) }
val schema = StructType((0 to nCols-1).map { i => StructField("C" + i, DoubleType, true) } )
val df = spark.createDataFrame(data, schema)
df.cache()

// Get the continuous feature names and discretize each of them
val continuous = df.dtypes.filter(_._2 == "DoubleType").map(_._1)
val discretizers = continuous.map(c => new QuantileDiscretizer().setInputCol(c).setOutputCol(s"${c}_disc").setNumBuckets(3).fit(df))
val pipeline = new Pipeline().setStages(discretizers)
val model = pipeline.fit(df)

When I run this, Spark appears to launch each discretizer as a separate job. Is there a way to run all the discretizers as a single job, with or without a pipeline? Any help is appreciated, thanks.

2 Answers:

Answer 0 (score: 3)

Support for this feature was added in Spark 2.3.0. See the release docs:

  • Multiple-column support for several feature transformers:
    • [SPARK-13030]: OneHotEncoderEstimator (Scala/Java/Python)
    • [SPARK-22397]: QuantileDiscretizer (Scala/Java)
    • [SPARK-20542]: Bucketizer (Scala/Java/Python)

You can now use setInputCols and setOutputCols to specify multiple columns, although this does not yet seem to be reflected in the official docs. With this patch, performance is greatly improved compared to fitting one job per column.

Your example can be adapted as follows:

import org.apache.spark.ml.feature.QuantileDiscretizer
import org.apache.spark.sql.types.{StructType,StructField,DoubleType}    
import org.apache.spark.ml.Pipeline
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import scala.util.Random

val nRows = 10000
val nCols = 1000
val data = sc.parallelize(0 to nRows-1).map { _ => Row.fromSeq(Seq.fill(nCols)(Random.nextDouble)) }
val schema = StructType((0 to nCols-1).map { i => StructField("C" + i, DoubleType, true) } )
val df = spark.createDataFrame(data, schema)
df.cache()

// Get the continuous feature names and discretize them all in one stage
val continuous = df.dtypes.filter(_._2 == "DoubleType").map(_._1)

val discretizer = new QuantileDiscretizer()
  .setInputCols(continuous)
  .setOutputCols(continuous.map(c => s"${c}_disc"))
  .setNumBuckets(3)

val pipeline = new Pipeline().setStages(Array(discretizer))
val model = pipeline.fit(df)
model.transform(df)
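
For reference, you can inspect a few of the bucketized output columns (a minimal sketch, assuming the shell session above; the C0/C1 names come from the generated schema):

val out = model.transform(df)
// Sketch: show a couple of input columns next to their bucketized versions
out.select("C0", "C0_disc", "C1", "C1_disc").show(5)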

Answer 1 (score: 0)

import org.apache.spark.ml.feature.QuantileDiscretizer

val data = Array((0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2))
val df = spark.createDataFrame(data).toDF("id", "hour")

val discretizer = new QuantileDiscretizer()
  .setInputCol("hour")
  .setOutputCol("result")
  .setNumBuckets(3)

val result = discretizer.fit(df).transform(df)
result.show()

Taken from the QuantileDiscretizer documentation.

This runs as a single job for a single column. The code below also runs as a single job per column, but handles multiple columns:

def discretizerFun(col: String, bucketNo: Int): org.apache.spark.ml.feature.QuantileDiscretizer = {
  new QuantileDiscretizer()
    .setInputCol(col)
    .setOutputCol(s"${col}_result")
    .setNumBuckets(bucketNo)
}


val data = Array((0, 18.0, 2.1), (1, 19.0, 14.1), (2, 8.0, 63.7), (3, 5.0, 88.3), (4, 2.2, 0.8))

val df = spark.createDataFrame(data).toDF("id", "hour", "temp")

df.show

// Fit and apply each discretizer once, rather than re-fitting "hour" twice
val hourBucketed = discretizerFun("hour", 2).fit(df).transform(df)
val res = discretizerFun("temp", 4).fit(hourBucketed).transform(hourBucketed)

res.show

The best approach would be to turn this function into a udf, but that may be a problem given the org.apache.spark.ml.feature.QuantileDiscretizer type; if it were possible, you would have a clean, concise way to apply the transformations lazily.
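
Short of a udf, one way to avoid repeating the fit/transform chain by hand is to fold the per-column discretizers over the DataFrame. This is a minimal sketch, assuming the discretizerFun defined above; note that each fit call is still eager (one job per column), and only the transform calls stay lazy:

// Sketch: chain the per-column discretizers with a foldLeft so each column
// is fitted and transformed exactly once
val colsWithBuckets = Seq(("hour", 2), ("temp", 4))
val resFold = colsWithBuckets.foldLeft(df) { case (acc, (c, buckets)) =>
  discretizerFun(c, buckets).fit(acc).transform(acc)
}
resFold.show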