How to use StringIndexer to generate numeric variables?

Time: 2017-04-10 00:19:36

Tags: apache-spark apache-spark-mllib apache-spark-ml

I was hoping to use StringIndexer to rank the 1000+ categories in my data set, generating an index that reflects relative frequency. I could then use this index as a numeric feature for my model. Unfortunately, StringIndexer by default stores metadata flagging the index as categorical, which forces my model to treat the index as a categorical variable.

Is there some way of disabling this, so the index variable can be used as a numeric variable?

Edit: I am using StringIndexer as a stage in an ML pipeline, so a solution would need to avoid manipulating the data frame directly. I will also be saving and loading this pipeline, so a custom data transformer may be impractical. I suspect this isn't possible as Spark is currently written.

1 answer:

Answer 0 (score: 4)

You can index the data and then replace the metadata. Suppose your data looks like this:

import spark.implicits._
import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer().setInputCol("raw").setOutputCol("indexed")

val df = Seq("a", "b", "b", "c", "c", "c").toDF("raw")
val indexed = indexer.fit(df).transform(df)
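
To see the categorical flag the question refers to, the metadata on the output column can be inspected. The following is a minimal sketch, assuming a local SparkSession (the `master` and `appName` values are placeholders) and the same toy data as above; it rebuilds the ML attribute from the column's schema field and checks its type.

```scala
import org.apache.spark.ml.attribute.{Attribute, AttributeType}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("inspect-metadata").getOrCreate()
import spark.implicits._

val df = Seq("a", "b", "b", "c", "c", "c").toDF("raw")
val indexed = new StringIndexer()
  .setInputCol("raw")
  .setOutputCol("indexed")
  .fit(df)
  .transform(df)

// Rebuild the ML attribute from the schema field of the output column.
val attr = Attribute.fromStructField(indexed.schema("indexed"))

// StringIndexer attaches a nominal (categorical) attribute, so this is expected to be true.
println(attr.attrType == AttributeType.Nominal)

// Raw metadata JSON, which includes the label values seen during fitting.
println(indexed.schema("indexed").metadata.json)
```

Downstream estimators that read this attribute are what end up treating the column as categorical.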

We need NumericAttribute:

import org.apache.spark.ml.attribute.NumericAttribute

and the metadata:

val meta = NumericAttribute.defaultAttr.withName("indexed").toMetadata

Finally, we can replace the metadata using the as method:

indexed.withColumn("indexed", $"indexed".as("indexed", meta))
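
The question's edit asks for something that works as a pipeline stage and survives saving and loading. One possible approach, offered only as a sketch, is to wrap the same metadata replacement in a small custom `Transformer` that mixes in `DefaultParamsWritable`. The class name `AsNumeric`, its `inputCol` parameter, and the demo object are hypothetical names introduced here, not part of Spark.

```scala
import org.apache.spark.ml.{Pipeline, Transformer}
import org.apache.spark.ml.attribute.{Attribute, AttributeType, NumericAttribute}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.param.{Param, ParamMap}
import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable}
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Hypothetical stage: replaces the ML attribute metadata of one column with a numeric attribute.
class AsNumeric(override val uid: String) extends Transformer with DefaultParamsWritable {
  def this() = this(Identifiable.randomUID("asNumeric"))

  final val inputCol = new Param[String](this, "inputCol", "column whose metadata is replaced")
  def setInputCol(value: String): this.type = set(inputCol, value)

  private def numericMeta(name: String) =
    NumericAttribute.defaultAttr.withName(name).toMetadata

  // Same trick as in the answer: re-alias the column with new metadata.
  override def transform(ds: Dataset[_]): DataFrame = {
    val name = $(inputCol)
    ds.withColumn(name, col(name).as(name, numericMeta(name)))
  }

  // Keep the declared schema consistent with what transform produces.
  override def transformSchema(schema: StructType): StructType = {
    val name = $(inputCol)
    StructType(schema.fields.map {
      case f if f.name == name => f.copy(metadata = numericMeta(name))
      case f => f
    })
  }

  override def copy(extra: ParamMap): AsNumeric = defaultCopy(extra)
}

// Companion object so that a saved pipeline can locate a reader for this stage.
object AsNumeric extends DefaultParamsReadable[AsNumeric]

object AsNumericDemo {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("as-numeric").getOrCreate()
    import spark.implicits._

    val df = Seq("a", "b", "b", "c", "c", "c").toDF("raw")
    val pipeline = new Pipeline().setStages(Array(
      new StringIndexer().setInputCol("raw").setOutputCol("indexed"),
      new AsNumeric().setInputCol("indexed")))

    val out = pipeline.fit(df).transform(df)
    val attr = Attribute.fromStructField(out.schema("indexed"))
    println(attr.attrType == AttributeType.Numeric)

    spark.stop()
  }
}
```

Because the stage only carries simple parameters, the fitted pipeline can be written with the usual `write` and `load` calls, provided the compiled class is on the classpath when loading. Classes defined interactively in a shell session may not load back reliably, so this sketch assumes the class is packaged in a jar.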