Question

I was hoping to use StringIndexer to rank the 1000+ categories in my data set, generating an index that reflects relative frequency. I could then use this index as a numeric feature for my model. Unfortunately, StringIndexer by default stores metadata flagging the index as categorical, which forces my model to treat the index as a categorical variable.

Is there some way of disabling this, so the index variable can be used as a numeric variable?

Edit: I am using StringIndexer as a stage in an ML pipeline, so a solution would need to avoid manipulating the DataFrame directly. I will also be saving and loading this pipeline, so a custom transformer may be impractical. I suspect this isn't possible as Spark is currently written.

Answer 1

You can index the data and then replace the metadata. Suppose your data looks like this:

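// Assumes a SparkSession named `spark` is in scope, as in spark-shell.
import spark.implicits._
import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer().setInputCol("raw").setOutputCol("indexed")

val df = Seq("a", "b", "b", "c", "c", "c").toDF("raw")
// Fit on the labels, then add the "indexed" column (tagged as nominal in its metadata).
val indexed = indexer.fit(df).transform(df)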

We need NumericAttribute:

import org.apache.spark.ml.attribute.NumericAttribute

and the metadata:

val meta = NumericAttribute.defaultAttr.withName("indexed").toMetadata

Finally, we can replace the metadata using the as method:

indexed.withColumn("indexed", $"indexed".as("indexed", meta))
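
To check that the replacement took effect, one option is to read the ML attribute back from the schema before and after. The following is a minimal sketch rather than a definitive recipe: it assumes a local SparkSession (the master and app name are placeholders), and the names numeric, before and after are introduced here only for illustration. With the default ordering, StringIndexer assigns index 0 to the most frequent label, so the index reflects frequency rank.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.attribute.{Attribute, NumericAttribute}
import org.apache.spark.ml.feature.StringIndexer

// Assumption: a local session for experimentation; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("indexer-metadata-check")
  .getOrCreate()
import spark.implicits._

val df = Seq("a", "b", "b", "c", "c", "c").toDF("raw")
val indexed = new StringIndexer()
  .setInputCol("raw")
  .setOutputCol("indexed")
  .fit(df)
  .transform(df)

// Replace the nominal metadata with a plain numeric attribute.
val meta = NumericAttribute.defaultAttr.withName("indexed").toMetadata
val numeric = indexed.withColumn("indexed", $"indexed".as("indexed", meta))

// Read the ML attribute back from each schema to compare its type.
val before = Attribute.fromStructField(indexed.schema("indexed"))
val after = Attribute.fromStructField(numeric.schema("indexed"))

println(s"before: nominal=${before.isNominal}, numeric=${before.isNumeric}")
println(s"after:  nominal=${after.isNominal}, numeric=${after.isNumeric}")
```

Note that this still manipulates the DataFrame directly, so it does not by itself address the pipeline constraint mentioned in the question's edit.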
