I was hoping to use StringIndexer as a means of ranking the 1000+ categories in my data set, generating an index that reflects relative frequency. I could then use this index as a numeric feature for my model. Unfortunately, StringIndexer by default stores metadata flagging the index as categorical, which forces my model to treat the index as a categorical variable.
Is there some way of disabling this, so the index variable can be used as a numeric variable?
Edit: I am using StringIndexer as a stage in an ML pipeline, so a solution would need to avoid manipulating the data frame directly. I will also be saving and loading this pipeline, so a custom data transformer may be impractical. I suspect this isn't possible with Spark as currently written.
Answer 0 (score: 4)
You can index the data and then replace the metadata. Suppose your data looks like this:
import spark.implicits._
import org.apache.spark.ml.feature.StringIndexer
val indexer = new StringIndexer().setInputCol("raw").setOutputCol("indexed")
val df = Seq("a", "b", "b", "c", "c", "c").toDF("raw")
val indexed = indexer.fit(df).transform(df)
We need NumericAttribute:
import org.apache.spark.ml.attribute.NumericAttribute
and the metadata:
val meta = NumericAttribute.defaultAttr.withName("indexed").toMetadata
Finally, we can replace the metadata using the as method:
indexed.withColumn("indexed", $"indexed".as("indexed", meta))
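To confirm that the replacement took effect, one option is to read the ML attribute back from the column's schema before and after the change. The following is a minimal sketch rather than part of the original answer: the local SparkSession setup, the app name, and the name `replaced` are assumptions added for illustration, and the rest repeats the steps above so the snippet runs on its own.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.attribute.{Attribute, NumericAttribute}
import org.apache.spark.ml.feature.StringIndexer

// Assumption: a local session for experimentation; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("metadata-check").getOrCreate()
import spark.implicits._

// Same toy data and indexer as in the answer above.
val df = Seq("a", "b", "b", "c", "c", "c").toDF("raw")
val indexed = new StringIndexer()
  .setInputCol("raw")
  .setOutputCol("indexed")
  .fit(df)
  .transform(df)

// Replace the metadata written by StringIndexer with a numeric attribute.
val meta = NumericAttribute.defaultAttr.withName("indexed").toMetadata
val replaced = indexed.withColumn("indexed", $"indexed".as("indexed", meta))

// Print the attribute type recorded in each column's metadata.
println(Attribute.fromStructField(indexed.schema("indexed")).attrType.name)
println(Attribute.fromStructField(replaced.schema("indexed")).attrType.name)
```

If the first line printed reports a nominal attribute and the second a numeric one, downstream stages that inspect column metadata should see the index as a plain numeric feature. Note that this still rewrites the data frame outside the pipeline, so it does not by itself address the save-and-load constraint mentioned in the question.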