I noticed that the ml StandardScaler does not attach metadata to the output column:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature._

val df = spark.read.option("header", true)
  .option("inferSchema", true)
  .csv("/path/to/cars.data")

val strId1 = new StringIndexer()
  .setInputCol("v7")
  .setOutputCol("v7_IDX")
val strId2 = new StringIndexer()
  .setInputCol("v8")
  .setOutputCol("v8_IDX")

val assmbleFeatures: VectorAssembler = new VectorAssembler()
  .setInputCols(Array("v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7_IDX"))
  .setOutputCol("featuresRaw")

val scalerModel = new StandardScaler()
  .setInputCol("featuresRaw")
  .setOutputCol("scaledFeatures")

val plm = new Pipeline()
  .setStages(Array(strId1, strId2, assmbleFeatures, scalerModel))
  .fit(df)

val dft = plm.transform(df)
dft.schema("scaledFeatures").metadata
This gives:
res1: org.apache.spark.sql.types.Metadata = {}
This example works with this dataset (just adjust the path in the code above).
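Is there a specific reason for this? Is this feature likely to be added to Spark in the future? Any suggestions for a workaround that does not involve duplicating the StandardScaler?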
Answer 0 (score: 2)
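While discarding the metadata is probably not the most fortunate choice, scaling indexed categorical features doesn't make any sense. The values returned by StringIndexer are just labels.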
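To make the point concrete, here is a minimal plain-Scala sketch (no Spark required) that imitates what happens when an indexed label column is standardized. The label values are made up for illustration, and breaking frequency ties alphabetically is an assumption of this sketch. It mimics StringIndexer's default frequency-descending ordering and StandardScaler's defaults (withStd = true, withMean = false).

```scala
// Hypothetical label column; the values are made up for illustration.
val labels = Seq("ford", "ford", "ford", "toyota", "toyota", "bmw")

// Mimic StringIndexer's default ordering: the most frequent label gets index 0.
// Breaking ties alphabetically is an assumption of this sketch.
val index: Map[String, Double] = labels
  .groupBy(identity)
  .toSeq
  .sortBy { case (label, occurrences) => (-occurrences.size, label) }
  .map { case (label, _) => label }
  .zipWithIndex
  .map { case (label, i) => label -> i.toDouble }
  .toMap

val indexed = labels.map(index)

// Mimic StandardScaler's defaults (withStd = true, withMean = false):
// divide each value by the corrected sample standard deviation.
val mean = indexed.sum / indexed.size
val std = math.sqrt(indexed.map(x => math.pow(x - mean, 2)).sum / (indexed.size - 1))
val scaled = indexed.map(_ / std)

// Print one scaled value per distinct label.
labels.zip(scaled).distinct.foreach { case (label, value) =>
  println(f"$label%-8s -> $value%.3f")
}
```

The scaled number attached to each label is determined only by how often that label occurs relative to the others, so it carries no quantity worth normalizing.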
If you want to scale the numeric features, that should be a separate stage:
// Assemble and scale only the numeric columns.
val numericAssembler: VectorAssembler = new VectorAssembler()
  .setInputCols(Array("v0", "v1", "v2", "v3", "v4", "v5", "v6"))
  .setOutputCol("numericFeatures")

val scaler = new StandardScaler()
  .setInputCol("numericFeatures")
  .setOutputCol("scaledNumericFeatures")

// Combine the scaled numeric vector with the indexed categorical column.
val finalAssembler: VectorAssembler = new VectorAssembler()
  .setInputCols(Array("scaledNumericFeatures", "v7_IDX"))
  .setOutputCol("features")

new Pipeline()
  .setStages(Array(strId1, strId2, numericAssembler, scaler, finalAssembler))
  .fit(df)
Keeping in mind the issue raised at the beginning of this answer, you can also try to copy the metadata:
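// The $"..." column syntax needs spark.implicits._ (already in scope in spark-shell).
import spark.implicits._

// Re-alias the scaled column and attach the metadata of the assembler output.
val result = plm.transform(df).transform(df =>
  df.withColumn(
    "scaledFeatures",
    $"scaledFeatures".as(
      "scaledFeatures",
      df.schema("featuresRaw").metadata)))

result.schema("scaledFeatures").metadata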
{"ml_attr":{"attrs":{"numeric":[{"idx":0,"name":"v0"},{"idx":1,"name":"v1"},{"idx":2,"name":"v2"},{"idx":3,"name":"v3"},{"idx":4,"name":"v4"},{"idx":5,"name":"v5"},{"idx":6,"name":"v6"}],"nominal":[{"vals":["ford","chevrolet","plymouth","dodge","amc","toyota","datsun","vw","buick","pontiac","honda","mazda","mercury","oldsmobile","peugeot","fiat","audi","chrysler","volvo","opel","subaru","saab","mercedes","renault","cadillac","bmw","triumph","hi","capri","nissan"],"idx":7,"name":"v7_IDX"}]},"num_attrs":8}}
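If the raw JSON is hard to read, the copied metadata can also be inspected through Spark's ml attribute API. This is only a sketch: it assumes `result` is the DataFrame built above and uses `AttributeGroup.fromStructField` from `org.apache.spark.ml.attribute`, which is a developer API.

```scala
import org.apache.spark.ml.attribute.{AttributeGroup, NominalAttribute}

// `result` is assumed to be the DataFrame from the snippet above.
val group = AttributeGroup.fromStructField(result.schema("scaledFeatures"))

// `attributes` is an Option, so this prints nothing when no metadata is attached.
group.attributes.foreach { attrs =>
  attrs.foreach {
    case nominal: NominalAttribute =>
      val numValues = nominal.values.map(_.length).getOrElse(0)
      println(s"${nominal.index.getOrElse(-1)}: ${nominal.name.getOrElse("?")} (nominal, $numValues values)")
    case other =>
      println(s"${other.index.getOrElse(-1)}: ${other.name.getOrElse("?")} (${other.attrType.name})")
  }
}
```

Note that the copied metadata still describes featuresRaw. After scaling, the values in slot 7 are no longer the integer indices that the nominal attribute's value list refers to, which is another reason to prefer the separate-stage pipeline.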