I have a DataFrame containing both numeric and categorical data.
To fit a Random Forest classifier, I apply the transformations below.
The problem is that I only want to scale "var1", "var2" and "var3" (the numeric features), but StandardScaler() requires a dense vector as input, like "features". So my questions are:
• Is there a way to scale only the numeric features?
If not...
• Does a features column that also scales the categorical data make sense? That is, does a Random Forest correctly interpret a scaled categorical feature? Is it OK to do this?
from pyspark.ml.feature import *
from pyspark.ml.pipeline import *
df = spark.createDataFrame([[11.1,13.2,3,'blue',1],
[1.8,45.0,8,'green',1],
[5.0,1.78,11,'yellow',1],
[0.0,56.3,15,'orange',0],
[8.0,14.6,17,'purple',0]],['var1','var2','var3','cat1','label'])
df.show()
+----+----+----+------+-----+
|var1|var2|var3| cat1|label|
+----+----+----+------+-----+
|11.1|13.2| 3| blue| 1|
| 1.8|45.0| 8| green| 1|
| 5.0|1.78| 11|yellow| 1|
| 0.0|56.3| 15|orange| 0|
| 8.0|14.6| 17|purple| 0|
+----+----+----+------+-----+
indexer = StringIndexer(inputCol="cat1", outputCol="cat1_index").setHandleInvalid("keep")
assembler = VectorAssembler(inputCols=["var1", "var2", "var3", "cat1_index"], outputCol="assembler")
scaler = MinMaxScaler(inputCol="assembler", outputCol="features")
pipeline = Pipeline(stages=[indexer, assembler, scaler])
df_transf = pipeline.fit(df).transform(df)
df_transf.select("features").show(truncate=False)
+-----------------------------------------------------------------+
|features |
+-----------------------------------------------------------------+
|[1.0,0.20946441672780633,0.0,0.25] |
|[0.16216216216216217,0.7927366104181952,0.35714285714285715,0.75]|
|[0.45045045045045046,0.0,0.5714285714285714,1.0] |
|[0.0,1.0,0.8571428571428571,0.5] |
|[0.7207207207207208,0.23514306676449012,1.0,0.0] |
+-----------------------------------------------------------------+