How can I standardize numeric variables with PySpark ML?

Asked: 2018-06-05 13:52:29

Tags: dataframe pyspark apache-spark-sql

I have a DataFrame containing both numeric and categorical data.

To fit a random forest classifier, I apply the following transformations:

  • StringIndexer() to index the categorical features
  • VectorAssembler() to build a dense feature vector
  • StandardScaler() to standardize the numeric features

The problem is that I only want to standardize "var1", "var2" and "var3" (the numeric features), but StandardScaler() takes a dense vector such as "features" as its input. So my question is:

• Is there a way to standardize only the numeric features?

And if not...

• Does a features column that includes scaled categorical data still make sense? That is, does a random forest correctly interpret categorical information after it has been normalized? Is it acceptable to do this?

from pyspark.ml.feature import *
from pyspark.ml.pipeline import *

df = spark.createDataFrame([[11.1,13.2,3,'blue',1],
                            [1.8,45.0,8,'green',1],
                            [5.0,1.78,11,'yellow',1],
                            [0.0,56.3,15,'orange',0],
                            [8.0,14.6,17,'purple',0]],['var1','var2','var3','cat1','label'])
df.show()

+----+----+----+------+-----+
|var1|var2|var3|  cat1|label|
+----+----+----+------+-----+
|11.1|13.2|   3|  blue|    1|
| 1.8|45.0|   8| green|    1|
| 5.0|1.78|  11|yellow|    1|
| 0.0|56.3|  15|orange|    0|
| 8.0|14.6|  17|purple|    0|
+----+----+----+------+-----+

indexer = StringIndexer(inputCol="cat1", outputCol="cat1_index").setHandleInvalid("keep")
assembler = VectorAssembler(inputCols=["var1", "var2", "var3", "cat1_index"], outputCol="assembler")
scaler = MinMaxScaler(inputCol="assembler", outputCol="features")
pipeline = Pipeline(stages=[indexer, assembler, scaler])
df_transf = pipeline.fit(df).transform(df)
df_transf.select("features").show(truncate=False)

+-----------------------------------------------------------------+
|features                                                         |
+-----------------------------------------------------------------+
|[1.0,0.20946441672780633,0.0,0.25]                               |
|[0.16216216216216217,0.7927366104181952,0.35714285714285715,0.75]|
|[0.45045045045045046,0.0,0.5714285714285714,1.0]                 |
|[0.0,1.0,0.8571428571428571,0.5]                                 |
|[0.7207207207207208,0.23514306676449012,1.0,0.0]                 |
+-----------------------------------------------------------------+
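One idea I am considering (a rough sketch only; "num_features" and "num_scaled" are placeholder column names I made up) is to assemble just the numeric columns into their own vector, scale that vector, and then use a second VectorAssembler to append the unscaled categorical index:

num_indexer = StringIndexer(inputCol="cat1", outputCol="cat1_index").setHandleInvalid("keep")
# Assemble only the numeric columns so the scaler never touches cat1_index
num_assembler = VectorAssembler(inputCols=["var1", "var2", "var3"], outputCol="num_features")
num_scaler = MinMaxScaler(inputCol="num_features", outputCol="num_scaled")
# VectorAssembler also accepts vector columns, so the scaled block and the
# raw categorical index can be combined into the final feature vector
final_assembler = VectorAssembler(inputCols=["num_scaled", "cat1_index"], outputCol="features")
pipeline2 = Pipeline(stages=[num_indexer, num_assembler, num_scaler, final_assembler])
pipeline2.fit(df).transform(df).select("features").show(truncate=False)

With this layout only "var1", "var2" and "var3" are rescaled, and "cat1_index" reaches the classifier unchanged.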

0 Answers:

No answers yet.