How to perform row-wise data normalization on a PySpark dataframe?

Asked: 2020-07-30 09:09:12

Tags: pyspark

I have a dataframe holding a table of user preferences:

+------+------------------+------+---------+---------+--------+------+-----+-----------+-----+-------+---------+------+----+-------+-------+-------+------+--------+---+-------+
|userId|(no genres listed)|Action|Adventure|Animation|Children|Comedy|Crime|Documentary|Drama|Fantasy|Film-Noir|Horror|IMAX|Musical|Mystery|Romance|Sci-Fi|Thriller|War|Western|
+------+------------------+------+---------+---------+--------+------+-----+-----------+-----+-------+---------+------+----+-------+-------+-------+------+--------+---+-------+
|    18|                 0|     0|        0|        0|       0|     1|    0|          0|    0|      0|        0|     0|   0|      0|      0|      1|     0|       0|  0|      0|
|    65|                 0|     9|        4|        0|       4|    12|    8|          1|   15|      6|        4|     0|   0|      0|      2|      7|     7|      10|  0|      0|
|    96|                 0|     0|       16|       16|       0|    16|    0|          0|    0|     16|        0|     0|   0|     16|      0|     16|     0|       0|  0|      0|
|   121|                 0|     8|        0|        0|       0|    69|    9|          0|   21|      0|        0|     0|   0|      0|      0|     15|     0|       5|  5|      0|
|   129|                 0|    11|       14|        0|       3|    85|    4|          4|   46|      3|        0|     2|   3|      0|     19|     28|    11|      17|  8|      0|
+------+------------------+------+---------+---------+--------+------+-----+-----------+-----+-------+---------+------+----+-------+-------+-------+------+--------+---+-------+

How can I normalize this row-wise, so that I end up with values like the following (each row scaled to unit L2 norm)?

([[0.        , 0.        , 0.        , ..., 0.        , 0.        , 0.        ],
  [0.        , 0.31799936, 0.14133305, ..., 0.35333263, 0.        , 0.        ],
  [0.        , 0.        , 0.40824829, ..., 0.        , 0.        , 0.        ],
       ...,
  [0.        , 0.        , 0.        , ..., 0.        , 0.        , 0.        ],
  [0.        , 0.06311944, 0.1577986 , ..., 0.        , 0.        , 0.        ],
  [0.        , 0.        , 0.        , ..., 0.        , 0.        , 0.        ]])

Also, how can I access each row of that table by its corresponding userId value?

1 answer:

Answer 0 (score: 1):

You can use a VectorAssembler followed by the MinMaxScaler from pyspark.ml. See the documentation here: https://spark.apache.org/docs/2.4.0/api/python/pyspark.ml.html?highlight=minmaxscaler#pyspark.ml.feature.MinMaxScaler

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import VectorAssembler, MinMaxScaler

spark = SparkSession.builder.getOrCreate()
#%%
# Toy dataframe with nine feature columns and a label column
df = spark.createDataFrame(
    [(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)],
    schema=["a", "b", "c", "d", "e", "f", "g", "h", "i", "label"])
#%%
# Assemble every column except the label into a single vector column
vecAssembler = VectorAssembler(
    inputCols=[x for x in df.columns if x != "label"],
    outputCol="features", handleInvalid="skip")
# Rescale the assembled features into the [0, 1] range
normalizer = MinMaxScaler(inputCol="features", outputCol="scaledFeatures", min=0, max=1)
pipeline_test = Pipeline(stages=[vecAssembler, normalizer])
pipeline_trained = pipeline_test.fit(df)
results = pipeline_trained.transform(df)
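
As for looking up a row by its id: the scaled vectors stay aligned with the original columns of the dataframe, so you can still select and filter on your id column as usual. A minimal sketch, reusing the toy df above and treating the "label" column as the stand-in for your userId column (an assumption for illustration):

# Inspect the scaled vectors next to the identifier column
results.select("label", "scaledFeatures").show(truncate=False)

# Fetch the row for one specific id ("label" here, userId in your dataframe)
results.filter(results["label"] == 10).select("scaledFeatures").show(truncate=False)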