I have a DataFrame containing a table of user preferences:
+------+------------------+------+---------+---------+--------+------+-----+-----------+-----+-------+---------+------+----+-------+-------+-------+------+--------+---+-------+
|userId|(no genres listed)|Action|Adventure|Animation|Children|Comedy|Crime|Documentary|Drama|Fantasy|Film-Noir|Horror|IMAX|Musical|Mystery|Romance|Sci-Fi|Thriller|War|Western|
+------+------------------+------+---------+---------+--------+------+-----+-----------+-----+-------+---------+------+----+-------+-------+-------+------+--------+---+-------+
|    18|                 0|     0|        0|        0|       0|     1|    0|          0|    0|      0|        0|     0|   0|      0|      0|      1|     0|       0|  0|      0|
|    65|                 0|     9|        4|        0|       4|    12|    8|          1|   15|      6|        4|     0|   0|      0|      2|      7|     7|      10|  0|      0|
|    96|                 0|     0|       16|       16|       0|    16|    0|          0|    0|     16|        0|     0|   0|     16|      0|     16|     0|       0|  0|      0|
|   121|                 0|     8|        0|        0|       0|    69|    9|          0|   21|      0|        0|     0|   0|      0|      0|     15|     0|       5|  5|      0|
|   129|                 0|    11|       14|        0|       3|    85|    4|          4|   46|      3|        0|     2|   3|      0|     19|     28|    11|      17|  8|      0|
+------+------------------+------+---------+---------+--------+------+-----+-----------+-----+-------+---------+------+----+-------+-------+-------+------+--------+---+-------+
How can I normalize this row-wise, so that I get a DataFrame in the following format:
([[0. , 0. , 0. , ..., 0. , 0. , 0. ],
[0. , 0.31799936, 0.14133305, ..., 0.35333263, 0. , 0. ],
[0. , 0. , 0.40824829, ..., 0. , 0. , 0. ],
...,
[0. , 0. , 0. , ..., 0. , 0. , 0. ],
[0. , 0.06311944, 0.1577986 , ..., 0. , 0. , 0. ],
[0. , 0. , 0. , ..., 0. , 0. , 0. ]])
And how can I access each row of that table by its corresponding userId value?
Answer 0 (score: 1)
You can use a VectorAssembler and then pyspark.ml's MinMaxScaler. See the documentation: https://spark.apache.org/docs/2.4.0/api/python/pyspark.ml.html?highlight=minmaxscaler#pyspark.ml.feature.MinMaxScaler
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import VectorAssembler, MinMaxScaler
from pyspark.sql.types import *
#%%
# Toy DataFrame with nine feature columns and a label column
df = sqlContext.createDataFrame([(1,2,3,4,5,6,7,8,9,10),(2,3,4,5,6,7,8,9,10,11)],
                                schema=["a","b","c","d","e","f","g","h","i","label"])
#%%
# Assemble every column except 'label' into a single feature vector
vecAssembler = VectorAssembler(inputCols=[x for x in df.columns if x != 'label'],
                               outputCol="features", handleInvalid='skip')
# Rescale each feature to the [0, 1] range
normalizer = MinMaxScaler(inputCol="features", outputCol="scaledFeatures", min=0, max=1)
pipeline_test = Pipeline(stages=[vecAssembler, normalizer])
pipeline_trained = pipeline_test.fit(df)
results = pipeline_trained.transform(df)
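Applied to the user-preference table in the question, the same pipeline only needs userId left out of the assembled columns; the key column passes through the transform untouched, so individual rows can still be fetched by userId. A minimal sketch, assuming the table is loaded as a DataFrame named user_prefs (the name and the example userId are only illustrative):

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, MinMaxScaler
from pyspark.sql import functions as F

# Assemble every genre column (everything except userId) into one vector
genre_cols = [c for c in user_prefs.columns if c != "userId"]
assembler = VectorAssembler(inputCols=genre_cols, outputCol="features")
scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")

scaled = Pipeline(stages=[assembler, scaler]).fit(user_prefs).transform(user_prefs)

# userId is carried through unchanged, so any row can be looked up by key
scaled.filter(F.col("userId") == 65).select("userId", "scaledFeatures").show(truncate=False)

Note that MinMaxScaler rescales each genre column to [0, 1] across all users. If the goal is the per-row unit-length vectors shown in the question's NumPy output, pyspark.ml.feature.Normalizer (with p=2.0) can be swapped in for the MinMaxScaler stage: it divides each row's feature vector by its L2 norm, which is a row-wise normalization.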