Below is my PySpark code for building a PCA model on a sample dataset. However, this code produces ONE global model for the whole dataset. Instead, I would like to build multiple models, i.e. one PCA model per "userid" (similar to personalized models).
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StandardScaler
from pyspark.ml.feature import PCA
#create a dataframe
data = spark.createDataFrame([('a',1,2,3),('a',2,3,4),('b',4,5,6),('b',6,7,8),('c',8,0,9),('c',10,12,3)],["userid","f1","f2","f3"])
#create a feature vector
cols = data.drop('userid').columns
assembler = VectorAssembler(inputCols=cols, outputCol = 'features')
output_dat = assembler.transform(data).select('userid', 'features')
output_dat.show(5, truncate = False)
#Scale the feature vector
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures", withStd=False, withMean=True)
# Compute summary statistics by fitting the StandardScaler
scalerModel = scaler.fit(output_dat)
scaledData = scalerModel.transform(output_dat)
scaledData.select(['userid', 'scaledFeatures']).show(5, truncate = False)
#Fit PCA
pca = PCA(k=2, inputCol = scaler.getOutputCol(), outputCol="pcaFeatures")
model = pca.fit(scaledData)
transformed_feature = model.transform(scaledData)
#Output
transformed_feature.select('userid','pcaFeatures').show(10, truncate = False)
To the best of my knowledge, there are two options for generating multiple PCA models in PySpark.
Could someone help me modify the PySpark code above so that it trains a separate PCA model per userid?
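For reference, here is a rough sketch of one possible direction, not a definitive solution: collect the distinct userid values and fit a separate Spark ML PCA model on each user's subset. The names user_ids and per_user_models are my own, and reusing the globally scaled scaledData is an assumption; for truly personalized models the scaling would probably also need to be done per user.

from pyspark.ml.feature import PCA

# Sketch: one PCA model per userid (assumes the scaledData DataFrame from above).
# Collecting the distinct userids to the driver and looping only works for a
# modest number of users; it is not a distributed solution.
user_ids = [row['userid'] for row in scaledData.select('userid').distinct().collect()]

per_user_models = {}
for uid in user_ids:
    user_data = scaledData.filter(scaledData.userid == uid)
    pca = PCA(k=2, inputCol='scaledFeatures', outputCol='pcaFeatures')
    per_user_models[uid] = pca.fit(user_data)

# Apply each user's own model to that user's rows
for uid, model in per_user_models.items():
    model.transform(scaledData.filter(scaledData.userid == uid)) \
         .select('userid', 'pcaFeatures') \
         .show(truncate=False)

A more scalable alternative might be groupBy('userid').applyInPandas(...) with a per-group PCA from a library such as scikit-learn, but the sketch above stays closest to the Spark ML pipeline already shown.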