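How to convert a PythonRDD with sparse data into a dense PythonRDD

Time: 2016-05-21 04:26:14

Tags: python apache-spark pyspark apache-spark-mllib

I want to use StandardScaler to scale my data. I have loaded the data into a PythonRDD, and it appears to be sparse. To apply StandardScaler, it seems we first have to convert it to a dense type.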

from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.util import MLUtils

trainData = MLUtils.loadLibSVMFile(sc, trainDataPath)
valData = MLUtils.loadLibSVMFile(sc, valDataPath)
trainLabel = trainData.map(lambda x: x.label)
trainFeatures = trainData.map(lambda x: x.features)
valLabel = valData.map(lambda x: x.label)
valFeatures = valData.map(lambda x: x.features)
scaler = StandardScaler(withMean=True, withStd=True).fit(trainFeatures)

# Apply the scaler to the data. trainFeatures is a sparse PythonRDD, so it must first be converted to a dense type.
trainFeatures_scaled = scaler.transform(trainFeatures)
valFeatures_scaled = scaler.transform(valFeatures)

# Merge `trainLabel` and `trainFeatures_scaled` into a new PythonRDD
trainData1 = ...
valData1 = ...

# Use the scaled data, i.e., trainData1 and valData1, to train a model
...

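The code above has errors. I have two questions:

  1. How can I convert the sparse PythonRDD trainFeatures into a dense type that can be used as input to StandardScaler?
  2. How can I merge trainLabel and trainFeatures_scaled into new LabeledPoint objects that can be used to train a classifier (e.g., a random forest)?

I have not been able to find any documentation or references on this problem.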

1 answer:

Answer 0: (score: 2)

Use toArray inside a map to convert to dense vectors:

from pyspark.mllib.linalg import DenseVector
dense = valFeatures.map(lambda v: DenseVector(v.toArray()))
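
To see why the conversion matters: with withMean=True the scaler subtracts each column's mean, which turns almost every implicit zero into a nonzero value, so the output cannot stay sparse. The plain-Python sketch below (no Spark needed) illustrates this on a toy dataset. The names to_dense and standardize and the (size, {index: value}) representation are made up for illustration and are not MLlib APIs; using the sample standard deviation (n - 1) is an assumption about how the scaling is defined.

```python
import math


def to_dense(size, entries):
    """Expand a sparse vector given as {index: value} into a full list."""
    return [entries.get(i, 0.0) for i in range(size)]


def standardize(rows, with_mean=True, with_std=True):
    """Column-wise (x - mean) / std over a list of equal-length dense rows."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    # Sample standard deviation (n - 1 in the denominator); an assumption here.
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
        for j in range(dims)
    ]
    out = []
    for r in rows:
        scaled = []
        for j in range(dims):
            x = r[j] - means[j] if with_mean else r[j]
            if with_std:
                # Guard against constant columns (std == 0).
                x = x / stds[j] if stds[j] != 0 else 0.0
            scaled.append(x)
        out.append(scaled)
    return out


# Three sparse rows of size 3; most entries are implicit zeros.
sparse_rows = [(3, {0: 1.0}), (3, {1: 2.0}), (3, {0: 3.0, 2: 6.0})]
dense_rows = [to_dense(size, entries) for size, entries in sparse_rows]

# After centering, none of the former zeros is zero any more.
centered = standardize(dense_rows, with_mean=True, with_std=False)
for row in centered:
    print(row)
```

In this toy example every entry of the centered rows is nonzero, which is the reason a dense representation is needed before centering.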

Merge with zip:

from pyspark.mllib.regression import LabeledPoint
valLabel.zip(dense).map(lambda lf: LabeledPoint(lf[0], lf[1]))  # lf is a (label, features) pair
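
Putting the two steps together, a minimal sketch of a helper that scales one dataset with an already fitted scaler might look like the following. The function names densify and scale_labeled_points are hypothetical. The sketch assumes that `data` is an RDD of LabeledPoint (as returned by MLUtils.loadLibSVMFile) and that `scaler` is a model obtained from StandardScaler(...).fit on dense training features. The PySpark imports sit inside the functions only so that the definitions can be loaded without Spark installed.

```python
def densify(features):
    """Map an RDD of MLlib vectors (sparse or dense) to an RDD of DenseVector."""
    from pyspark.mllib.linalg import DenseVector

    return features.map(lambda v: DenseVector(v.toArray()))


def scale_labeled_points(scaler, data):
    """Return an RDD of LabeledPoint whose features were scaled by `scaler`.

    `data` is assumed to be an RDD of LabeledPoint; `scaler` is assumed to be
    a fitted StandardScaler model.
    """
    from pyspark.mllib.regression import LabeledPoint

    labels = data.map(lambda p: p.label)
    dense = densify(data.map(lambda p: p.features))
    scaled = scaler.transform(dense)
    # zip pairs elements by position; both RDDs are derived from `data`
    # without filtering or reshuffling, which is what zip relies on.
    return labels.zip(scaled).map(lambda lf: LabeledPoint(lf[0], lf[1]))


# Usage sketch (assumes sc, trainData and valData from the question):
#   from pyspark.mllib.feature import StandardScaler
#   train_dense = densify(trainData.map(lambda x: x.features))
#   scaler = StandardScaler(withMean=True, withStd=True).fit(train_dense)
#   train_scaled = scale_labeled_points(scaler, trainData)
#   val_scaled = scale_labeled_points(scaler, valData)
```

Fitting the scaler on the training features only and reusing it for the validation set mirrors what the question's code does.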