I am using PySpark and I want to run a linear regression from the MLlib package, so I want to generate my own (large) data set to compare my cluster's performance against a single-node Python interpreter.
from pyspark.mllib.random import RandomRDDs
u=RandomRDDs.normalVectorRDD(sc, 1000000000, 500)
u.take(5)
I get:
array([ -1.13787491e+00, 3.68202613e-01, 9.59762136e-01,
6.33172122e-01, -1.91278957e+00, -1.17794680e+00,
-7.77179759e-01, -1.48368585e+00, 2.32369644e+00,...]
I want to parse this into LabeledPoint data so that the LinearRegressionWithSGD algorithm can consume it. Each row should look like this:
LabeledPoint(0.469112,[-0.282863,-1.509059,-1.135632,1.212112,-0.173215,0.119209,-1.044236,-0.861849,-2.104569,-0.494929,1.071804,0.721555,-0.706771,-1.039575,0.27186,-0.424972,0.56702,0.276232,-1.087401,-0.67369,0.113648,-1.478427,0.524988,0.404705])
with the first value as the target (label) and the rest as the features.
Answer 0 (score: 1)
Try this:
from pyspark.mllib.regression import LabeledPoint

# first element of each random vector becomes the label, the rest the features
u.map(lambda x: LabeledPoint(x[0], x[1:]))
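If the end goal is to feed the result into LinearRegressionWithSGD, a minimal sketch of the full flow could look like the following. The variable names (labeledData, model) and the smaller row/column counts are only illustrative assumptions, not part of the question:

from pyspark.mllib.random import RandomRDDs
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

# smaller dimensions than in the question, just for illustration
u = RandomRDDs.normalVectorRDD(sc, numRows=10000, numCols=50)

# first element of each random vector is the label, the rest the feature vector
labeledData = u.map(lambda x: LabeledPoint(x[0], x[1:]))

# train a linear regression model with SGD on the labeled RDD
model = LinearRegressionWithSGD.train(labeledData, iterations=100, step=0.01)

# inspect a prediction for the first point
first = labeledData.first()
print(model.predict(first.features))

Since the features here are random noise with no relation to the label, the fit itself is meaningless; the point is only to exercise the cluster on a data set of a chosen size.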