PySpark: Converting an RDD[DenseVector] to a DataFrame

Asked: 2016-09-17 00:00:30

Tags: python pyspark apache-spark-sql apache-spark-mllib apache-spark-ml

I have the following RDD:

rdd.take(5) gives me:

[DenseVector([9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699]),
 DenseVector([9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699]),
 DenseVector([5.0, 20.0, 0.3444, 0.3295, 54.3122, 4.0, 4.0, 9.0]),
 DenseVector([9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699]),
 DenseVector([9.2463, 2.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699])]

I would like to turn it into a DataFrame that looks like this:

-------------------------------------------------------------------
| features                                                        |
-------------------------------------------------------------------
| [9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699] |
|-----------------------------------------------------------------|
| [9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699] |
|-----------------------------------------------------------------| 
| [5.0, 20.0, 0.3444, 0.3295, 54.3122, 4.0, 4.0, 9.0]             |
|-----------------------------------------------------------------|
| [9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699] |
|-----------------------------------------------------------------|
| [9.2463, 2.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699] |
|-----------------------------------------------------------------|

Is this possible? I tried df_new = sqlContext.createDataFrame(rdd, ['features']), but it didn't work. Does anyone have any suggestions? Thanks!

1 Answer:

Answer (score: 4)

Map to tuples first:

rdd.map(lambda x: (x, )).toDF(["features"])
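
For context, a minimal end-to-end sketch of the fix, assuming Spark 2.0+ and an active SparkSession named spark (the sample vectors are copied from the question). The direct createDataFrame(rdd, ['features']) call fails because each RDD element must be row-like (a tuple, list, or Row), and a bare DenseVector is a single value rather than a row:

from pyspark.ml.linalg import DenseVector

# Hypothetical RDD of bare DenseVectors, mirroring the question's data.
rdd = spark.sparkContext.parallelize([
    DenseVector([9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432, 8.3397, 11.7699]),
    DenseVector([5.0, 20.0, 0.3444, 0.3295, 54.3122, 4.0, 4.0, 9.0]),
])

# Wrap each vector in a one-element tuple so every element becomes a row
# with a single "features" column.
df = rdd.map(lambda x: (x, )).toDF(["features"])

df.printSchema()
# root
#  |-- features: vector (nullable = true)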

Keep in mind that, as of Spark 2.0, there are two different Vector implementations: the legacy RDD-based vectors in pyspark.mllib.linalg and the DataFrame-based vectors in pyspark.ml.linalg. The ml algorithms require pyspark.ml.linalg.Vector.
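
A short sketch of that distinction (module paths are the standard Spark 2.x ones; the asML() conversion method exists on mllib vectors since Spark 2.0):

from pyspark.mllib.linalg import Vectors as MLlibVectors    # legacy RDD-based vectors
from pyspark.ml.linalg import DenseVector as MLDenseVector  # DataFrame-based vectors

old = MLlibVectors.dense([1.0, 2.0, 3.0])

# Since Spark 2.0, mllib vectors expose asML() to convert to the ml type.
new = old.asML()
print(isinstance(new, MLDenseVector))  # True

To convert the vector columns of a whole DataFrame instead, pyspark.mllib.util.MLUtils.convertVectorColumnsToML(df, "features") does the same in one call.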