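What is the best way to generalize the conversion from RDD[Vector] to DataFrame with Scala / Spark 1.6? The inputs are different RDD[Vector]s, and the number of columns in the Vector can range from 1 to n across those RDDs.
I tried using the shapeless library, but it requires the number of columns and their types to be declared up front. For example: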
val df = rddVector.map(_.toArray.toList)
  .collect {
    case t: List[Double] if t.length == 3 => t.toHList[Double :: Double :: Double :: HNil].get.tupled.productArity
  }
  .toDF("column_1", "column_2", "column_3")
Answer 0 (score: 3)
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Create a vector RDD whose vectors have different lengths
val vectorRDD = sc.parallelize(Seq(Seq(123L, 345L), Seq(567L, 789L), Seq(567L, 789L, 233334L))).
  map(s => Vectors.dense(s.map(_.toDouble).toArray))

// Calculate the maximum vector length; it determines the number of columns
val vectorLength = vectorRDD.map(x => x.toArray.length).max()

// Create the dynamic schema: one nullable double column per position
var schema = new StructType()
var i = 0
while (i < vectorLength) {
  schema = schema.add(StructField(s"val${i}", DoubleType, true))
  i = i + 1
}

// Create a rowRDD and make each row have the same arity
val rowRDD = vectorRDD.map { x =>
  // A new Array[Double] is zero-filled, so shorter vectors are padded with 0.0
  val row = new Array[Double](vectorLength)
  val newRow = x.toArray
  System.arraycopy(newRow, 0, row, 0, newRow.length)
  Row.fromSeq(row)
}

// Create your DataFrame
val dataFrame = sqlContext.createDataFrame(rowRDD, schema)
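dataFrame.printSchema()
root
 |-- val0: double (nullable = true)
 |-- val1: double (nullable = true)
 |-- val2: double (nullable = true)

dataFrame.show()
+-----+-----+--------+
| val0| val1|    val2|
+-----+-----+--------+
|123.0|345.0|     0.0|
|567.0|789.0|     0.0|
|567.0|789.0|233334.0|
+-----+-----+--------+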
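
The same steps (find the widest vector, build a schema of that width, pad every row to it) could be wrapped in a reusable function so that any RDD[Vector] can be converted without declaring the column count. The following is only a sketch under assumptions: the object name VectorFrames, the helpers padRow and toDataFrame, and the prefix parameter are hypothetical names introduced here for illustration and are not part of Spark or of the answer above.

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Hypothetical helper object, not part of Spark.
object VectorFrames {

  // Pad a row with `fill` until it has at least `width` entries.
  // Rows that are already `width` or longer are returned unchanged.
  def padRow(values: Array[Double], width: Int, fill: Double = 0.0): Array[Double] =
    values.padTo(width, fill)

  // Convert vectors of varying length into a DataFrame with columns
  // prefix0, prefix1, ... up to the widest vector in the RDD.
  def toDataFrame(vectors: RDD[Vector],
                  sqlContext: SQLContext,
                  prefix: String = "val"): DataFrame = {
    // One extra pass over the data; max() throws on an empty RDD.
    val width = vectors.map(_.size).max()

    // One nullable double column per vector position.
    val schema = StructType(
      (0 until width).map(i => StructField(s"$prefix$i", DoubleType, nullable = true))
    )

    // Give every row the same arity before attaching the schema.
    val rows = vectors.map(v => Row.fromSeq(padRow(v.toArray, width).toSeq))

    sqlContext.createDataFrame(rows, schema)
  }
}
```

With the vectorRDD defined above, the call would be VectorFrames.toDataFrame(vectorRDD, sqlContext). Note that padding with 0.0 makes a missing position indistinguishable from a genuine zero; if that matters, a different fill value such as Double.NaN may be a better choice.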