Converting a List or RDD to a DataFrame in Spark-Scala

Time: 2017-06-13 19:36:53

Tags: scala apache-spark dataframe rdd

So basically what I want to achieve is this: I have a table with (say) 4 columns, which I expose as a DataFrame, DF1. Now I want to store each row of DF1 into another Hive table (basically DF2, whose schema is Column1, Column2, Column3), where the Column3 value is the "-@"-delimited row of DataFrame DF1.

val df = hiveContext.sql("from hive_table SELECT *")
val writeToHiveDf = df.filter(new Column("id").isNotNull)

var builder: List[(String, String, String)] = Nil
var finalOne = new ListBuffer[List[(String, String, String)]]()
writeToHiveDf.rdd.collect().foreach { row =>
  val item = row.mkString("-@")
  builder = List(List("dummy", "NEVER_NULL_CONSTRAINT", "some alpha")).map { case List(a, b, c) => (a, b, c) }
  finalOne += builder
}

Now I have finalOne as a list of lists, which I want to convert to a DataFrame, either directly or via an RDD.

var listRDD = sc.parallelize(finalOne) //Converts to RDD - It works. 
val dataFrameForHive : DataFrame = listRDD.toDF("table_name", "constraint_applied", "data") //Doesn't work

Error:

java.lang.ClassCastException: org.apache.spark.sql.types.ArrayType cannot be cast to org.apache.spark.sql.types.StructType
    at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:414)
    at org.apache.spark.sql.SQLImplicits.rddToDataFrameHolder(SQLImplicits.scala:94)
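The exception makes sense once the element type is spelled out: each element of finalOne is itself a List, which Spark maps to a single ArrayType column rather than a three-field StructType row. A minimal plain-Scala sketch of the shape involved (no Spark needed; the sample values mirror the loop above):

```scala
import scala.collection.mutable.ListBuffer

object ShapeDemo extends App {
  // Mirror of the loop above: each element appended to finalOne is
  // itself a List of tuples, not a bare tuple.
  val finalOne = new ListBuffer[List[(String, String, String)]]()
  finalOne += List(("dummy", "NEVER_NULL_CONSTRAINT", "some alpha"))
  finalOne += List(("dummy", "NEVER_NULL_CONSTRAINT", "other alpha"))

  // An RDD of List[...] gives toDF an ArrayType element, so naming
  // 3 columns fails with the ClassCastException above. Flattening
  // yields one tuple per row, which toDF can map to 3 columns.
  val flattened: List[(String, String, String)] = finalOne.toList.flatten

  assert(flattened.size == 2)
  assert(flattened.head == ("dummy", "NEVER_NULL_CONSTRAINT", "some alpha"))
  println(flattened)
}
```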

Can someone help me understand the correct way to convert this to a DataFrame? Thanks a lot for your support.

2 answers:

Answer 0 (score: 1)

If you want 3 columns of type String in your dataframe, you should flatten the List[List[(String,String,String)]] into a List[(String,String,String)]:

var listRDD = sc.parallelize(finalOne.flatten) // makes List[(String,String,String)]
val dataFrameForHive : DataFrame = listRDD.toDF("table_name", "constraint_applied", "data") 
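As a side note, the driver-side collect() in the question isn't needed for this at all: the same per-row transformation can be expressed as one map over the RDD (writeToHiveDf.rdd.map(...).toDF(...)), which also avoids pulling the whole table onto the driver. A sketch of the row-to-tuple step, with plain Seqs standing in for Spark Rows (an assumption for illustration; not run against a cluster):

```scala
object RowMapDemo extends App {
  // Stand-in for writeToHiveDf.rdd: each "row" is a Seq of column values.
  val rows = Seq(Seq("1", "a", "b", "c"), Seq("2", "x", "y", "z"))

  // Same per-row transformation as the question's loop, as one map; in
  // Spark this would be writeToHiveDf.rdd.map(...) followed by
  // .toDF("table_name", "constraint_applied", "data").
  val tuples = rows.map(r => ("dummy", "NEVER_NULL_CONSTRAINT", r.mkString("-@")))

  assert(tuples.head._3 == "1-@a-@b-@c")
  println(tuples)
}
```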

Answer 1 (score: 0)

I believe flattening "finalOne" before passing it into the sc.parallelize() function should produce results consistent with what you expect.

var listRDD = sc.parallelize(finalOne.flatten)

val dataFrameForHive : DataFrame = listRDD.toDF("table_name", "constraint_applied", "data")