Convert an RDD to a DataFrame

Date: 2016-11-09 19:01:55

Tags: scala apache-spark dataframe rdd

I have an RDD like this:

RDD[(Any, Array[(Any, Any)])]

and I just want to convert it into a DataFrame, so I use this schema:

val schema = StructType(Array(StructField("C1", StringType, true), StructField("C4", ArrayType(StringType, false), false)))

My RDD is built like this:

val df = Seq(
  ("A",1,"12/06/2012"),
  ("A",2,"13/06/2012"),
  ("B",3,"12/06/2012"),
  ("B",4,"17/06/2012"),
  ("C",5,"14/06/2012")).toDF("C1", "C2","C3")
df.show(false)

val rdd = df.map( line => ( line(0), (line(1), line(2))))
  .groupByKey()
  .mapValues(i => i.toList).foreach(println)

which prints:

(B,List((3,12/06/2012), (4,17/06/2012)))
(A,List((1,12/06/2012), (2,13/06/2012)))
(C,List((5,14/06/2012)))

or, if I use

.mapValues(i => i.toArray)

instead, like this:

(A,[Lscala.Tuple2;@3e8f27c9)
(C,[Lscala.Tuple2;@6f22defb)
(B,[Lscala.Tuple2;@1b8692ec)

I have already tried:

val output_df = sqlContext.createDataFrame(rdd, schema)

but I get:

Error:(40, 32) overloaded method value createDataFrame with alternatives:
  (data: java.util.List[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
  (rdd: org.apache.spark.api.java.JavaRDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
  (rdd: org.apache.spark.rdd.RDD[_],beanClass: Class[_])org.apache.spark.sql.DataFrame <and>
  (rows: java.util.List[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame <and>
  (rowRDD: org.apache.spark.api.java.JavaRDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame <and>
  (rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row],schema: org.apache.spark.sql.types.StructType)org.apache.spark.sql.DataFrame
 cannot be applied to (Unit, org.apache.spark.sql.types.StructType)
    val output_df = sqlContext.createDataFrame(rdd, schema)

EDIT (reply to Raphael Roth): I tried the second approach and it does not work; I get:

Error:(41, 24) No TypeTag available for MySchema
    val newdf = rdd.map(line => MySchema(line._1.toString, line._2.asInstanceOf[List[(Int, String)]])).toDF()

The first approach works fine, but I lose the first element of each tuple because of:

.mapValues(i => i.map(_._2))

Do you know whether the first approach can be made to keep both elements?

EDIT 2: I solved it by converting my tuples to strings, but in my opinion that is not an elegant solution, because I then have to split those strings again to read the columns.
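The workaround the asker describes (encoding each tuple as a single string so it fits `ArrayType(StringType)`, then splitting on read) can be sketched in plain Scala. The delimiter here is hypothetical; the original does not say how the tuples were stringified:

```scala
object StringifiedTuples {
  def main(args: Array[String]): Unit = {
    val tuples = List((3, "12/06/2012"), (4, "17/06/2012"))
    // encode each (Int, String) as one delimited string so it fits ArrayType(StringType)
    val encoded: List[String] = tuples.map { case (n, d) => s"$n|$d" }
    // reading a column back means splitting every string again
    val decoded: List[(Int, String)] =
      encoded.map { s => val Array(n, d) = s.split('|'); (n.toInt, d) }
    println(decoded == tuples) // true: the round trip works, at the cost of the extra split
  }
}
```

This illustrates why the asker finds it inelegant: every read of the column pays for a parse step that a properly typed schema would avoid.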

Thanks for your help

1 Answer:

Answer 0 (score: 0)

groupByKey gives you a Seq of tuples, which your schema does not take into account. Furthermore, sqlContext.createDataFrame needs an RDD[Row], which you are not providing.
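One detail worth spelling out: the `cannot be applied to (Unit, ...)` part of the compile error arises because the question's pipeline ends in `.foreach(println)`, and `foreach` returns `Unit`, so the `rdd` val holds no data at all. A minimal plain-Scala illustration (no Spark needed):

```scala
object ForeachReturnsUnit {
  def main(args: Array[String]): Unit = {
    // foreach runs purely for its side effect and returns Unit,
    // so assigning its result captures nothing from the collection.
    val r = Seq(1, 2, 3).map(_ * 2).foreach(println)
    println(r == ()) // true: r is Unit, not the transformed Seq
  }
}
```

Dropping `.foreach(println)` from the assignment (and printing separately if needed) leaves `rdd` with an actual value to pass to `createDataFrame`.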

This should work using your schema:

val rdd = df.map(line => (line(0), (line(1), line(2))))
  .groupByKey()
  .mapValues(i => i.map(_._2))
  .map(i=>Row(i._1,i._2))

val output_df = sqlContext.createDataFrame(rdd, schema)
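For clarity, the `.mapValues(i => i.map(_._2))` step above keeps only the second element of every `(Int, String)` pair, which is why the result fits the schema's `ArrayType(StringType)` but loses the Int (the asker's complaint). A plain-Scala sketch of that transformation, with sample data mirroring the question's rows (no Spark needed):

```scala
object DropFirstElementDemo {
  def main(args: Array[String]): Unit = {
    // (key, (C2, C3)) pairs, as produced by the question's first map
    val pairs = List(("B", (3, "12/06/2012")), ("B", (4, "17/06/2012")), ("C", (5, "14/06/2012")))
    // groupBy plus an inner map stands in for Spark's groupByKey
    val grouped: Map[String, List[(Int, String)]] =
      pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2)) }
    // mapValues(i => i.map(_._2)): keep only the date string, drop the Int
    val datesOnly: Map[String, List[String]] =
      grouped.map { case (k, vs) => (k, vs.map(_._2)) }
    println(datesOnly("B")) // List(12/06/2012, 17/06/2012)
  }
}
```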

You can also use a case class to map your tuples (I am not sure whether a schema for tuples can be created programmatically):

val df = Seq(
  ("A", 1, "12/06/2012"),
  ("A", 2, "13/06/2012"),
  ("B", 3, "12/06/2012"),
  ("B", 4, "17/06/2012"),
  ("C", 5, "14/06/2012")).toDF("C1", "C2", "C3")
df.show(false)

val rdd = df.map(line => (line(0), (line(1), line(2))))
  .groupByKey()
  .mapValues(i => i.toList)

// this should be placed outside of main()
case class MySchema(C1: String, C4: List[(Int, String)])

val newdf = rdd.map(line => MySchema(line._1.toString, line._2.asInstanceOf[List[(Int, String)]])).toDF()