How to create a Spark DataFrame from an RDD[Row] when the Row contains complex types

Asked: 2019-03-17 11:51:14

Tags: scala apache-spark

I have an RDD[HbaseRecord] that contains a custom complex type Name. Both classes are defined as follows:

class HbaseRecord(
      val uuid: String,
      val timestamp: String,
      val name: Name
)

class Name(
    val firstName: String,
    val middleName: String,
    val lastName: String
)

At some point in the code I want to generate a DataFrame from that RDD, so that I can save it as an Avro file. I tried the following:

//I get an Object from Hbase here
val objectRDD : RDD[HbaseRecord] = ... 

//I convert the RDD[HbaseRecord] into RDD[Row]
val rowRDD : RDD[Row] = objectRDD.map(
    hbaseRecord => {
      val uuid : String = hbaseRecord.uuid
      val timestamp : String = hbaseRecord.timestamp
      val name : Name = hbaseRecord.name

      Row(uuid, timestamp, name)
    })

//Here I define the schema
val schema = new StructType()
    .add("uuid", StringType)
    .add("timestamp", StringType)
    .add("name", new StructType()
        .add("firstName", StringType)
        .add("middleName", StringType)
        .add("lastName", StringType))

//Now I try to create a Dataframe using the RDD[Row] and the schema
val dataFrame = sqlContext.createDataFrame(rowRDD , schema)
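
(For context, once createDataFrame succeeds, the Avro save mentioned above might look like the sketch below. It assumes a Spark 1.x setup with the com.databricks:spark-avro package on the classpath; the output path is a placeholder.)

//Minimal sketch, assuming the spark-avro package is on the classpath
//("com.databricks.spark.avro" is the Spark 1.x format name);
//the output path is a placeholder
dataFrame.write
  .format("com.databricks.spark.avro")
  .save("/tmp/hbase-records-avro")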

But I get the following error:

scala.MatchError: (of class java.lang.String)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:255)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
    at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)
    at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

I tried removing the complex type from the row, so that it becomes Row[String, String], and then there is no error. So I assume the problem lies with the complex type.

What am I doing wrong? Or what other approach could I take to generate a DataFrame with complex types?

1 answer:

Answer 0 (score: 1)

I would just use a simple case class for this instead of a plain class. The name column does not conform to the defined schema. Convert the name column into a Row and it should work:

val rowRDD : RDD[Row] = objectRDD.map(
    hbaseRecord => {
      val uuid : String = hbaseRecord.uuid
      val timestamp : String = hbaseRecord.timestamp
      val name = Row(hbaseRecord.name.firstName,
                     hbaseRecord.name.middleName,hbaseRecord.name.lastName)
      Row(uuid, timestamp, name)
    })
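
As a sketch of the case-class alternative mentioned above: with case classes, Spark can derive the nested schema by reflection, so no manual StructType or Row mapping is needed. The names NameRecord and FlatRecord below are hypothetical, and sqlContext is assumed to be the same SQLContext as in the question:

//Minimal sketch, assuming Spark 1.x with sqlContext in scope;
//NameRecord and FlatRecord are hypothetical names
case class NameRecord(firstName: String, middleName: String, lastName: String)
case class FlatRecord(uuid: String, timestamp: String, name: NameRecord)

import sqlContext.implicits._

//Spark infers the nested struct schema from the case classes
val dataFrame = objectRDD.map(r =>
    FlatRecord(r.uuid, r.timestamp,
      NameRecord(r.name.firstName, r.name.middleName, r.name.lastName))
  ).toDF()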