When a Row contains a Map[Map]

Date: 2019-03-19 06:59:21

Tags: scala apache-spark

  

This question is a follow-up to other one; the user who gave the valid answer there asked me to open a new question to explain my further doubts.

I am trying to generate a DataFrame from an RDD[Objects], where my objects have both primitive and complex types. In the previous question it was already explained how to parse a complex Map type.

What I tried next was to extrapolate that solution to parse a Map[Map], which therefore becomes an Array(Map) in the DataFrame.

Below is the code I have written so far:

//I get an Object from Hbase here
val objectRDD : RDD[HbaseRecord] = ... 

//I convert the RDD[HbaseRecord] into RDD[Row]
val rowRDD : RDD[Row] = objectRDD.map(
    hbaseRecord => {

        val uuid : String = hbaseRecord.uuid
        val timestamp : String = hbaseRecord.timestamp

        val name = Row(hbaseRecord.nameMap.firstName.getOrElse(""),
            hbaseRecord.nameMap.middleName.getOrElse(""),
            hbaseRecord.nameMap.lastName.getOrElse(""))

        val contactsMap = hbaseRecord.contactsMap 

        val homeContactMap = contactsMap.get("HOME")
        val homeContact = Row(homeContactMap.contactType,
            homeContactMap.areaCode,
            homeContactMap.number)

        val workContactMap = contactsMap.get("WORK")
        val workContact = Row(workContactMap.contactType,
            workContactMap.areaCode,
            workContactMap.number)

        val contacts = Row(homeContact,workContact)

        Row(uuid, timestamp, name, contacts)

    }
)


//Here I define the schema
val schema = new StructType()
                 .add("uuid", StringType)
                 .add("timestamp", StringType)
                 .add("name", new StructType()
                         .add("firstName", StringType)
                         .add("middleName", StringType)
                         .add("lastName", StringType))
                 .add("contacts", new StructType(
                                Array(
                                StructField("contactType", StringType),
                                StructField("areaCode", StringType),
                                StructField("number", StringType)
                 )))


//Now I try to create a Dataframe using the RDD[Row] and the schema
val dataFrame = sqlContext.createDataFrame(rowRDD , schema)

But I am getting the following error:

  

19/03/18 12:09:53 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 (TID 8)
scala.MatchError: [HOME,05,12345678] (of class org.apache.spark.sql.catalyst.expressions.GenericRow)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:295)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:294)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
    at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)
    at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

I have also tried generating the contacts element as an Array:

val contacts = Array(homeContact,workContact)

But then I get the following error instead:

  

scala.MatchError: [Lorg.apache.spark.sql.Row;@726c6aec (of class [Lorg.apache.spark.sql.Row;)

Can anyone spot the problem?

1 Answer:

Answer 0 (score: 2):

Let's simplify things to just your contacts, since that is where the problem lies. You are trying to use the following schema:

val schema = new StructType()
                .add("contacts", new StructType(
                               Array(
                               StructField("contactType", StringType),
                               StructField("areaCode", StringType),
                               StructField("number", StringType)
                )))

to store a list of contacts, but it is a struct type. Such a schema cannot contain a list, only a single contact. We can verify this with:

spark.createDataFrame(sc.parallelize(Seq[Row]()), schema).printSchema
root
 |-- contacts: struct (nullable = true)
 |    |-- contactType: string (nullable = true)
 |    |-- areaCode: string (nullable = true)
 |    |-- number: string (nullable = true)

Indeed, the Array in your code only contains the fields of the "contacts" struct type.

To achieve what you want, there is a type for that: ArrayType. It yields a slightly different result:

val schema_ok = new StructType()
    .add("contacts", ArrayType(new StructType(Array(
        StructField("contactType", StringType),
        StructField("areaCode", StringType),
        StructField("number", StringType)))))

spark.createDataFrame(sc.parallelize(Seq[Row]()), schema_ok).printSchema
root
 |-- contacts: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- contactType: string (nullable = true)
 |    |    |-- areaCode: string (nullable = true)
 |    |    |-- number: string (nullable = true)

And it works:

val row = Row(Array(
                Row("type", "code", "number"), 
                Row("type2", "code2", "number2")))
spark.createDataFrame(sc.parallelize(Seq(row)), schema_ok).show(false)
+-------------------------------------------+
|contacts                                   |
+-------------------------------------------+
|[[type,code,number], [type2,code2,number2]]|
+-------------------------------------------+

So, if you update your schema with this version of "contacts", all you have to do is replace val contacts = Row(homeContact,workContact) with val contacts = Array(homeContact,workContact) and it should work.
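
Putting it together, here is a minimal sketch of what the corrected schema and row construction could look like (assuming the same field names used in the question):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Sketch only: same field names as in the question, with contacts as an ArrayType.
val fixedSchema = new StructType()
    .add("uuid", StringType)
    .add("timestamp", StringType)
    .add("name", new StructType()
        .add("firstName", StringType)
        .add("middleName", StringType)
        .add("lastName", StringType))
    .add("contacts", ArrayType(new StructType()
        .add("contactType", StringType)
        .add("areaCode", StringType)
        .add("number", StringType)))

// In the mapping over the RDD, contacts becomes an Array of Rows
// instead of a Row of Rows:
//     val contacts = Array(homeContact, workContact)
//     Row(uuid, timestamp, name, contacts)
val dataFrame = sqlContext.createDataFrame(rowRDD, fixedSchema)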

Note: if you want to keep the contacts labelled (with HOME or WORK), there is also a MapType type for that.
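
For illustration, a minimal sketch of that alternative (the schema and value names here are hypothetical, not taken from the original question):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Sketch only: contacts keyed by their label ("HOME"/"WORK") instead of stored as an array.
val contactStruct = new StructType()
    .add("contactType", StringType)
    .add("areaCode", StringType)
    .add("number", StringType)

val schemaWithMap = new StructType()
    .add("contacts", MapType(StringType, contactStruct))

// The corresponding Row then carries a Scala Map from label to contact Row.
val rowWithMap = Row(Map(
    "HOME" -> Row("type", "code", "number"),
    "WORK" -> Row("type2", "code2", "number2")))

spark.createDataFrame(sc.parallelize(Seq(rowWithMap)), schemaWithMap).show(false)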