I have an RDD of the following type:

org.apache.spark.rdd.RDD[((String, String), (Array[Byte], Boolean))]

which I wrote out to Parquet:
val myDf = spark.createDataFrame(myRdd).toDF("myId", "myVal")
myDf.write.parquet("./myParquetDir")
Now I want to read it back. The schema of the written file looks like this:
Parquet form:
message spark_schema {
  optional group myId {
    optional binary _1 (UTF8);
    optional binary _2 (UTF8);
  }
  optional group myVal {
    optional binary _1;
    required boolean _2;
  }
}
Catalyst form:
StructType(StructField(myId,StructType(StructField(_1,StringType,true), StructField(_2,StringType,true)),true), StructField(myVal,StructType(StructField(_1,BinaryType,true), StructField(_2,BooleanType,true)),true))
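For context on the `_1`/`_2` names in the nested groups: a Scala tuple is a `Product` whose fields are named `_1`, `_2`, …, and Spark's encoders reuse those names, which is why they appear in both the Parquet and Catalyst forms. A quick check (Scala 2.13+, no Spark needed):

```scala
object TupleFieldNames extends App {
  // A Scala tuple exposes its fields as _1, _2, ...
  // Spark derives the nested struct column names from these,
  // which matches the Parquet groups shown above.
  val pair: (String, String) = ("hello", "world")
  println(pair.productElementName(0)) // prints "_1"
  println(pair.productElementName(1)) // prints "_2"
}
```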
So I created:

case class MySchema(
  myId: (String, String),
  myVal: (Array[Byte], Boolean)
)
val myParquetFileDf = spark.read.parquet("./myParquetDir")

val myParseDf = myParquetFileDf.as[MySchema].map { row =>
  println(row)
  row
}

myParseDf.show(1, false)
But I get this error:
org.apache.spark.sql.AnalysisException: Try to map struct<myId:struct<_1:string,_2:string>,myVal:struct<_1:binary,_2:boolean>> to Tuple1, but failed as the number of fields does not line up.;
I don't understand this. Shouldn't the DataFrame schema map exactly onto my RDD type?
Can someone point me in the right direction?