How do I create ArrayData or InternalRow in SparkSQL, given a StructType as the schema?

Asked: 2018-12-04 12:27:04

Tags: scala apache-spark apache-spark-sql

While defining a UDT in SparkSQL, I wrote a UDT like this:

class trajUDT extends UserDefinedType[traj] {
  override def sqlType: DataType = StructType(Seq(
    StructField("id", DataTypes.StringType),
    StructField("loc", ArrayType(StructType(Seq(
      StructField("x", DataTypes.DoubleType),
      StructField("y", DataTypes.DoubleType)
    ))))
  ))
  ...
}

where traj is a class:

class traj(val id:UTF8String,val loc:Array[Tuple2[Double,Double]] )

I want to write a serialize function like this:

override def serialize(p: traj): GenericInternalRow = {
  new GenericInternalRow(Array[Any](p.id,p.loc.map(x=>Array(x._1,x._2))))
}

But it fails, with an error telling me that the value cannot be cast to ArrayData.

I also wrote a deserialize function like this:

override def deserialize(datum: Any): traj = {
  val arr=datum.asInstanceOf[InternalRow]
  val id = arr.getUTF8String(0)
  val xytype=StructType(Seq(
    StructField("x",DataTypes.DoubleType),
    StructField("y",DataTypes.DoubleType)
  ))
  val xy = arr.getArray(1)
  val xye =xy.toArray[Tuple2[Double,Double]](xytype)
  new traj(id,xye)
}

I guess this won't work either...

So can anyone show me how to do these two conversions?

1 Answer:

Answer 0 (score: 0)

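I ran into a similar problem when working with InternalRow.

Constructing an InternalRow with a plain Array or Seq as a field leads to a java.lang.ClassCastException when that field is read back with getArray: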

import org.apache.spark.sql.catalyst.InternalRow

val row = InternalRow(Array(1, 2, 3), 1L)
println(s"Row first element: ${row.getArray(0).toIntArray.toVector}")
println(s"Row second element: ${row.getLong(1)}")
java.lang.ClassCastException: [I cannot be cast to org.apache.spark.sql.catalyst.util.ArrayData
    at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getArray(rows.scala:48)
    at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195)

I solved this by passing an ArrayData field instead of an Array or Seq. I used the ArrayData.toArrayData method, as follows:

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.util.ArrayData

val row = InternalRow(ArrayData.toArrayData(Array(1, 2, 3)), 1L)
println(s"Row first element: ${row.getArray(0).toIntArray.toVector}")
println(s"Row second element: ${row.getLong(1)}")
Row first element: Vector(1, 2, 3)
Row second element: 1
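
The same idea can be carried over to the UDT in the question. The following is only a minimal sketch, not a tested answer for every Spark version: it assumes the Catalyst internal classes behave as in the Spark 2.x line shown in the stack trace, and TrajCodec is a hypothetical helper object introduced here for illustration (in a real UDT the two methods would be the serialize and deserialize overrides). Each (x, y) pair is stored as a two-field InternalRow to match the inner StructType, and the rows are wrapped in a GenericArrayData to match the ArrayType.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
import org.apache.spark.sql.catalyst.util.GenericArrayData
import org.apache.spark.unsafe.types.UTF8String

// Same shape as the class in the question.
class traj(val id: UTF8String, val loc: Array[(Double, Double)])

// Hypothetical helper; the method bodies are what a UDT would put in
// its serialize / deserialize overrides.
object TrajCodec {
  def serialize(p: traj): InternalRow = {
    // One InternalRow per point (struct<x: double, y: double>),
    // widened to Any so the result is an Array[Any].
    val points: Array[Any] = p.loc.map { case (x, y) => InternalRow(x, y): Any }
    // Wrap the rows in ArrayData instead of passing a raw Scala array.
    new GenericInternalRow(Array[Any](p.id, new GenericArrayData(points)))
  }

  def deserialize(datum: Any): traj = {
    val row = datum.asInstanceOf[InternalRow]
    val id = row.getUTF8String(0)
    val points = row.getArray(1)
    // Read each element back as a struct with two fields.
    val loc = Array.tabulate(points.numElements()) { i =>
      val point = points.getStruct(i, 2)
      (point.getDouble(0), point.getDouble(1))
    }
    new traj(id, loc)
  }
}
```

A round trip such as TrajCodec.deserialize(TrajCodec.serialize(t)) should then give back the same id and the same list of points. Note that these classes live in Spark's internal catalyst package, so their behaviour is not guaranteed to stay stable between releases.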