Question

I am trying to write a UDF in Scala and use it from my PySpark job. My DataFrame schema is:

root
|-- vehicle_id: string
|-- driver_id: string
|-- StartDtLocal: timestamp
|-- EndDtLocal: timestamp
|-- trips: array
|    |-- element: struct
|    |    |-- week_start_dt_local: timestamp
|    |    |-- week_end_dt_local: timestamp
|    |    |-- start_dt_local: timestamp
|    |    |-- end_dt_local: timestamp
|    |    |-- StartDtLocal: timestamp
|    |    |-- EndDtLocal: timestamp
|    |    |-- vehicle_id: string
|    |    |-- duration_sec: float
|    |    |-- distance_km: float
|    |    |-- speed_distance_ratio: float
|    |    |-- speed_duration_ratio: float
|    |    |-- speed_event_distance_km: float
|    |    |-- speed_event_duration_sec: float
|-- trip_details: array
|    |-- element: struct
|    |    |-- event_start_dt_local: timestamp
|    |    |-- force: float
|    |    |-- speed: float
|    |    |-- sec_from_start: float
|    |    |-- sec_from_end: float
|    |    |-- StartDtLocal: timestamp
|    |    |-- EndDtLocal: timestamp
|    |    |-- vehicle_id: string
|    |    |-- trip_duration_sec: float

This is the UDF I am trying to write:

def calculateVariables(row: Row):HashMap[String, Float] = {
    case class myRow(week_start_dt_local: Timestamp, week_end_dt_local: Timestamp, start_dt_local: Timestamp, end_dt_local :Timestamp, StartDtLocal:Timestamp,EndDtLocal:Timestamp,vehicle_id:String,duration_sec:Int,distance_km:Int,speed_distance_ratio:Float,speed_duration_ratio:Float,speed_event_distance_km:Float,speed_event_duration_sec:Float)

val trips = row.getAs[WrappedArray[myRow]](4)

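Inside this function I am trying to cast the rows to the case class, but it does not work. I get this error:

java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to VariableCalculation.VariableCalculation$myRow$3

Can anyone help me solve this?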

Answer 1

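The problem is that getAs on a Row only performs a type cast; it does not convert anything. The element type of trips is actually Row.

So row.getAs[WrappedArray[Row]]("trips") will work. You can then map over it and build a myRow from each Row.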
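The mapping step could look roughly like the sketch below. The Trip case class, the TripConversion object and its method names are hypothetical, and only a few of the trips fields are shown. The sketch assumes the rows carry a schema (as they do when they come from a DataFrame) and reads the array as scala.collection.Seq[Row] rather than WrappedArray[Row] so it is not tied to one Scala version.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.Row

// Hypothetical case class with a subset of the trips fields. It is declared at
// top level, and duration_sec / distance_km are Float to match the schema.
case class Trip(
    start_dt_local: Timestamp,
    vehicle_id: String,
    duration_sec: Float,
    distance_km: Float)

object TripConversion {
  // Build one Trip from one struct element, reading each field by name.
  // Lookup by name needs a Row that carries a schema.
  def toTrip(r: Row): Trip = Trip(
    r.getAs[Timestamp]("start_dt_local"),
    r.getAs[String]("vehicle_id"),
    r.getAs[Float]("duration_sec"),
    r.getAs[Float]("distance_km"))

  // The array elements arrive as Row, so read them as Row and convert explicitly.
  def tripsOf(row: Row): scala.collection.Seq[Trip] =
    row.getAs[scala.collection.Seq[Row]]("trips").map(toTrip)
}
```

Inside the UDF, something like TripConversion.tripsOf(row) would then return typed Trip values instead of failing with the ClassCastException.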

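You could probably do this automatically somehow with Spark's Encoders, but they are meant more for whole Datasets.

Have you considered defining case classes for the whole schema and then just calling dataframe.as[MyCaseClass]? That way you get properly typed access to the whole nested structure.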
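A minimal sketch of that approach, assuming the case classes mirror the schema from the question field by field. The names TripRecord, TripDetail, VehicleRecord and TypedAccess are hypothetical. Plain Float fields assume the columns contain no nulls; Option[Float] would be the safer choice otherwise.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical case classes mirroring the schema; they must be top-level
// so that Spark can derive encoders for them.
case class TripRecord(
    week_start_dt_local: Timestamp, week_end_dt_local: Timestamp,
    start_dt_local: Timestamp, end_dt_local: Timestamp,
    StartDtLocal: Timestamp, EndDtLocal: Timestamp,
    vehicle_id: String, duration_sec: Float, distance_km: Float,
    speed_distance_ratio: Float, speed_duration_ratio: Float,
    speed_event_distance_km: Float, speed_event_duration_sec: Float)

case class TripDetail(
    event_start_dt_local: Timestamp, force: Float, speed: Float,
    sec_from_start: Float, sec_from_end: Float,
    StartDtLocal: Timestamp, EndDtLocal: Timestamp,
    vehicle_id: String, trip_duration_sec: Float)

case class VehicleRecord(
    vehicle_id: String, driver_id: String,
    StartDtLocal: Timestamp, EndDtLocal: Timestamp,
    trips: Seq[TripRecord], trip_details: Seq[TripDetail])

object TypedAccess {
  // Convert the untyped DataFrame into a typed Dataset; columns are matched by name.
  def typed(df: DataFrame)(spark: SparkSession): Dataset[VehicleRecord] = {
    import spark.implicits._
    df.as[VehicleRecord]
  }
}
```

With the typed Dataset, nested values are ordinary Scala fields, for example ds.map(v => v.trips.map(_.distance_km).sum), given import spark.implicits._ in scope.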
