Question

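I have the following schema, and I want to add a new column named distance. This column holds, for each row, the distance between the two time series time_series1 and time_series2: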

|-- websites: struct (nullable = true)
|    |-- _1: integer (nullable = false)
|    |-- _2: integer (nullable = false)
|-- countryId1: integer (nullable = false)
|-- countryId2: integer (nullable = false)
|-- time_series1: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- _1: float (nullable = false)
|    |    |-- _2: date (nullable = true)
|-- time_series2: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- _1: float (nullable = false)
|    |    |-- _2: date (nullable = true)

So I define this new column with a UDF:

val step2= step1
  .withColumn("distance",  distanceUDF(col("time_series1"),col("time_series2")))
  .select("websites","countryId1","countryId2","time_series1","time_series2","distance")

And the UDF:

 val distanceUDF  = udf( (ts1:Seq[(Float,_)], ts2:Seq[(Float,_)])=>
                            compute_distance( ts1.map(_._1) , ts2.map(_._1)))

But I have a problem with the mapping: I don't know how to map array(struct(float, date)) to a Scala type.

Is Seq[(Float, Date)] the equivalent of array(struct(float, date))? I get the following exception:

java.lang.ClassCastException: .GenericRowWithSchema cannot be cast to scala.Tuple2

My question is different from the one raised in "Spark Sql UDF with complex input parameter". I have an ordered time series with dates (an array of structs, not just a single struct).

Answer 1

The link you added does answer your question:

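Struct types are converted to o.a.s.sql.Row (org.apache.spark.sql.Row), not to Scala tuples, which is why the cast to scala.Tuple2 fails.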

So your function should take two Seq[Row] parameters. You can then use the Row API to get the floats.
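A minimal sketch of that change, under a few assumptions: the body of compute_distance is not shown in the question, so the version below (sum of absolute differences) is only a hypothetical stand-in, and step1 is replaced by a toy DataFrame with the same array-of-struct columns. The point is the UDF signature and the getFloat(0) call.

```scala
import java.sql.Date
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder().master("local[*]").appName("distance-udf").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the asker's compute_distance (not shown in the question):
// sum of absolute differences over the paired values.
def compute_distance(a: Seq[Float], b: Seq[Float]): Float =
  a.zip(b).map { case (x, y) => math.abs(x - y) }.sum

// Toy stand-in for step1 with the same array<struct<_1: float, _2: date>> columns.
val d = Date.valueOf("2020-01-01")
val step1 = Seq(
  (Seq((1.0f, d), (3.0f, d)), Seq((2.5f, d), (1.0f, d)))
).toDF("time_series1", "time_series2")

// Each array element reaches the UDF as a Row, so field 0 (the float)
// is read through the Row API instead of tuple accessors.
val distanceUDF = udf { (ts1: Seq[Row], ts2: Seq[Row]) =>
  compute_distance(ts1.map(_.getFloat(0)), ts2.map(_.getFloat(0)))
}

val step2 = step1.withColumn("distance", distanceUDF(col("time_series1"), col("time_series2")))
```

Note that both array columns are nullable in the schema, so a real UDF would probably need to guard against null inputs; this sketch assumes non-null arrays.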

In this case you may want to use Datasets. For more on nested types, you can watch The Joy of Nested Types.
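A rough sketch of what the Dataset route might look like, assuming a hypothetical case class Record that mirrors the schema in the question and the same hypothetical compute_distance stand-in (sum of absolute differences). With a typed Dataset the encoder decodes each struct into a tuple, so Seq[(Float, Date)] works and no Row handling is needed.

```scala
import java.sql.Date
import org.apache.spark.sql.SparkSession

// Hypothetical record type mirroring the schema in the question.
// In compiled code it must be defined at top level, outside any method.
case class Record(
  websites: (Int, Int),
  countryId1: Int,
  countryId2: Int,
  time_series1: Seq[(Float, Date)],
  time_series2: Seq[(Float, Date)]
)

val spark = SparkSession.builder().master("local[*]").appName("distance-ds").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for compute_distance (not shown in the question).
def compute_distance(a: Seq[Float], b: Seq[Float]): Float =
  a.zip(b).map { case (x, y) => math.abs(x - y) }.sum

// Toy data; with the asker's DataFrame this would be step1.as[Record].
val d = Date.valueOf("2020-01-01")
val ds = Seq(
  Record((1, 2), 10, 20, Seq((1.0f, d), (3.0f, d)), Seq((2.5f, d), (1.0f, d)))
).toDS()

// Plain Scala map over typed records: the structs are already tuples here.
val withDistance = ds.map { r =>
  (r.websites, r.countryId1, r.countryId2,
    compute_distance(r.time_series1.map(_._1), r.time_series2.map(_._1)))
}.toDF("websites", "countryId1", "countryId2", "distance")
```

The trade-off is that the typed map deserializes whole records, whereas the UDF only touches the two array columns.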
