I'm trying to add an empty field to an embedded array[struct] column, so that I can transform similar complex columns:
case class Additional(id: String, item_value: String)
case class Element(income:String,currency:String,additional: Additional)
case class Additional2(id: String, item_value: String, extra2: String)
case class Element2(income:String,currency:String,additional: Additional2)
val my_uDF = fx.udf((data: Seq[Element]) => {
  data.map(x => new Element2(x.income, x.currency, new Additional2(x.additional.id, x.additional.item_value, null))).seq
})
sparkSession.sqlContext.udf.register("transformElements",my_uDF)
val result=sparkSession.sqlContext.sql("select transformElements(myElements),line_number,country,idate from entity where line_number='1'")
The goal is to add an extra field named extra2 to Element.Additional. I map this field in with a UDF, but it fails with the following error:
org.apache.spark.SparkException: Failed to execute user defined function(anonfun$1: (array<struct<income:string,currency:string,additional:struct<id:string,item_value:string>>>) => array<struct<income:string,currency:string,additional:struct<id:string,item_value:string,extra2:string>>>)
If I print the schema of the elements field, it shows:
|-- myElements: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- income: string (nullable = true)
| | |-- currency: string (nullable = true)
| | |-- additional: struct (nullable = true)
| | | |-- id: string (nullable = true)
| | | |-- item_value: string (nullable = true)
I'm trying to convert it to the following schema:
|-- myElements: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- income: string (nullable = true)
| | |-- currency: string (nullable = true)
| | |-- additional: struct (nullable = true)
| | | |-- id: string (nullable = true)
| | | |-- item_value: string (nullable = true)
| | | |-- extra2: string (nullable = true)
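Answer 0 (score: 2)
It would be easier to simply perform the necessary transformation on the nested Row elements of the DataFrame with map, and then rename the column via toDF: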
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import spark.implicits._
case class Additional(id: String, item_value: String)
case class Element(income: String, currency: String, additional: Additional)
case class Additional2(id: String, item_value: String, extra2: String)
case class Element2(income: String, currency: String, additional: Additional2)
val df = Seq(
  (Seq(Element("70k", "US", Additional("1", "101")), Element("90k", "US", Additional("2", "202")))),
  (Seq(Element("80k", "US", Additional("3", "303"))))
).toDF("myElements")
val df2 = df.map { case Row(s: Seq[Row] @unchecked) =>
  s.map {
    case Row(income: String, currency: String, additional: Row) =>
      additional match {
        case Row(id: String, item_value: String) =>
          Element2(income, currency, Additional2(id, item_value, null))
      }
  }
}.toDF("myElements")
df2.show(false)
// +--------------------------------------------+
// |myElements |
// +--------------------------------------------+
// |[[70k, US, [1, 101,]], [90k, US, [2, 202,]]]|
// |[[80k, US, [3, 303,]]] |
// +--------------------------------------------+
df2.printSchema
// root
// |-- myElements: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- income: string (nullable = true)
// | | |-- currency: string (nullable = true)
// | | |-- additional: struct (nullable = true)
// | | | |-- id: string (nullable = true)
// | | | |-- item_value: string (nullable = true)
// | | | |-- extra2: string (nullable = true)
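One caveat with the pattern match above: a typed pattern such as `income: String` does not match `null`, so a row with a null field (the schema marks every field as nullable) would raise a `scala.MatchError`. Below is a minimal null-tolerant sketch for spark-shell, assuming the field order shown in the schema. The helper name `toElement2` is hypothetical and not part of the original answer.

```scala
import org.apache.spark.sql.Row

case class Additional2(id: String, item_value: String, extra2: String)
case class Element2(income: String, currency: String, additional: Additional2)

// Hypothetical helper: converts one nested Row to Element2 by position,
// so null fields (or a null `additional` struct) do not cause a MatchError.
def toElement2(r: Row): Element2 = {
  val add = r.getAs[Row](2) // may be null
  val add2 =
    if (add == null) null
    else Additional2(add.getAs[String](0), add.getAs[String](1), null)
  Element2(r.getAs[String](0), r.getAs[String](1), add2)
}
```

It could then replace the inner pattern match, for example `df.map { case Row(s: Seq[Row] @unchecked) => s.map(toElement2) }`. A null array would still need separate handling.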
If a UDF is preferred for some reason, the required transformation is essentially the same:
val myUDF = udf((s: Seq[Row]) => s.map {
  case Row(income: String, currency: String, additional: Row) =>
    additional match {
      case Row(id: String, item_value: String) =>
        Element2(income, currency, Additional2(id, item_value, null))
    }
})
val df2 = df.select(myUDF($"myElements").as("myElements"))
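Since the question calls the function from SQL, a Row-based UDF like this can also be registered by name. The sketch below assumes a local `SparkSession` and a temp view named `entity` as in the question. The sample row and the value name `transformElements` are made up for illustration.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.udf

case class Additional(id: String, item_value: String)
case class Element(income: String, currency: String, additional: Additional)
case class Additional2(id: String, item_value: String, extra2: String)
case class Element2(income: String, currency: String, additional: Additional2)

// Assumption: a local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
import spark.implicits._

// Struct values reach a Scala UDF as Row, so the parameter is Seq[Row],
// not Seq[Element] as in the question's failing UDF.
val transformElements = udf((s: Seq[Row]) => s.map {
  case Row(income: String, currency: String, additional: Row) =>
    additional match {
      case Row(id: String, item_value: String) =>
        Element2(income, currency, Additional2(id, item_value, null))
    }
})

// Make the UDF callable from SQL under the name used in the question.
spark.udf.register("transformElements", transformElements)

// Illustrative data standing in for the question's `entity` table.
Seq(Seq(Element("70k", "US", Additional("1", "101"))))
  .toDF("myElements")
  .createOrReplaceTempView("entity")

val result = spark.sql("select transformElements(myElements) as myElements from entity")
result.printSchema()
```

The other columns from the question's query (line_number, country, idate) are left out here only because the sample view does not contain them.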
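Answer 1 (score: 2)
Here is another approach that uses a Dataset instead of a DataFrame, which gives direct access to the objects rather than going through Row. It also adds a method named asElement2 that converts an Element into an Element2.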
case class Additional2(id: String, item_value: String, extra2: String)
case class Element2(income: String, currency: String, additional2: Additional2)
case class Additional(id: String, item_value: String)
case class Element(income: String, currency: String, additional: Additional) {
  def asElement2(): Element2 = {
    val additional2 = Additional2(additional.id, additional.item_value, null)
    Element2(income, currency, additional2)
  }
}
val df = Seq(
(Seq(Element("150000", "EUR", Additional("001", "500EUR")))),
(Seq(Element("50000", "CHF", Additional("002", "1000CHF"))))
).toDS()
df.map {
  se => se.map { _.asElement2 }
}
// or even simpler
df.map { _.map { _.asElement2 } }
Output:
+-------------------------------+
|value |
+-------------------------------+
|[[150000, EUR, [001, 500EUR,]]]|
|[[50000, CHF, [002, 1000CHF,]]]|
+-------------------------------+
Final schema:
root
|-- value: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- income: string (nullable = true)
| | |-- currency: string (nullable = true)
| | |-- additional2: struct (nullable = true)
| | | |-- id: string (nullable = true)
| | | |-- item_value: string (nullable = true)
| | | |-- extra2: string (nullable = true)
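If the data starts out as a DataFrame, as in the question, rather than as a typed Dataset, one possible bridge is to convert the array column with `as[Seq[Element]]` before mapping. This is a sketch under assumptions: a local `SparkSession`, made-up sample data, and an `Element2` whose nested field is named `additional` (not `additional2` as above) so that the result matches the target schema from the question.

```scala
import org.apache.spark.sql.SparkSession

case class Additional2(id: String, item_value: String, extra2: String)
// Assumption: field named `additional` to match the question's target schema.
case class Element2(income: String, currency: String, additional: Additional2)
case class Additional(id: String, item_value: String)
case class Element(income: String, currency: String, additional: Additional) {
  def asElement2(): Element2 =
    Element2(income, currency, Additional2(additional.id, additional.item_value, null))
}

// Assumption: a local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
import spark.implicits._

// Stand-in for the untyped DataFrame from the question.
val untyped = Seq(Seq(Element("150000", "EUR", Additional("001", "500EUR")))).toDF("myElements")

// DataFrame -> Dataset[Seq[Element]], then convert and restore the column name.
val converted = untyped
  .select($"myElements")
  .as[Seq[Element]]
  .map(_.map(_.asElement2()))
  .toDF("myElements")

converted.printSchema()
```

Selecting only the array column keeps the sketch short. Carrying other columns along would need a wrapper case class or a tuple type instead of a bare `Seq[Element]`.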