我有以下从csv中读取的架构:
val PersonSchema = StructType(Array(StructField("PersonID",StringType,true), StructField("Name",StringType,true)))
val AddressSchema = StructType(Array(StructField("PersonID",StringType,true), StructField("StreetNumber",StringType,true), StructField("StreetName",StringType,true)))
一个人可以拥有多个地址,并通过PersonID相关联。
有人可以帮助将记录转换为PersonAddress记录,如以下案例类定义吗?
case class Address(StreetNumber:String, StreetName:String)
case class PersonAddress(PersonID:String, Name:String, Addresses:Array[Address])
我尝试了以下内容,但它在最后一步中给出了异常:
val results = personData.join(addressData, Seq("PersonID"), "left_outer").groupBy("PersonID","Name").agg(collect_list(struct("StreetNumber","StreetName")) as "Addresses")
val personAddresses = results .map(data => PersonAddress(data.getAs("PersonID"),data.getAs("Name"),data.getAs("Addresses")))
personAddresses.show
给出错误:
java.lang.ClassCastException:scala.collection.mutable.WrappedArray $ ofRef不能转换为$ line26。$ read $$ iw $$ iw $ Address
答案 0 :(得分:0)
这种最简单的解决方案是使用UDF
。首先,将街道号码和名称收集为两个单独的列表,然后使用UDF
将所有内容转换为PersonAddress
的数据框。
val convertToCase = udf((id: String, name: String, streetName: Seq[String], streetNumber: Seq[String]) => {
val addresses = streetNumber.zip(streetName)
PersonAddress(id, name, addresses.map(t => Address(t._1, t._2)).toArray)
})
val results = personData.join(addressData, Seq("PersonID"), "left_outer")
.groupBy("PersonID","Name")
.agg(collect_list($"StreetNumber").as("StreetNumbers"),
collect_list($"StreetName").as("StreetNames"))
val personAddresses = results.select(convertToCase($"PersonID", $"Name", $"StreetNumbers", $"StreetNames").as("Person"))
这将为您提供如下架构。
root
|-- Person: struct (nullable = true)
| |-- PersonID: string (nullable = true)
| |-- Name: string (nullable = true)
| |-- Addresses: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- StreetNumber: string (nullable = true)
| | | |-- StreetName: string (nullable = true)