Denormalizing data in Spark Scala

Time: 2017-11-02 19:23:20

Tags: scala apache-spark

I have the following schemas for data read from CSV:

val PersonSchema = StructType(Array(StructField("PersonID",StringType,true), StructField("Name",StringType,true)))
val AddressSchema = StructType(Array(StructField("PersonID",StringType,true), StructField("StreetNumber",StringType,true), StructField("StreetName",StringType,true)))

A person can have multiple addresses, which are linked to the person through PersonID.

Can someone help me convert these records into PersonAddress records, as defined by the following case classes?

case class Address(StreetNumber:String, StreetName:String)
case class PersonAddress(PersonID:String, Name:String, Addresses:Array[Address])

I tried the following, but it throws an exception at the last step:

val results = personData.join(addressData, Seq("PersonID"), "left_outer").groupBy("PersonID","Name").agg(collect_list(struct("StreetNumber","StreetName")) as "Addresses")
val personAddresses = results.map(data => PersonAddress(data.getAs("PersonID"),data.getAs("Name"),data.getAs("Addresses")))
personAddresses.show

It gives this error:

  

java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to $line26.$read$$iw$$iw$Address

1 answer:

Answer 0 (score: 0)

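The simplest solution here is to use a UDF. First, collect the street numbers and street names into two separate lists, then use a UDF to convert everything into a dataframe of PersonAddress.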

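import org.apache.spark.sql.functions.{collect_list, udf}
// Parameter order must match the columns passed in the select:
// street numbers first, then street names.
val convertToCase = udf((id: String, name: String, streetNumbers: Seq[String], streetNames: Seq[String]) => {
  val addresses = streetNumbers.zip(streetNames)
  PersonAddress(id, name, addresses.map { case (number, street) => Address(number, street) }.toArray)
})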

val results = personData.join(addressData, Seq("PersonID"), "left_outer")
  .groupBy("PersonID","Name")
  .agg(collect_list($"StreetNumber").as("StreetNumbers"), 
       collect_list($"StreetName").as("StreetNames"))
val personAddresses = results.select(convertToCase($"PersonID", $"Name", $"StreetNumbers", $"StreetNames").as("Person"))
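
The pairing step the UDF performs can be sketched in plain Scala, without Spark. The names ZipSketch and pairAddresses are made up for this illustration. The point is that zip pairs elements by position, so the list passed first is treated as the street numbers and the list passed second as the street names.

```scala
// Plain-Scala sketch of the pairing done inside the UDF (no Spark needed).
case class Address(StreetNumber: String, StreetName: String)

object ZipSketch {
  // Pair the i-th street number with the i-th street name.
  // zip stops at the end of the shorter list, so extra elements are dropped.
  def pairAddresses(streetNumbers: Seq[String], streetNames: Seq[String]): Seq[Address] =
    streetNumbers.zip(streetNames).map { case (number, street) => Address(number, street) }
}
```

For example, ZipSketch.pairAddresses(Seq("10", "22"), Seq("Main St", "Oak Ave")) yields Address("10", "Main St") and Address("22", "Oak Ave"). One caveat worth checking against your data: collect_list skips null values, so if only one of the two columns is null for some address, the two lists can end up with different lengths and the pairs may no longer line up.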

The resulting personAddresses dataframe has a schema like the following:

root
 |-- Person: struct (nullable = true)
 |    |-- PersonID: string (nullable = true)
 |    |-- Name: string (nullable = true)
 |    |-- Addresses: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- StreetNumber: string (nullable = true)
 |    |    |    |-- StreetName: string (nullable = true)
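
As a possible alternative, not part of the answer above, the collect_list(struct(...)) aggregation from the question can be kept and each nested Row converted by hand. This is a minimal sketch, assuming Spark 2.x or later with a local SparkSession. The names RowMappingSketch and denormalize are hypothetical, and the case classes are repeated so that the block is self-contained.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
import org.apache.spark.sql.functions.{collect_list, struct}

// Top-level case classes so that Spark can derive encoders for them.
case class Address(StreetNumber: String, StreetName: String)
case class PersonAddress(PersonID: String, Name: String, Addresses: Array[Address])

object RowMappingSketch {
  // Hypothetical helper: joins, groups, and maps each Row into a PersonAddress.
  def denormalize(spark: SparkSession, personData: DataFrame, addressData: DataFrame): Dataset[PersonAddress] = {
    import spark.implicits._ // brings the encoder for PersonAddress into scope

    val results = personData.join(addressData, Seq("PersonID"), "left_outer")
      .groupBy("PersonID", "Name")
      .agg(collect_list(struct("StreetNumber", "StreetName")).as("Addresses"))

    results.map { row =>
      // The array column arrives as a sequence of Rows, not as Array[Address],
      // so each element is converted explicitly.
      val addresses = row.getAs[scala.collection.Seq[Row]]("Addresses")
        .map(a => Address(a.getAs[String]("StreetNumber"), a.getAs[String]("StreetName")))
      PersonAddress(row.getAs[String]("PersonID"), row.getAs[String]("Name"), addresses.toArray)
    }
  }
}
```

Calling RowMappingSketch.denormalize(spark, personData, addressData) should return a typed Dataset[PersonAddress]. As far as I understand, with a left outer join a person without any address still gets one struct whose fields are both null, so you may want to filter those out.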