In my Scala program, I have a dataframe with the following schema:
root
|-- FIRST_NAME: string (nullable = true)
|-- LAST_NAME: string (nullable = true)
|-- SEGMENT_EMAIL: array (nullable = true)
| |-- element: string (containsNull = true)
|-- SEGMENT_ADDRESS_STATE: array (nullable = true)
| |-- element: string (containsNull = true)
|-- SEGMENT_ADDRESS_POSTAL_CODE: array (nullable = true)
| |-- element: string (containsNull = true)
Some sample values:
|FIRST_NAME |LAST_NAME |CONFIRMATION_NUMBER| SEGMENT_EMAIL|SEGMENT_ADDRESS_STATE|SEGMENT_ADDRESS_POSTAL_CODE|
+----------------+---------------+-------------------+--------------------+---------------------+---------------------------+
| Stine| Rocha| [48978451]|[Xavier.Vich@gmail..| [MA]| [01545-1300]|
| Aurora| Markusson| [26341542]| []| [AR]| [72716]|
| Stine| Rocha| [29828771]|[Xavier.Vich@gmail..| [OH]| [45101-9613]|
| Aubrey| Fagerland| [24572991]|[Aubrey.Fagerland...| []| []|
How can I group similar records based on "first name + last name + email" when the column values are lists?
I would like output like this:
|FIRST_NAME |LAST_NAME |CONFIRMATION_NUMBER | SEGMENT_EMAIL|SEGMENT_ADDRESS_STATE|SEGMENT_ADDRESS_POSTAL_CODE|
+----------------+---------------+---------------------+--------------------+---------------------+---------------------------+
| Stine| Rocha| [48978451, 29828771]|[Xavier.Vich@gmail..| [MA, OH]| [01545-1300, 45101-9613]|
| Aurora| Markusson| [26341542]| []| [AR]| [72716]|
| Aubrey| Fagerland| [24572991]|[Aubrey.Fagerland...| []| []|
Thanks!
Answer 0 (score: 0):
This can be done by writing a user-defined function that merges multiple Seqs into a single Seq. Here is how to get the desired output:
Create the input dataframe: although the question's schema does not show the data type of the CONFIRMATION_NUMBER field, I have assumed it to be an integer.
import spark.implicits._

val df = Seq(
  ("Stine",  "Rocha",     Seq(48978451), Seq("Xavier.Vich@gmail"), Seq("MA"), Seq("01545-1300")),
  ("Aurora", "Markusson", Seq(26341542), Seq(),                    Seq("AR"), Seq("72716")),
  ("Stine",  "Rocha",     Seq(29828771), Seq("Xavier.Vich@gmail"), Seq("OH"), Seq("45101-9613")),
  ("Aubrey", "Fagerland", Seq(24572991), Seq("Aubrey.Fagerland"),  Seq(),     Seq())
).toDF("FIRST_NAME", "LAST_NAME", "CONFIRMATION_NUMBER", "SEGMENT_EMAIL",
       "SEGMENT_ADDRESS_STATE", "SEGMENT_ADDRESS_POSTAL_CODE")
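To sanity-check the setup, you can print the schema and the rows; this is just a verification step, not part of the transformation:

// Should show array-typed columns matching the question's schema,
// plus the assumed array<int> CONFIRMATION_NUMBER.
df.printSchema()
df.show(false)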
Aggregate the columns: now group by the name columns and apply an aggregation on the required columns, which yields a Seq of Seqs per group. Here is the code that does this:
import org.apache.spark.sql.functions.collect_list
val df1 = df.groupBy("FIRST_NAME", "LAST_NAME").
agg(collect_list("CONFIRMATION_NUMBER").as("cnlist"),
collect_list("SEGMENT_EMAIL").as("selist"),
collect_list("SEGMENT_ADDRESS_STATE").as("saslist"),
collect_list("SEGMENT_ADDRESS_POSTAL_CODE").as("sapclist"))
Here is the output of df1:
+----------+---------+------------------------+------------------------------------------+------------+----------------------------+
|FIRST_NAME|LAST_NAME|cnlist |selist |saslist |sapclist |
+----------+---------+------------------------+------------------------------------------+------------+----------------------------+
|Stine |Rocha |[[48978451], [29828771]]|[[Xavier.Vich@gmail], [Xavier.Vich@gmail]]|[[MA], [OH]]|[[01545-1300], [45101-9613]]|
|Aurora |Markusson|[[26341542]] |[[]] |[[AR]] |[[72716]] |
|Aubrey |Fagerland|[[24572991]] |[[Aubrey.Fagerland]] |[[]] |[[]] |
+----------+---------+------------------------+------------------------------------------+------------+----------------------------+
Apply a udf: now apply a user-defined function (udf) to merge each array of arrays into a single array. I have written two udfs, one for the integer and one for the string data type.
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

// Merge a Seq of Seqs into a single Seq; one udf per element type.
val concat_nested_string_seq: UserDefinedFunction =
  udf((seq_values: Seq[Seq[String]]) => seq_values.flatten)

val concat_nested_integer_seq: UserDefinedFunction =
  udf((seq_values: Seq[Seq[Integer]]) => seq_values.flatten)
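As a side note, if you are on Spark 2.4 or later you do not need these udfs at all: the built-in flatten function merges an array of arrays natively, for any element type. A minimal sketch under that assumption (output_df_builtin is just an illustrative name):

import org.apache.spark.sql.functions.flatten

// Spark 2.4+ only: flatten turns array<array<T>> into array<T>.
val output_df_builtin = df1.
  withColumn("CONFIRMATION_NUMBER", flatten($"cnlist")).
  withColumn("SEGMENT_EMAIL", flatten($"selist")).
  withColumn("SEGMENT_ADDRESS_STATE", flatten($"saslist")).
  withColumn("SEGMENT_ADDRESS_POSTAL_CODE", flatten($"sapclist")).
  drop("cnlist", "selist", "saslist", "sapclist")

Otherwise, apply the udfs to df1 and drop the intermediate columns: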
val output_df = df1.withColumn("CONFIRMATION_NUMBER", concat_nested_integer_seq($"cnlist")).
withColumn("SEGMENT_EMAIL", concat_nested_string_seq($"selist")).
withColumn("SEGMENT_ADDRESS_STATE", concat_nested_string_seq($"saslist")).
withColumn("SEGMENT_ADDRESS_POSTAL_CODE", concat_nested_string_seq($"sapclist")).
drop("cnlist", "selist", "saslist", "sapclist")
The output_df dataframe shows the desired output. This could also be solved by first flattening the array-typed columns (exploding them to one value per row) and then aggregating, but that is likely to be a more expensive operation.
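For completeness, a minimal sketch of that explode-based alternative, assuming explode_outer is available (Spark 2.2+); the short column names cn, se, sas and sapc are just illustrative:

import org.apache.spark.sql.functions.{collect_list, explode_outer}

// Explode each array column to one value per row. explode_outer keeps
// rows whose arrays are empty (they become nulls, which collect_list
// later ignores, so empty arrays round-trip to empty arrays).
// Caveat: consecutive explodes take a cross product of the arrays,
// which is why this can be expensive; with the single-element arrays
// in this sample it is harmless.
val exploded = df.
  withColumn("cn",   explode_outer($"CONFIRMATION_NUMBER")).
  withColumn("se",   explode_outer($"SEGMENT_EMAIL")).
  withColumn("sas",  explode_outer($"SEGMENT_ADDRESS_STATE")).
  withColumn("sapc", explode_outer($"SEGMENT_ADDRESS_POSTAL_CODE"))

// Group back and re-collect the scalar values into single-level arrays.
val alt_output_df = exploded.groupBy("FIRST_NAME", "LAST_NAME").
  agg(collect_list("cn").as("CONFIRMATION_NUMBER"),
      collect_list("se").as("SEGMENT_EMAIL"),
      collect_list("sas").as("SEGMENT_ADDRESS_STATE"),
      collect_list("sapc").as("SEGMENT_ADDRESS_POSTAL_CODE"))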