Spark: aggregate rows with a custom function

Date: 2018-09-28 13:44:04

Tags: apache-spark

For simplicity, suppose we have a DataFrame with the following data:

+----------+---------+----------+----------+
|firstName |lastName |Phone     |Address   |
+----------+---------+----------+----------+
|firstName1|lastName1|info1     |info2     |
|firstName1|lastName1|myInfo1   |dummyInfo2|
|firstName1|lastName1|dummyInfo1|myInfo2   |
+----------+---------+----------+----------+

How can I merge all the rows grouped by (firstName, lastName), keeping in the Phone and Address columns only the values that start with "my", so that I get the following:

+----------+---------+----------+----------+
|firstName |lastName |Phone     |Address   |
+----------+---------+----------+----------+
|firstName1|lastName1|myInfo1   |myInfo2   |
+----------+---------+----------+----------+

Perhaps I should use the agg function with a custom UDAF? But how would I implement that?

Note: I am using Spark 2.2 and Scala 2.11.

Thank you for your time.

2 answers:

Answer 0 (score: 1)

If only the two columns are involved, filtering and a join can be used instead of a UDF:

// toDF and the $ column syntax need the session implicits outside spark-shell
// (assuming the SparkSession is named spark)
import spark.implicits._

val df = List(
  ("firstName1", "lastName1", "info1", "info2"),
  ("firstName1", "lastName1", "myInfo1", "dummyInfo2"),
  ("firstName1", "lastName1", "dummyInfo1", "myInfo2")
).toDF("firstName", "lastName", "Phone", "Address")

// keep only the rows whose Phone / Address value starts with "my"
val myPhonesDF = df.filter($"Phone".startsWith("my"))
val myAddressDF = df.filter($"Address".startsWith("my"))

// join the two filtered frames back together on the grouping keys
val result = myPhonesDF.alias("Phones").join(myAddressDF.alias("Addresses"), Seq("firstName", "lastName"))
    .select("firstName", "lastName", "Phones.Phone", "Addresses.Address")
result.show(false)

Output:

+----------+---------+-------+-------+
|firstName |lastName |Phone  |Address|
+----------+---------+-------+-------+
|firstName1|lastName1|myInfo1|myInfo2|
+----------+---------+-------+-------+

For many columns, when only one row per group is expected, a construct like the following can be used:

  // min, when and col come from the SQL functions object
  import org.apache.spark.sql.functions.{col, min, when}

  val columnsForSearch = List("Phone", "Address")
  // for each searched column, keep only values starting with "my" and take the min per group
  val minExpressions = columnsForSearch.map(c => min(when(col(c).startsWith("my"), col(c)).otherwise(null)).alias(c))
  df.groupBy("firstName", "lastName").agg(minExpressions.head, minExpressions.tail: _*)

The output is the same.

Example of a UDF with two parameters:

Answer 1 (score: 1)

You can use groupBy with the collect_set aggregation function and a udf function that picks the first string starting with "my":

import org.apache.spark.sql.functions._

// for each group, pick the first collected value that starts with "my"
def myudf = udf((array: Seq[String]) => array.filter(_.startsWith("my")).head)

df.groupBy("firstName", "lastName")
  .agg(myudf(collect_set("Phone")).as("Phone"), myudf(collect_set("Address")).as("Address"))
  .show(false)

which should give you

+----------+---------+-------+-------+
|firstName |lastName |Phone  |Address|
+----------+---------+-------+-------+
|firstName1|lastName1|myInfo1|myInfo2|
+----------+---------+-------+-------+
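
Note that the udf above will fail with an exception if a group has no value starting with "my"; a slightly safer variant (a sketch, not part of the original answer) returns null for such groups instead:

def myudf = udf((array: Seq[String]) => array.find(_.startsWith("my")).orNull)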

I hope the answer is helpful.