For simplicity, let's assume we have a dataframe with the following data:
+----------+---------+----------+----------+
|firstName |lastName |Phone     |Address   |
+----------+---------+----------+----------+
|firstName1|lastName1|info1     |info2     |
|firstName1|lastName1|myInfo1   |dummyInfo2|
|firstName1|lastName1|dummyInfo1|myInfo2   |
+----------+---------+----------+----------+
How can I merge all rows grouped by (firstName, lastName), keeping only the Phone and Address values that start with "my", to get the following result:
+----------+---------+----------+----------+
|firstName |lastName |Phone     |Address   |
+----------+---------+----------+----------+
|firstName1|lastName1|myInfo1   |myInfo2   |
+----------+---------+----------+----------+
Perhaps I should use the agg function with a custom UDAF? But how would I implement it?
Note: I am using Spark 2.2 and Scala 2.11.
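For reference, this is the rough direction I was thinking of, as an untested sketch (the class name KeepMy and the hard-coded "my" prefix are just placeholders):

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// keeps the first value in the group that starts with "my"
class KeepMy extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", StringType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("kept", StringType) :: Nil)
  def dataType: DataType = StringType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = null

  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    val v = input.getString(0)
    if (buffer.getString(0) == null && v != null && v.startsWith("my")) buffer(0) = v
  }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    if (buffer1.getString(0) == null) buffer1(0) = buffer2.getString(0)

  def evaluate(buffer: Row): String = buffer.getString(0)
}

// intended usage:
// val keepMy = new KeepMy
// df.groupBy("firstName", "lastName").agg(keepMy($"Phone").as("Phone"), keepMy($"Address").as("Address"))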
Thank you for your time.
Answer 0 (score: 1)
If only two columns are involved, filtering and a join can be used instead of a UDF:
val df = List(
  ("firstName1", "lastName1", "info1", "info2"),
  ("firstName1", "lastName1", "myInfo1", "dummyInfo2"),
  ("firstName1", "lastName1", "dummyInfo1", "myInfo2")
).toDF("firstName", "lastName", "Phone", "Address") // toDF and the $ syntax need import spark.implicits._ outside the shell

val myPhonesDF = df.filter($"Phone".startsWith("my"))
val myAddressDF = df.filter($"Address".startsWith("my"))

val result = myPhonesDF.alias("Phones").join(myAddressDF.alias("Addresses"), Seq("firstName", "lastName"))
  .select("firstName", "lastName", "Phones.Phone", "Addresses.Address")
result.show(false)

Output:

+----------+---------+-------+-------+
|firstName |lastName |Phone  |Address|
+----------+---------+-------+-------+
|firstName1|lastName1|myInfo1|myInfo2|
+----------+---------+-------+-------+

For many columns, when only one row per group is expected, the following construct can be used (min ignores nulls, so for each group it returns the single value that the when filter lets through):

import org.apache.spark.sql.functions._

val columnsForSearch = List("Phone", "Address")
val minExpressions = columnsForSearch.map(c => min(when(col(c).startsWith("my"), col(c)).otherwise(null)).alias(c))
df.groupBy("firstName", "lastName").agg(minExpressions.head, minExpressions.tail: _*)

The output is the same.

An example of a UDF with two parameters:
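A minimal sketch, assuming the two arguments are the collected values and the prefix to match (the name firstWithPrefix is illustrative):

import org.apache.spark.sql.functions._

// two-argument UDF: the collected values and the prefix to keep
def firstWithPrefix = udf((values: Seq[String], prefix: String) =>
  values.filter(_.startsWith(prefix)).headOption.orNull)

df.groupBy("firstName", "lastName")
  .agg(
    firstWithPrefix(collect_set("Phone"), lit("my")).as("Phone"),
    firstWithPrefix(collect_set("Address"), lit("my")).as("Address"))
  .show(false)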
Answer 1 (score: 1)
You can use groupBy with the collect_set aggregation function, and then a udf function to pick the first string that starts with "my":
import org.apache.spark.sql.functions._

// assumes each group contains exactly one value starting with "my";
// use headOption.orNull instead of head if a group might contain none
def myudf = udf((array: Seq[String]) => array.filter(_.startsWith("my")).head)

df.groupBy("firstName", "lastName")
  .agg(myudf(collect_set("Phone")).as("Phone"), myudf(collect_set("Address")).as("Address"))
  .show(false)
which should give you
+----------+---------+-------+-------+
|firstName |lastName |Phone  |Address|
+----------+---------+-------+-------+
|firstName1|lastName1|myInfo1|myInfo2|
+----------+---------+-------+-------+
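A UDF-free variant of the same idea is also possible, sketched here under the same assumption of one matching value per group: first with ignoreNulls = true keeps the single value that the when filter lets through, much like the min construct in the previous answer.

import org.apache.spark.sql.functions._

// when() without otherwise() yields null for non-matching values,
// and first(..., ignoreNulls = true) skips those nulls within each group
df.groupBy("firstName", "lastName")
  .agg(
    first(when(col("Phone").startsWith("my"), col("Phone")), ignoreNulls = true).as("Phone"),
    first(when(col("Address").startsWith("my"), col("Address")), ignoreNulls = true).as("Address"))
  .show(false)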
I hope the answer is helpful.