I have a dataframe df, shown below:
**customers** **product** **val_id** **rule_name** **rule_id** **priority**
1 A 1 ABC 123 1
3 Z r ERF 789 2
2 B X ABC 123 2
2 B X DEF 456 3
1 A 1 DEF 456 2
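I want to create a new dataframe df2 that contains each customer ID only once. The rule_name and rule_id columns differ between rows for the same customer, so for each customer I want to keep the record with the highest priority (the lowest priority value, as in the expected output below). My final result should be: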
**customers** **product** **val_id** **rule_name** **rule_id** **priority**
1 A 1 ABC 123 1
3 Z r ERF 789 2
2 B X ABC 123 2
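Can anyone help me achieve this with Spark Scala? Any help would be appreciated.
Answer 0 (score: 4)
You basically want to select the rows that hold an extreme value in a column. This is a very common problem, so there is even a whole tag for it, greatest-n-per-group. See also the question "SQL Select only rows with Max Value on a Column", which has a good answer.
Here is an example for your specific case.
Note that this can select multiple rows for a customer if that customer has several rows with the same (lowest) priority value.
This example is in pyspark, but it should translate directly to Scala.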
from pyspark.sql import functions as F
# Find the best (lowest) priority for each customer. This DF has only two columns.
cusPriDF = df.groupBy("customers").agg(F.min(df["priority"]).alias("priority"))
# Join back to keep only those rows and recover all columns.
bestRowsDF = df.join(cusPriDF, on=["customers", "priority"], how="inner")
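To make the two steps and the tie caveat concrete, here is a minimal plain-Python sketch of the same logic. It does not use Spark: the `best_rows` helper and the extra tied row are hypothetical, added only for illustration, and the data mirrors the question's df.

```python
# Plain-Python stand-in for the Spark logic above; no Spark session is needed.
# Each tuple is (customers, product, val_id, rule_name, rule_id, priority).
rows = [
    (1, "A", "1", "ABC", 123, 1),
    (3, "Z", "r", "ERF", 789, 2),
    (2, "B", "X", "ABC", 123, 2),
    (2, "B", "X", "DEF", 456, 3),
    (1, "A", "1", "DEF", 456, 2),
]


def best_rows(rows):
    """Keep every row whose priority equals its customer's minimum priority."""
    # Step 1 (the groupBy + min step): one minimum priority per customer.
    best = {}
    for customer, *_, priority in rows:
        best[customer] = min(priority, best.get(customer, priority))
    # Step 2 (the inner-join step): keep rows matching (customer, min priority).
    return [r for r in rows if r[-1] == best[r[0]]]


selected = best_rows(rows)

# Hypothetical extra row that ties with customer 3's minimum priority:
# both rows for customer 3 survive, which is the caveat noted above.
tied = best_rows(rows + [(3, "Z", "r", "XYZ", 999, 2)])
```

With the question's data, `selected` holds one row per customer. With the tied row added, customer 3 appears twice, so a further tie-breaking step would be needed if exactly one row per customer is required.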
Answer 1 (score: 0)
To create df2, first order df by priority, then take the first row for each customer ID. Like this:
import org.apache.spark.sql.functions.first
val columns = df.schema.map(_.name).filterNot(_ == "customers").map(c => first(c).as(c))
val df2 = df.orderBy("priority").groupBy("customers").agg(columns.head, columns.tail: _*)
df2.show()
It gives you the expected output:
+----------+--------+-------+----------+--------+---------+
| customers| product| val_id| rule_name| rule_id| priority|
+----------+--------+-------+----------+--------+---------+
| 1| A| 1| ABC| 123| 1|
| 3| Z| r| ERF| 789| 2|
| 2| B| X| ABC| 123| 2|
+----------+--------+-------+----------+--------+---------+
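One caveat, offered as a side note: the Spark documentation describes `first` as non-deterministic after a shuffle, so sorting before `groupBy` is not a guaranteed way to pick the lowest-priority row. The intent of this answer (exactly one row per customer, the one with the lowest priority value) can be sketched in plain Python as follows. The function name and the tie-breaking rule (keep the earliest row) are assumptions made for illustration, not part of the answer.

```python
# Plain-Python sketch; each tuple is
# (customers, product, val_id, rule_name, rule_id, priority).
rows = [
    (1, "A", "1", "ABC", 123, 1),
    (3, "Z", "r", "ERF", 789, 2),
    (2, "B", "X", "ABC", 123, 2),
    (2, "B", "X", "DEF", 456, 3),
    (1, "A", "1", "DEF", 456, 2),
]


def one_row_per_customer(rows):
    """Return exactly one row per customer: the one with the lowest priority value.

    Ties are broken by keeping the row that appears first (an assumption).
    """
    chosen = {}
    for row in rows:
        customer, priority = row[0], row[-1]
        # Replace only on a strictly lower priority, so the earliest row wins ties.
        if customer not in chosen or priority < chosen[customer][-1]:
            chosen[customer] = row
    return list(chosen.values())


df2_rows = one_row_per_customer(rows)
```

In Spark itself, a common way to get the same explicit choice is `row_number()` over a window partitioned by customers and ordered by priority, keeping the rows numbered 1.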
Answer 2 (score: 0)
Corey beat me to it, but here is the Scala version:
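// Assumes a SparkSession named `spark` (as in spark-shell); toDF needs its implicits.
import spark.implicits._
import org.apache.spark.sql.functions.min
val df = Seq(
  (1, "A", "1", "ABC", 123, 1),
  (3, "Z", "r", "ERF", 789, 2),
  (2, "B", "X", "ABC", 123, 2),
  (2, "B", "X", "DEF", 456, 3),
  (1, "A", "1", "DEF", 456, 2)).toDF("customers", "product", "val_id", "rule_name", "rule_id", "priority")
// Lowest priority value per customer.
val priorities = df.groupBy("customers").agg(min(df.col("priority")).alias("priority"))
// Join back on both keys to recover the remaining columns.
val top_rows = df.join(priorities, Seq("customers", "priority"), "inner")
top_rows.show()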
+---------+--------+-------+------+---------+-------+
|customers|priority|product|val_id|rule_name|rule_id|
+---------+--------+-------+------+---------+-------+
| 1| 1| A| 1| ABC| 123|
| 3| 2| Z| r| ERF| 789|
| 2| 2| B| X| ABC| 123|
+---------+--------+-------+------+---------+-------+
Answer 3 (score: 0)
You have to use the min aggregation on the priority column after grouping the dataframe by customers, then inner join the original dataframe with the aggregated dataframe and select the required columns.
import org.apache.spark.sql.functions.min
// min, not max: the lowest priority value is the one to keep for each customer.
val aggregatedDF = dataframe.groupBy("customers").agg(min("priority").as("priority_1"))
  .withColumnRenamed("customers", "customers_1")
val finalDF = dataframe.join(aggregatedDF, dataframe("customers") === aggregatedDF("customers_1") && dataframe("priority") === aggregatedDF("priority_1"))
finalDF.select("customers", "product", "val_id", "rule_name", "rule_id", "priority").show
You should get the desired result.