Get unique records in Spark

Date: 2017-06-04 05:58:51

Tags: scala apache-spark apache-spark-sql greatest-n-per-group

I have a dataframe df as described below:

customers   product   val_id   rule_name   rule_id   priority
    1          A         1        ABC        123        1
    3          Z         r        ERF        789        2
    2          B         X        ABC        123        2
    2          B         X        DEF        456        3
    1          A         1        DEF        456        2

I want to create a new dataframe df2 that has only unique customer IDs. Since the rule_name and rule_id columns differ for the same customer in the data, I want to select the records that have the highest priority for each customer, so my final result should be:

customers   product   val_id   rule_name   rule_id   priority
    1          A         1        ABC        123        1
    3          Z         r        ERF        789        2
    2          B         X        ABC        123        2
Can anyone help me achieve this with Spark Scala? Any help would be appreciated.

4 Answers:

Answer 0 (score: 4)

You basically want to select the rows with an extreme value in a column. This is such a common problem that there is even a dedicated tag for it, greatest-n-per-group. Also see the question SQL Select only rows with Max Value on a Column, which has a nice answer.
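Since the question is also tagged apache-spark-sql, the same idea as the linked SQL answer can be written as a query against a temp view. A minimal sketch, assuming df has been registered as a view named "df" and spark is the active SparkSession:

// assumed setup: df.createOrReplaceTempView("df")
val bestRowsSqlDF = spark.sql("""
  SELECT d.*
  FROM df d
  JOIN (
    SELECT customers, MIN(priority) AS priority
    FROM df
    GROUP BY customers
  ) best
    ON d.customers = best.customers
   AND d.priority  = best.priority
""")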

Here is a DataFrame-API example for your specific case.

Note that this can select multiple rows for a customer if that customer has several rows sharing the same (lowest) priority value; a window-function variant that breaks such ties is sketched after the example.

This example is in PySpark, but it should translate directly to Scala.

from pyspark.sql import functions as F

# find the best (lowest) priority for each customer; this DF has only two columns
cusPriDF = df.groupBy("customers").agg(F.min(df["priority"]).alias("priority"))
# now join back to keep only those rows and recover all original columns
bestRowsDF = df.join(cusPriDF, on=["customers", "priority"], how="inner")
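If you need exactly one row per customer even when priorities tie, a window function with row_number is a common alternative (the tie-breaking variant mentioned above). A minimal Scala sketch, assuming the same df:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// rank rows within each customer by ascending priority (1 = best),
// keep only the top-ranked row, then drop the helper column
val w = Window.partitionBy("customers").orderBy("priority")
val df2 = df
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")

Which of the tied rows row_number keeps is arbitrary unless you add further columns to the ordering.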

Answer 1 (score: 0)

To create df2 you have to first order df by priority and then find the unique customers by id, like this:

import org.apache.spark.sql.functions.first

// aggregate every non-key column with first(), keeping its original name
val columns = df.schema.map(_.name).filterNot(_ == "customers").map(col => first(col).as(col))

val df2 = df.orderBy("priority").groupBy("customers").agg(columns.head, columns.tail: _*)
df2.show

It will give you the expected output:

+----------+--------+-------+----------+--------+---------+
| customers| product| val_id| rule_name| rule_id| priority|
+----------+--------+-------+----------+--------+---------+
|         1|       A|      1|       ABC|     123|        1|
|         3|       Z|      r|       ERF|     789|        2|
|         2|       B|      X|       ABC|     123|        2|
+----------+--------+-------+----------+--------+---------+

Answer 2 (score: 0)

Corey beat me to it, but here's the Scala version:

import org.apache.spark.sql.functions.min
import spark.implicits._  // for .toDF on a local Seq; assumes a SparkSession named spark (as in spark-shell)

val df = Seq(
  (1, "A", "1", "ABC", 123, 1),
  (3, "Z", "r", "ERF", 789, 2),
  (2, "B", "X", "ABC", 123, 2),
  (2, "B", "X", "DEF", 456, 3),
  (1, "A", "1", "DEF", 456, 2)
).toDF("customers", "product", "val_id", "rule_name", "rule_id", "priority")

// best (lowest) priority per customer, then join back to keep the matching rows
val priorities = df.groupBy("customers").agg(min(df.col("priority")).alias("priority"))
val top_rows = df.join(priorities, Seq("customers", "priority"), "inner")
top_rows.show

+---------+--------+-------+------+---------+-------+
|customers|priority|product|val_id|rule_name|rule_id|
+---------+--------+-------+------+---------+-------+
|        1|       1|      A|     1|      ABC|    123|
|        3|       2|      Z|     r|      ERF|    789|
|        2|       2|      B|     X|      ABC|    123|
+---------+--------+-------+------+---------+-------+
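If the original column order matters, one extra select puts the joined result back into the question's layout:

top_rows.select("customers", "product", "val_id", "rule_name", "rule_id", "priority").show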

Answer 3 (score: 0)

You have to group the dataframe by customers with a min aggregation on priority, then inner join the original dataframe with the aggregated dataframe and select the required columns.

val aggregatedDF = dataframe.groupBy("customers").agg(min("priority").as("priority_1"))
  .withColumnRenamed("customers", "customers_1")

val finalDF = dataframe.join(aggregatedDF, dataframe("customers") === aggregatedDF("customers_1") && dataframe("priority") === aggregatedDF("priority_1"))
finalDF.select("customers", "product", "val_id", "rule_name", "rule_id", "priority").show

You should get the desired result.