How to remove duplicates in a Spark DataFrame

Time: 2019-01-24 21:22:50

Tags: apache-spark apache-spark-sql

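Problem statement

Determine which pair of actors have worked together the most. Working together is defined as appearing in the same movie. The output should have three columns: actor 1, actor 2, and count. The output should be sorted by count in descending order. Solving this problem requires a self-join.

Solution

I have the following query to solve it, shown with its output. The output contains every pair twice, once as (actor 1, actor 2) and once as (actor 2, actor 1), so I would like to know how to remove these duplicates.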

val df = movies.as("set").join(movies.as("anotherSet"), $"set.movie_title" === $"anotherSet.movie_title" && $"set.actor_name" =!= $"anotherSet.actor_name")
    .groupBy($"set.actor_name".as("actor 1"), $"anotherSet.actor_name".as("actor 2"))
    .count()
    .orderBy($"count".desc)

+-----------------+------------------+-----+
|          actor 1|           actor 2|count|
+-----------------+------------------+-----+
| Lynn, Sherry (I)|   McGowan, Mickie|   23|
|  McGowan, Mickie|  Lynn, Sherry (I)|   23|
| Lynn, Sherry (I)|   Bergen, Bob (I)|   19|
|  Bergen, Bob (I)|   McGowan, Mickie|   19|
|  McGowan, Mickie|   Bergen, Bob (I)|   19|
|  Bergen, Bob (I)|  Lynn, Sherry (I)|   19|
|  McGowan, Mickie|   Angel, Jack (I)|   17|
|  Angel, Jack (I)|   McGowan, Mickie|   17|
|  Angel, Jack (I)|  Lynn, Sherry (I)|   17|
| Lynn, Sherry (I)|   Angel, Jack (I)|   17|
|  McGowan, Mickie|       Rabson, Jan|   16|
| Lynn, Sherry (I)|       Rabson, Jan|   16|
|      Rabson, Jan|   McGowan, Mickie|   16|
|      Rabson, Jan|  Lynn, Sherry (I)|   16|
|Darling, Jennifer|   McGowan, Mickie|   15|
|  McGowan, Mickie| Darling, Jennifer|   15|
|  Bergen, Bob (I)|     Harnell, Jess|   14|
|Darling, Jennifer|  Lynn, Sherry (I)|   14|
|Sandler, Adam (I)|Schneider, Rob (I)|   14|
|    Harnell, Jess|   Bergen, Bob (I)|   14|
+-----------------+------------------+-----+

2 answers:

Answer 0 (score: 2)

Use least and greatest so that pairs such as (a,b) and (b,a) are treated as the same.
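A minimal sketch of the least/greatest idea, applied to the query from the question. The object name ActorPairs, the helper countPairs, and the sample rows are hypothetical and exist only for illustration; the column names movie_title and actor_name are taken from the question. least and greatest are the column functions in org.apache.spark.sql.functions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, desc, greatest, least}

object ActorPairs {
  // Hypothetical helper: expects columns "movie_title" and "actor_name".
  def countPairs(movies: DataFrame): DataFrame =
    movies.as("set")
      .join(movies.as("anotherSet"),
        col("set.movie_title") === col("anotherSet.movie_title") &&
          col("set.actor_name") =!= col("anotherSet.actor_name"))
      .select(
        col("set.movie_title").as("movie_title"),
        // Put the alphabetically smaller name first, so (a,b) and (b,a)
        // both become (a,b).
        least(col("set.actor_name"), col("anotherSet.actor_name")).as("actor 1"),
        greatest(col("set.actor_name"), col("anotherSet.actor_name")).as("actor 2"))
      // The self-join emits each pair twice per movie; keep one row per movie.
      .distinct()
      .groupBy("actor 1", "actor 2")
      .count()
      .orderBy(desc("count"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("actor-pairs")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows; the movie titles are placeholders.
    val movies = Seq(
      ("Movie A", "Lynn, Sherry (I)"),
      ("Movie A", "McGowan, Mickie"),
      ("Movie B", "Lynn, Sherry (I)"),
      ("Movie B", "McGowan, Mickie"),
      ("Movie B", "Bergen, Bob (I)")
    ).toDF("movie_title", "actor_name")

    countPairs(movies).show(false)
    spark.stop()
  }
}
```

The distinct step matters here: without it, every collaboration would be counted twice, because the self-join produces both orderings of a pair for each movie before least and greatest collapse them into one.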

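Answer 1 (score: 0)

You can also compare and order the two names at the row level, then take the distinct records and count them by Actor 1 and Actor 2.

Something like this: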

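import org.apache.spark.sql.functions.when

// m1 and m2 are assumed to be two DataFrames holding the same movie data,
// each with the columns "Movie" and "Actor".
val df1 = m1.join(m2, m1("Movie") === m2("Movie") && m1("Actor") =!= m2("Actor"))
  .select(m1("Movie"),
          // Smaller name first, larger name second, so (a,b) and (b,a) become the same row.
          when(m1("Actor") < m2("Actor"), m1("Actor")).otherwise(m2("Actor")).as("Actor 1"),
          when(m1("Actor") > m2("Actor"), m1("Actor")).otherwise(m2("Actor")).as("Actor 2"))
  .distinct.groupBy("Actor 1", "Actor 2").count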