Spark 1.6 SQL Windows

Date: 2018-03-24 18:39:44

Tags: apache-spark apache-spark-sql spark-dataframe

I have an employee data dump as shown below. I need to identify the employees that only ever have the status 'Continue'. The required output is shown below as well. What is the best approach I can follow in Spark? The input and output are attached here.

Input

Emp_Id  Emp_Age  Emp_Status      Emp_Name   Date_Updated
1       43       Continue        John       3/3/15
1       43       Continue        John       3/4/15
2       35       Continue        Peter      3/5/15
3       32       Finished        Alaxender  3/6/15
3       32       Continue        Alaxender  3/7/15
4       45       Continue        Patrick    3/8/15
4       45       No Information  Patrick    3/9/15

Output

Emp_Id  Emp_Age  Emp_Status  Emp_Name  Date_Updated
1       43       Continue    John      3/3/15
1       43       Continue    John      3/4/15
2       35       Continue    Peter     3/5/15

Thanks

2 answers:

Answer 0 (score: 0)

Couldn't you just group by the employee with groupBy and check the count of "Continue" rows against the total count?
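
For reference, a minimal sketch of how the input DataFrame used below could be built from the sample data. This is an assumption, not part of the original answer; it presumes the Spark 1.6 shell where sqlContext is predefined, and takes the column names from the question:

// Assumed setup: Spark 1.6 shell with a predefined sqlContext
import sqlContext.implicits._

val input = Seq(
  (1, 43, "Continue", "John", "3/3/15"),
  (1, 43, "Continue", "John", "3/4/15"),
  (2, 35, "Continue", "Peter", "3/5/15"),
  (3, 32, "Finished", "Alaxender", "3/6/15"),
  (3, 32, "Continue", "Alaxender", "3/7/15"),
  (4, 45, "Continue", "Patrick", "3/8/15"),
  (4, 45, "No Information", "Patrick", "3/9/15")
).toDF("Emp_Id", "Emp_Age", "Emp_Status", "Emp_Name", "Date_Updated")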

import org.apache.spark.sql.functions.{col, count, when}

// Keep employees whose number of "Continue" rows equals their total row count
val employees = input
  .groupBy("Emp_Name")
  .agg(
    count("*").as("total_status"),
    count(when(col("Emp_Status") === "Continue", 1)).as("total_continues")
  )
  .filter(col("total_status") === col("total_continues"))
  .select("Emp_Name")

// List of qualifying employee names, collected to the driver
val liste = employees.map(x => x.mkString).collect.toList

// Get the final output: only the rows of those employees
val output = input.filter(col("Emp_Name").isin(liste: _*))
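
A quick check, assuming the input sketch above:

// Should print only the rows for Emp_Id 1 and 2 (John and Peter),
// matching the expected output in the question
output.show()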

Answer 1 (score: 0)

You can use collect_set over a window partitioned by employee to collect the distinct statuses for each employee, and then select from the resulting dataset only the rows whose statuses equal Array("Continue"):

// toDF and the $ column syntax come from the session implicits:
// spark.implicits._ on Spark 2.x, or sqlContext.implicits._ on Spark 1.6
import org.apache.spark.sql.functions.{collect_set, lit}

val df = Seq(
  (1, 43, "Continue", "John", "3/3/15"),
  (1, 43, "Continue", "John", "3/4/15"),
  (2, 35, "Continue", "Peter", "3/5/15"),
  (3, 32, "Finished", "Alaxender", "3/6/15"),
  (3, 32, "Continue", "Alaxender", "3/7/15"),
  (4, 45, "Continue", "Patrick", "3/8/15"),
  (4, 45, "No Information", "Patrick", "3/9/15")
).toDF("Emp_Id", "Emp_Age", "Emp_Status", "Emp_Name", "Date_Updated")

import org.apache.spark.sql.expressions.Window

// Collect the distinct statuses per employee, then keep only the rows
// whose status set is exactly Array("Continue")
df.withColumn(
    "Statuses",
    collect_set($"Emp_Status").over(Window.partitionBy($"Emp_Id"))
  ).
  where($"Statuses" === Array("Continue")).
  drop($"Statuses").
  show

// +------+-------+----------+--------+------------+
// |Emp_Id|Emp_Age|Emp_Status|Emp_Name|Date_Updated|
// +------+-------+----------+--------+------------+
// |     1|     43|  Continue|    John|      3/3/15|
// |     1|     43|  Continue|    John|      3/4/15|
// |     2|     35|  Continue|   Peter|      3/5/15|
// +------+-------+----------+--------+------------+

[UPDATE]

collect_set() is available in Spark 1.6, but it sounds like window operations in that version don't support it. The workaround is to use a groupBy followed by a join:

// Per-employee set of distinct statuses, kept only when it is exactly Array("Continue")
val df2 = df.groupBy($"Emp_Id").agg(collect_set($"Emp_Status").as("Statuses")).
  where($"Statuses" === lit(Array("Continue"))).
  drop($"Statuses")

// Join back to the original rows to recover the full records
df.join(df2, Seq("Emp_Id")).show
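
With the sample data above, this join should again return only the rows for Emp_Id 1 and 2 (John and Peter), matching the expected output in the question.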