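I have a data dump of work orders, shown below. I need to identify every order that has both an "In Progress" status and a "Finished" status.
More precisely, an order should be displayed only if it has an "In Progress" status together with a "Finished" or "Not Valid" status. The expected output is given below. What is the best approach for this in Spark? The input and the expected output are both included here.
Input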
Work_ Req_Id,Assigned to,Date,Status
R1,John,3/4/15,In Progress
R1,George,3/5/15,In Progress
R2,Peter,3/6/15,In Progress
R3,Alaxender,3/7/15,Finished
R3,Alaxender,3/8/15,In Progress
R4,Patrick,3/9/15,Finished
R4,Patrick,3/10/15,Not Valid
R5,Peter,3/11/15,Finished
R6,,3/12/15,Not Valid
R7,George,3/13/15,Not Valid
R7,George,3/14/15,In Progress
R8,John,3/15/15,Finished
R8,John,3/16/15,Failed
R9,Alaxender,3/17/15,Finished
R9,John,3/18/15,Removed
R10,Patrick,3/19/15,In Progress
R10,Patrick,3/20/15,Finished
R10,Peter,3/21/15,Hold
Output
Work_ Req_Id,Assigned to,Date,Status
R3,Alaxender,3/7/15,Finished
R3,Alaxender,3/8/15,In Progress
R7,George,3/13/15,Not Valid
R7,George,3/14/15,In Progress
R10,Patrick,3/19/15,In Progress
R10,Patrick,3/20/15,Finished
R10,Peter,3/21/15,Hold
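Answer 0 (score: 1)
You can use groupBy with collect_list to gather the list of statuses for each Work_Req_Id, together with a UDF that keeps only the IDs whose status list matches the requirement. The grouped DataFrame is then joined back to the original DataFrame so that every row of a qualifying ID is returned.
Window functions are not proposed here because Spark 1.6 does not appear to support collect_list/collect_set in window operations.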
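The group, filter, and join-back steps described above can be sketched on plain Scala collections, which makes the rule easy to check without a Spark session. This is only an illustration of the logic, not the Spark code itself: the Order case class, the isWanted helper, and the shortened sample data are names and values chosen here for the example.

```scala
// Plain-Scala sketch (no Spark): illustrative names, shortened sample data.
final case class Order(id: String, assignee: String, date: String, status: String)

val orders = Seq(
  Order("R2", "Peter", "3/6/15", "In Progress"),
  Order("R3", "Alaxender", "3/7/15", "Finished"),
  Order("R3", "Alaxender", "3/8/15", "In Progress"),
  Order("R4", "Patrick", "3/9/15", "Finished"),
  Order("R4", "Patrick", "3/10/15", "Not Valid"),
  Order("R7", "George", "3/13/15", "Not Valid"),
  Order("R7", "George", "3/14/15", "In Progress")
)

// The rule: "In Progress" together with "Finished" or "Not Valid".
def isWanted(statuses: Seq[String]): Boolean =
  statuses.contains("In Progress") &&
    (statuses.contains("Finished") || statuses.contains("Not Valid"))

// Step 1: collect the statuses per ID (the role of groupBy + collect_list).
val statusesById: Map[String, Seq[String]] =
  orders.groupBy(_.id).map { case (id, rows) => id -> rows.map(_.status) }

// Step 2: keep only the IDs whose status list satisfies the rule.
val wantedIds: Set[String] =
  statusesById.collect { case (id, statuses) if isWanted(statuses) => id }.toSet

// Step 3: "join back" by keeping every original row of a qualifying ID.
val result: Seq[Order] = orders.filter(o => wantedIds.contains(o.id))

result.foreach(println)
```

R2 is dropped because it only has "In Progress", and R4 is dropped because it never was "In Progress"; all rows of R3 and R7 are kept.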
// spark-shell normally has these in scope already. A standalone application needs this import,
// plus the SQL implicits (sqlContext.implicits._ on 1.6, spark.implicits._ on 2.x and later).
import org.apache.spark.sql.functions.{collect_list, udf}

// Sample data taken from the question's input
val df = Seq(
  ("R1", "John", "3/4/15", "In Progress"),
  ("R1", "George", "3/5/15", "In Progress"),
  ("R2", "Peter", "3/6/15", "In Progress"),
  ("R3", "Alaxender", "3/7/15", "Finished"),
  ("R3", "Alaxender", "3/8/15", "In Progress"),
  ("R4", "Patrick", "3/9/15", "Finished"),
  ("R4", "Patrick", "3/10/15", "Not Valid"),
  ("R5", "Peter", "3/11/15", "Finished"),
  ("R6", "", "3/12/15", "Not Valid"),
  ("R7", "George", "3/13/15", "Not Valid"),
  ("R7", "George", "3/14/15", "In Progress"),
  ("R8", "John", "3/15/15", "Finished"),
  ("R8", "John", "3/16/15", "Failed"),
  ("R9", "Alaxender", "3/17/15", "Finished"),
  ("R9", "John", "3/18/15", "Removed"),
  ("R10", "Patrick", "3/19/15", "In Progress"),
  ("R10", "Patrick", "3/20/15", "Finished"),
  ("R10", "Peter", "3/21/15", "Hold")
).toDF("Work_Req_Id", "Assigned_To", "Date", "Status")

// True when an ID has "In Progress" together with "Finished" or "Not Valid"
val wanted = udf(
  (statuses: Seq[String]) => statuses.contains("In Progress") &&
    (statuses.contains("Finished") || statuses.contains("Not Valid"))
)

// Collect the statuses per Work_Req_Id and keep only the qualifying IDs
val df2 = df.groupBy($"Work_Req_Id").agg(collect_list($"Status").as("Statuses")).
  where( wanted($"Statuses") ).
  drop($"Statuses")

// Join back to the original rows so that every row of a qualifying ID is shown
df.join(df2, Seq("Work_Req_Id")).show
// +-----------+-----------+-------+-----------+
// |Work_Req_Id|Assigned_To|   Date|     Status|
// +-----------+-----------+-------+-----------+
// |         R3|  Alaxender| 3/7/15|   Finished|
// |         R3|  Alaxender| 3/8/15|In Progress|
// |         R7|     George|3/13/15|  Not Valid|
// |         R7|     George|3/14/15|In Progress|
// |        R10|    Patrick|3/19/15|In Progress|
// |        R10|    Patrick|3/20/15|   Finished|
// |        R10|      Peter|3/21/15|       Hold|
// +-----------+-----------+-------+-----------+
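For readers on Spark 2.0 or later, where collect_set can, as far as I know, be used over a window, the join can be avoided. The following is a sketch under that assumption and is not part of the answer above: the local SparkSession setup, the shortened sample data, and the names ordersDf, byId, hasStatus, and windowed are all chosen here for illustration. It also replaces the UDF with the built-in array_contains.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{array_contains, collect_set}

// Assumption: Spark 2.0+ running locally; in spark-shell a session already exists.
val spark = SparkSession.builder().master("local[*]").appName("work-orders").getOrCreate()
import spark.implicits._

// Shortened sample data (a subset of the question's input)
val ordersDf = Seq(
  ("R1", "John", "3/4/15", "In Progress"),
  ("R3", "Alaxender", "3/7/15", "Finished"),
  ("R3", "Alaxender", "3/8/15", "In Progress"),
  ("R4", "Patrick", "3/9/15", "Finished"),
  ("R4", "Patrick", "3/10/15", "Not Valid"),
  ("R7", "George", "3/13/15", "Not Valid"),
  ("R7", "George", "3/14/15", "In Progress")
).toDF("Work_Req_Id", "Assigned_To", "Date", "Status")

// One window partition per work request; with no ordering, the frame is the whole partition.
val byId = Window.partitionBy($"Work_Req_Id")

// Helper: does the per-ID status set contain the given status?
def hasStatus(status: String): Column = array_contains($"Statuses", status)

// Attach the set of statuses of the ID to every row, filter, then drop the helper column.
val windowed = ordersDf
  .withColumn("Statuses", collect_set($"Status").over(byId))
  .where(hasStatus("In Progress") && (hasStatus("Finished") || hasStatus("Not Valid")))
  .drop("Statuses")

windowed.show()
```

Because the status set is attached to each row directly, no second DataFrame and no join are needed. Whether this is faster than the groupBy-and-join version depends on the data and is not something this sketch measures.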