I have a dataframe as shown below and would like to reduce it by merging adjacent rows, i.e. rows where previous.close = current.open:
val df = Seq(
("Ray","2018-09-01","2018-09-10"),
("Ray","2018-09-10","2018-09-15"),
("Ray","2018-09-16","2018-09-18"),
("Ray","2018-09-21","2018-09-27"),
("Ray","2018-09-27","2018-09-30"),
("Scott","2018-09-21","2018-09-23"),
("Scott","2018-09-24","2018-09-28"),
("Scott","2018-09-28","2018-09-30"),
("Scott","2018-10-05","2018-10-09"),
("Scott","2018-10-11","2018-10-15"),
("Scott","2018-10-15","2018-09-20")
)
The desired output is as follows:
(("Ray","2018-09-01","2018-09-15"),
("Ray","2018-09-16","2018-09-18"),
("Ray","2018-09-21","2018-09-30"),
("Scott","2018-09-21","2018-09-23"),
("Scott","2018-09-24","2018-09-30"),
("Scott","2018-10-05","2018-10-09"),
("Scott","2018-10-11","2018-10-20"))
So far, I am able to squash the adjacent rows with the DataFrame solution below:
df.alias("t1").join(df.alias("t2"),$"t1.name" === $"t2.name" and $"t1.close"=== $"t2.open" )
.select("t1.name","t1.open","t2.close")
.distinct.show(false)
+-----+----------+----------+
|name |open      |close     |
+-----+----------+----------+
|Scott|2018-09-24|2018-09-30|
|Scott|2018-10-11|2018-10-20|
|Ray  |2018-09-01|2018-09-15|
|Ray  |2018-09-21|2018-09-30|
+-----+----------+----------+
I tried a similar construct with $"t1.close" =!= $"t2.open" to pick up the standalone rows, and then union the two to get the final result, but I end up with extra rows that I cannot filter out correctly. How can I achieve this?
This question differs from Spark SQL window function with complex condition, which computes an additional date column as a new column.
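For reference, a minimal sketch of that union-style attempt could look like the following; the left_semi/except step for isolating the standalone rows is my assumption, not code from the question. It yields the desired output as long as every range spans at most two rows, but leaves partially merged duplicates behind once a range chains across three or more rows, which is the gap the answers below close.
// Pairs merged via the self-join, plus rows that are not part of any pair
val merged = df.alias("t1")
  .join(df.alias("t2"), $"t1.name" === $"t2.name" and $"t1.close" === $"t2.open")
  .select($"t1.name", $"t1.open", $"t2.close")

// Rows that neither continue another row nor are continued by one (assumed step)
val chained = df.alias("a")
  .join(df.alias("b"),
    $"a.name" === $"b.name" and ($"a.close" === $"b.open" or $"a.open" === $"b.close"),
    "left_semi")
val standalone = df.except(chained)

merged.union(standalone).show(false) // breaks down once a range chains across 3+ rows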
Answer 0 (score: 2)
Here's one approach:
1. Create a new column temp1 that is null if the current open equals the previous close, and holds the current open value otherwise
2. Create a column temp2 that backfills the nulls in temp1 with the last non-null temp1 value
3. Group the resulting dataset by (name, temp2) to generate the contiguous date ranges

I've revised your sample data to cover cases where a contiguous date range spans more than two rows.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = Seq(
("Ray","2018-09-01","2018-09-10"),
("Ray","2018-09-10","2018-09-15"),
("Ray","2018-09-16","2018-09-18"),
("Ray","2018-09-21","2018-09-27"),
("Ray","2018-09-27","2018-09-30"),
("Scott","2018-09-21","2018-09-23"),
("Scott","2018-09-23","2018-09-28"), // <-- Revised
("Scott","2018-09-28","2018-09-30"),
("Scott","2018-10-05","2018-10-09"),
("Scott","2018-10-11","2018-10-15"),
("Scott","2018-10-15","2018-10-20")
).toDF("name", "open", "close")
val win = Window.partitionBy($"name").orderBy("open", "close")
val df2 = df.
  // Step 1: temp1 = open when the row starts a new range (first row in the
  // partition, or the previous close does not equal the current open); null otherwise
  withColumn("temp1", when(
    row_number.over(win) === 1 || lag($"close", 1).over(win) =!= $"open", $"open")
  ).
  // Step 2: temp2 backfills the nulls in temp1 with the last non-null value
  withColumn("temp2", last($"temp1", ignoreNulls=true).over(
    win.rowsBetween(Window.unboundedPreceding, 0)
  ))
df2.show
// +-----+----------+----------+----------+----------+
// | name|      open|     close|     temp1|     temp2|
// +-----+----------+----------+----------+----------+
// |Scott|2018-09-21|2018-09-23|2018-09-21|2018-09-21|
// |Scott|2018-09-23|2018-09-28|      null|2018-09-21|
// |Scott|2018-09-28|2018-09-30|      null|2018-09-21|
// |Scott|2018-10-05|2018-10-09|2018-10-05|2018-10-05|
// |Scott|2018-10-11|2018-10-15|2018-10-11|2018-10-11|
// |Scott|2018-10-15|2018-10-20|      null|2018-10-11|
// |  Ray|2018-09-01|2018-09-10|2018-09-01|2018-09-01|
// |  Ray|2018-09-10|2018-09-15|      null|2018-09-01|
// |  Ray|2018-09-16|2018-09-18|2018-09-16|2018-09-16|
// |  Ray|2018-09-21|2018-09-27|2018-09-21|2018-09-21|
// |  Ray|2018-09-27|2018-09-30|      null|2018-09-21|
// +-----+----------+----------+----------+----------+
The above shows the result of steps 1 and 2, where temp2 holds the earliest open within each contiguous date range. Step 3 then uses max to get the latest close of each range:
df2.
groupBy($"name", $"temp2".as("open")).agg(max($"close").as("close")).
show
// +-----+----------+----------+
// |name |open      |close     |
// +-----+----------+----------+
// |Scott|2018-09-21|2018-09-30|
// |Scott|2018-10-05|2018-10-09|
// |Scott|2018-10-11|2018-10-20|
// |Ray  |2018-09-01|2018-09-15|
// |Ray  |2018-09-16|2018-09-18|
// |Ray  |2018-09-21|2018-09-30|
// +-----+----------+----------+
Answer 1 (score: 1)
Updated: the code has now been tested :-)
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, datediff, lag, lit, min, sum, when}
val df = Seq(
("Ray","2018-09-01","2018-09-10"),
("Ray","2018-09-10","2018-09-15"),
("Ray","2018-09-16","2018-09-18"),
("Ray","2018-09-21","2018-09-27"),
("Ray","2018-09-27","2018-09-30"),
("Scott","2018-09-21","2018-09-23"),
("Scott","2018-09-23","2018-09-28"), // <-- Revised
("Scott","2018-09-28","2018-09-30"),
("Scott","2018-10-05","2018-10-09"),
("Scott","2018-10-11","2018-10-15"),
("Scott","2018-10-15","2018-10-20")
).toDF("name", "open", "close")
val window = Window.partitionBy("name").orderBy($"open").rowsBetween(-1, Window.currentRow) //<- only compare the dates of a certain name, and for each row also look at the previous one
df.select(
    $"name", $"open", $"close",
    min($"close").over(window) as "closeBefore_tmp" //<- get the smaller close value (that of the previous entry)
  )
  .withColumn("closeBefore", when($"closeBefore_tmp" === $"close", null).otherwise($"closeBefore_tmp")) //<- in this case there was no previous row: it's the first for this user, so set closeBefore to null
  .createOrReplaceTempView("tmp")
Now you can compare open and closeBefore.
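The answer stops at the temp view, so below is a minimal sketch of how that comparison could be completed; the newRange/rangeId logic is my assumption, not code from the original answer, and it relies on the spark session and $ implicits available in the spark-shell. A row starts a new range when its closeBefore is null or differs from its open; a running sum of that flag assigns an id to each contiguous range, which can then be aggregated.
// Assumed continuation (not part of the original answer): collapse the ranges
import org.apache.spark.sql.functions.max // min, sum and when are imported above

val result = spark.table("tmp")
  .withColumn("newRange",
    when($"closeBefore".isNull || $"closeBefore" =!= $"open", 1).otherwise(0)) // 1 = row starts a new range
  .withColumn("rangeId",
    sum($"newRange").over(Window.partitionBy("name").orderBy("open")))         // running count of range starts
  .groupBy($"name", $"rangeId")
  .agg(min($"open").as("open"), max($"close").as("close"))
  .drop("rangeId")

result.orderBy("name", "open").show(false)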