I'm having a hard time finding a good way to filter a Spark dataset. I've described the basic problem below:
Input
+-----------+----------+-------------------+
|key |statusCode|statusTimestamp |
+-----------+----------+-------------------+
|AAAAAABBBBB|OA |2019-05-24 14:46:00|
|AAAAAABBBBB|VD |2019-05-31 19:31:00|
|AAAAAABBBBB|VA |2019-06-26 00:00:00|
|AAAAAABBBBB|E |2019-06-26 02:00:00|
|AAAAAABBBBB|UV |2019-06-29 00:00:00|
|AAAAAABBBBB|OA |2019-07-01 00:00:00|
|AAAAAABBBBB|EE |2019-07-03 01:00:00|
+-----------+----------+-------------------+
Expected output
+-----------+----------+-------------------+
|key |statusCode|statusTimestamp |
+-----------+----------+-------------------+
|AAAAAABBBBB|UV |2019-06-29 00:00:00|
|AAAAAABBBBB|OA |2019-07-01 00:00:00|
+-----------+----------+-------------------+
I know I could work around the problem by setting the data up like this, but does anyone have a suggestion on how to approach the above filter?
someDS
.groupBy("key")
.pivot("statusCode", Seq("UV", "OA"))
.agg(collect_set($"statusTimestamp"))
.thenSomeOtherStuff...
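For reference, this is roughly what that pivot yields on the input above (a sketch only; the arrays from collect_set carry no ordering guarantee), which is why getting back to individual rows afterwards is awkward:

someDS
  .groupBy("key")
  .pivot("statusCode", Seq("UV", "OA"))
  .agg(collect_set($"statusTimestamp"))
  .show(false)
// Roughly (array order not guaranteed):
// +-----------+---------------------+------------------------------------------+
// |key        |UV                   |OA                                        |
// +-----------+---------------------+------------------------------------------+
// |AAAAAABBBBB|[2019-06-29 00:00:00]|[2019-05-24 14:46:00, 2019-07-01 00:00:00]|
// +-----------+---------------------+------------------------------------------+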
Answer 0 (score: 1)
Although the groupBy/pivot approach would group the timestamps nicely, it would require non-trivial steps (most likely a UDF) to perform the necessary filtering and then re-expand the rows. Here's a different approach, with the following steps:

1. Filter the dataset for rows with statusCode "UV" or "OA" only
2. For each row, build a string of the statusCode values from the previous, current, and next 2 rows
3. Use Regex pattern matching on that string to identify the wanted rows

Sample code below:
import java.sql.Timestamp
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
// Sample data:
// key `A`: requirement #3
// key `B`: requirement #2
// key `C`: requirement #4
val df = Seq(
("A", "OA", Timestamp.valueOf("2019-05-20 00:00:00")),
("A", "E", Timestamp.valueOf("2019-05-30 00:00:00")),
("A", "UV", Timestamp.valueOf("2019-06-22 00:00:00")),
("A", "OA", Timestamp.valueOf("2019-07-01 00:00:00")),
("A", "OA", Timestamp.valueOf("2019-07-03 00:00:00")),
("B", "C", Timestamp.valueOf("2019-06-15 00:00:00")),
("B", "OA", Timestamp.valueOf("2019-06-25 00:00:00")),
("C", "D", Timestamp.valueOf("2019-06-01 00:00:00")),
("C", "OA", Timestamp.valueOf("2019-06-30 00:00:00")),
("C", "UV", Timestamp.valueOf("2019-07-02 00:00:00"))
).toDF("key", "statusCode", "statusTimestamp")
val win = Window.partitionBy("key").orderBy("statusTimestamp")
val df2 = df.
where($"statusCode" === "UV" || $"statusCode" === "OA").
withColumn("statusPrevCurrNext2", concat(
coalesce(lag($"statusCode", 1).over(win), lit("")),
lit("#"),
$"statusCode",
lit("#"),
coalesce(lead($"statusCode", 1).over(win), lit("")),
lit("#"),
coalesce(lead($"statusCode", 2).over(win), lit(""))
))
Let's take a look at df2 (the result of steps 1 and 2):
df2.show(false)
// +---+----------+-------------------+-------------------+
// |key|statusCode|statusTimestamp |statusPrevCurrNext2|
// +---+----------+-------------------+-------------------+
// |B |OA |2019-06-25 00:00:00|#OA## |
// |C |OA |2019-06-30 00:00:00|#OA#UV# | <-- Req #4: Ends with `#UV#`
// |C |UV |2019-07-02 00:00:00|OA#UV## | <-- Req #4: Ends with `#UV##`
// |A |OA |2019-05-20 00:00:00|#OA#UV#OA |
// |A |UV |2019-06-22 00:00:00|OA#UV#OA#OA | <-- Req #3: Starts with `[^#]*#UV#`
// |A |OA |2019-07-01 00:00:00|UV#OA#OA# | <-- Req #3: starts with `UV#`
// |A |OA |2019-07-03 00:00:00|OA#OA## |
// +---+----------+-------------------+-------------------+
Applying step 3:
df2.
where($"statusPrevCurrNext2".rlike("^[^#]*#UV#.*|^UV#.*|.*#UV#+$")).
orderBy("key", "statusTimestamp").
show(false)
// +---+----------+-------------------+-------------------+
// |key|statusCode|statusTimestamp |statusPrevCurrNext2|
// +---+----------+-------------------+-------------------+
// |A |UV |2019-06-22 00:00:00|OA#UV#OA#OA |
// |A |OA |2019-07-01 00:00:00|UV#OA#OA# |
// |C |OA |2019-06-30 00:00:00|#OA#UV# |
// |C |UV |2019-07-02 00:00:00|OA#UV## |
// +---+----------+-------------------+-------------------+
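As a quick usage check, the same steps can be applied to the question's own input; a minimal sketch that reuses the imports and the win window defined above, with inputDF as a hypothetical name for the question's data:

// Hypothetical DataFrame mirroring the question's input
val inputDF = Seq(
  ("AAAAAABBBBB", "OA", Timestamp.valueOf("2019-05-24 14:46:00")),
  ("AAAAAABBBBB", "VD", Timestamp.valueOf("2019-05-31 19:31:00")),
  ("AAAAAABBBBB", "VA", Timestamp.valueOf("2019-06-26 00:00:00")),
  ("AAAAAABBBBB", "E",  Timestamp.valueOf("2019-06-26 02:00:00")),
  ("AAAAAABBBBB", "UV", Timestamp.valueOf("2019-06-29 00:00:00")),
  ("AAAAAABBBBB", "OA", Timestamp.valueOf("2019-07-01 00:00:00")),
  ("AAAAAABBBBB", "EE", Timestamp.valueOf("2019-07-03 01:00:00"))
).toDF("key", "statusCode", "statusTimestamp")

inputDF.
  where($"statusCode" === "UV" || $"statusCode" === "OA").
  withColumn("statusPrevCurrNext2", concat(
    coalesce(lag($"statusCode", 1).over(win), lit("")),
    lit("#"),
    $"statusCode",
    lit("#"),
    coalesce(lead($"statusCode", 1).over(win), lit("")),
    lit("#"),
    coalesce(lead($"statusCode", 2).over(win), lit(""))
  )).
  where($"statusPrevCurrNext2".rlike("^[^#]*#UV#.*|^UV#.*|.*#UV#+$")).
  drop("statusPrevCurrNext2").
  orderBy("key", "statusTimestamp").
  show(false)
// This should keep only the two rows from the expected output:
// +-----------+----------+-------------------+
// |key        |statusCode|statusTimestamp    |
// +-----------+----------+-------------------+
// |AAAAAABBBBB|UV        |2019-06-29 00:00:00|
// |AAAAAABBBBB|OA        |2019-07-01 00:00:00|
// +-----------+----------+-------------------+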