Question

I'm trying to extend the results of a previous question, but I can't figure out how to achieve my new goal.

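Previously, I wanted to key on either a flag match or a string match. Now I want to create a unique grouping key for each run that begins either where the flag becomes true or at the first string match preceding a run of true flag values.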

Here is some sample data:

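import org.apache.spark.sql.functions._  // monotonically_increasing_id, udf, when, lead, last
import spark.implicits._                 // $"..." column syntax (already in scope in spark-shell)

val msgList = List("b", "f")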

val df = spark.createDataFrame(Seq(("a", false), ("b", false), ("c", false), ("b", false), ("c", true), ("d", false), ("e", true), ("f", true), ("g", false)))
              .toDF("message", "flag")
              .withColumn("index", monotonically_increasing_id)

df.show

+-------+-----+-----+
|message| flag|index|
+-------+-----+-----+
|      a|false|    0|
|      b|false|    1|
|      c|false|    2|
|      b|false|    3|
|      c| true|    4|
|      d|false|    5|
|      e| true|    6|
|      f| true|    7|
|      g|false|    8|
+-------+-----+-----+

The desired output is something equivalent to either key1 or key2:

+-------+-----+-----+-----+-----+
|message| flag|index| key1| key2|
+-------+-----+-----+-----+-----+
|      a|false|    0|    0| null|
|      b|false|    1|    1|    1|
|      c|false|    2|    1|    1|
|      b|false|    3|    1|    1|
|      c| true|    4|    1|    1|
|      d|false|    5|    2| null|
|      e| true|    6|    3|    2|
|      f| true|    7|    3|    2|
|      g|false|    8|    4| null|
+-------+-----+-----+-----+-----+
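
To pin down the rule behind key1 and key2, here is a rough sketch in plain Scala collections (no Spark, so it is not a solution to the question itself). It reflects one reading of the rule: a message match or a true flag opens a group, and the group stays open until the run of true flags that follows it ends. The names Row, Keyed, RunKeys and assign are hypothetical, and the choice to leave a group open when a match is never followed by a true flag is an assumption the sample data does not cover.

```scala
final case class Row(message: String, flag: Boolean)
final case class Keyed(row: Row, key1: Int, key2: Option[Int])

object RunKeys {
  val msgList = List("b", "f")

  // Same test as the checkMsg UDF in the question.
  def matches(s: String): Boolean = s != null && msgList.exists(s.contains(_))

  // key1/key2: last keys issued; open: a group is in progress;
  // seenFlag: the open group has already reached its run of true flags.
  private final case class State(
      key1: Int, key2: Int, open: Boolean, seenFlag: Boolean, out: Vector[Keyed])

  def assign(rows: Seq[Row]): Vector[Keyed] = {
    val init = State(key1 = -1, key2 = 0, open = false, seenFlag = false, out = Vector.empty)
    rows.foldLeft(init) { (st, r) =>
      // An open group closes on the first false flag after its true-flag run.
      val stillOpen = st.open && !(st.seenFlag && !r.flag)
      if (stillOpen) {
        // The row joins the current group and reuses its keys.
        st.copy(
          seenFlag = st.seenFlag || r.flag,
          out = st.out :+ Keyed(r, st.key1, Some(st.key2)))
      } else if (r.flag || matches(r.message)) {
        // A match or a true flag starts a new group.
        State(
          key1 = st.key1 + 1, key2 = st.key2 + 1, open = true, seenFlag = r.flag,
          out = st.out :+ Keyed(r, st.key1 + 1, Some(st.key2 + 1)))
      } else {
        // Ungrouped row: its own key1, no key2.
        State(
          key1 = st.key1 + 1, key2 = st.key2, open = false, seenFlag = false,
          out = st.out :+ Keyed(r, st.key1 + 1, None))
      }
    }.out
  }
}
```

Calling RunKeys.assign on the nine sample rows in index order should reproduce the key1 and key2 columns shown above. The fold is strictly sequential, which is exactly what makes this awkward to express with window functions.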

From the answer to my previous question, I already have a precursor:

import org.apache.spark.sql.expressions.Window

val checkMsg = udf { (s: String) => s != null && msgList.exists(s.contains(_)) }

val df2 = df.withColumn("message_match", checkMsg($"message"))
            .withColumn("match_or_flag", when($"message_match" || $"flag", 1).otherwise(0))
            .withColumn("lead", lead("match_or_flag", -1, 1).over(Window.orderBy("index")))
            .withColumn("switched", when($"match_or_flag" =!= $"lead", $"index"))
            .withColumn("base_key", last("switched", ignoreNulls = true).over(Window.orderBy("index").rowsBetween(Window.unboundedPreceding, 0)))

df2.show

+-------+-----+-----+-------------+-------------+----+--------+--------+
|message| flag|index|message_match|match_or_flag|lead|switched|base_key|
+-------+-----+-----+-------------+-------------+----+--------+--------+
|      a|false|    0|        false|            0|   1|       0|       0|
|      b|false|    1|         true|            1|   0|       1|       1|
|      c|false|    2|        false|            0|   1|       2|       2|
|      b|false|    3|         true|            1|   0|       3|       3|
|      c| true|    4|        false|            1|   1|    null|       3|
|      d|false|    5|        false|            0|   1|       5|       5|
|      e| true|    6|        false|            1|   0|       6|       6|
|      f| true|    7|         true|            1|   1|    null|       6|
|      g|false|    8|        false|            0|   1|       8|       8|
+-------+-----+-----+-------------+-------------+----+--------+--------+

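base_key here is somewhat close to key1, but it assigns separate keys to row 1 and to rows 3-4. I would like rows 1-4 to get a single key, based on the fact that row 1 contains the first match in msgList within or preceding a run of flag = true values.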

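Looking at the Spark window function API, it seems there might be some way to use rangeBetween to accomplish this as of Spark 2.3.0, but the documentation is sparse enough that I haven't been able to figure out how to make it work.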


0 answers: