Question

Suppose you have the following DataFrame:

+---+----------+----------+
| id|    date_a|    date_b|
+---+----------+----------+
|  1|2020-01-30|2020-01-19|
|  1|2020-01-10|2020-01-19|
|  1|2020-01-10|2020-01-26|
|  1|2020-01-30|2020-01-26|
|  2|2020-01-05|2020-01-08|
|  3|2020-01-08|2020-01-10|
|  3|2020-01-12|2020-01-10|
+---+----------+----------+

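For each id there are different combinations of date_a and date_b values.

I want to keep only the entries where, within a single id, date_b lies outside a fixed time range around every date_a.

A visual explanation for id = 1 looks like this (the horizontal axis is time):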

| --- x --- | o | -o--x --- |

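where x = date_a, o = date_b, and | --- --- | denotes the time range (i.e. +-5 days).
So an "o" (date_b) entry should be kept only if it does not fall inside the time range of any date_a (here, the first "o").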
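The rule in the diagram can be sketched in plain Python, without Spark. The helper name `outside_all_windows` and its `days_before` / `days_after` parameters are hypothetical, and the window is assumed to be inclusive at both ends:

```python
from datetime import date, timedelta

def outside_all_windows(date_b, dates_a, days_before=5, days_after=5):
    """Return True if date_b is outside [a - days_before, a + days_after] for every a."""
    return all(
        not (a - timedelta(days=days_before) <= date_b <= a + timedelta(days=days_after))
        for a in dates_a
    )

# The two date_a values ("x") for id = 1 in the sample data.
dates_a = [date(2020, 1, 10), date(2020, 1, 30)]

# First "o": 9 days after one x and 11 days before the other, so outside both ranges.
first_o = outside_all_windows(date(2020, 1, 19), dates_a)
# Second "o": only 4 days before the second x, so inside its range.
second_o = outside_all_windows(date(2020, 1, 26), dates_a)

print(first_o, second_o)
```

Only the first "o" passes the check, which matches the diagram.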

Example input/output:

Input:

from pyspark.sql import functions as F

# Assumes an active SparkSession named `spark` (e.g. the pyspark shell).
df = spark.createDataFrame(
    [(1, '2020-01-10', '2020-01-19'),
     (1, '2020-01-10', '2020-01-26'),
     (1, '2020-01-30', '2020-01-19'),
     (1, '2020-01-30', '2020-01-26'),
     (2, '2020-01-05', '2020-01-08'),
     (3, '2020-01-08', '2020-01-10'),
     (3, '2020-01-12', '2020-01-10')],
    ['id', 'date_a', 'date_b']
)

# Parse the strings as dates, then compute date_b - date_a in days.
df = df.withColumn('date_a', F.to_date('date_a'))
df = df.withColumn('date_b', F.to_date('date_b'))
df = df.withColumn('diff', F.datediff(df.date_b, df.date_a))
df.orderBy('id', 'date_b').show()

+---+----------+----------+----+
| id|    date_a|    date_b|diff|
+---+----------+----------+----+
|  1|2020-01-30|2020-01-19| -11|
|  1|2020-01-10|2020-01-19|   9|
|  1|2020-01-30|2020-01-26|  -4|
|  1|2020-01-10|2020-01-26|  16|
|  2|2020-01-05|2020-01-08|   3|
|  3|2020-01-08|2020-01-10|   2|
|  3|2020-01-12|2020-01-10|  -2|
+---+----------+----------+----+

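Within the same id, for all rows that share the same date_b, we want to keep the rows only if diff is >5 or <-6 on every one of them (i.e. date_b is outside the interval [date_a - 6, date_a + 5]).
That is:
For id=1, date_b='2020-01-19': (-11 > 5 | -11 < -6) & (9 > 5 | 9 < -6) -> the entry is kept (True & True)
For id=1, date_b='2020-01-26': (-4 > 5 | -4 < -6) & (16 > 5 | 16 < -6) -> the entry is dropped (False & True)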
...
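As a cross-check of this rule on the whole sample, here is a minimal plain-Python sketch (no Spark). It groups the rows by (id, date_b) and keeps a group only when every diff in it satisfies diff > 5 or diff < -6. The names `rows`, `is_outside`, `flags` and `kept` are illustrative only:

```python
from collections import defaultdict
from datetime import date

# Same sample rows as in the question: (id, date_a, date_b).
rows = [
    (1, date(2020, 1, 10), date(2020, 1, 19)),
    (1, date(2020, 1, 10), date(2020, 1, 26)),
    (1, date(2020, 1, 30), date(2020, 1, 19)),
    (1, date(2020, 1, 30), date(2020, 1, 26)),
    (2, date(2020, 1, 5), date(2020, 1, 8)),
    (3, date(2020, 1, 8), date(2020, 1, 10)),
    (3, date(2020, 1, 12), date(2020, 1, 10)),
]

def is_outside(diff):
    """The per-row condition from the question: diff > 5 or diff < -6."""
    return diff > 5 or diff < -6

# Collect one boolean per row, grouped by (id, date_b).
flags = defaultdict(list)
for id_, date_a, date_b in rows:
    flags[(id_, date_b)].append(is_outside((date_b - date_a).days))

# Keep a (id, date_b) group only if the condition holds for all of its rows.
kept = sorted(key for key, group in flags.items() if all(group))
print(kept)
```

On the sample data the only surviving group is id=1 with date_b 2020-01-19.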

Expected output:

id=1, date_b='2020-01-19'

Answer 1

Here is one possible approach you can try (comments inline):

from pyspark.sql import Window, functions as F

# One window per (id, date_b) group
w = Window.partitionBy("id", "date_b").orderBy("id")
cond = (F.col("diff") > 5) | (F.col("diff") < -6)

# Count how many rows in the window satisfy the condition
sum_of_true_on_w = F.sum(cond.cast("Integer")).over(w)

# Get the window size to compare with the sum; there might be a better way to get the size
size_of_window = F.max(F.row_number().over(w)).over(w)

# Keep only the rows whose window has the condition true on every row
(df.withColumn("Sum_bool", sum_of_true_on_w)
   .withColumn("Window_Size", size_of_window)
   .filter(F.col("Sum_bool") == F.col("Window_Size"))
   .drop("diff", "Sum_bool", "Window_Size")).show()

+---+----------+----------+
| id|    date_a|    date_b|
+---+----------+----------+
|  1|2020-01-10|2020-01-19|
|  1|2020-01-30|2020-01-19|
+---+----------+----------+

Filtering a Spark window by comparing a single row element against all rows of the window
