Compare the previous two rows with the current and next row using BigQuery

Time: 2019-08-27 12:26:29

Tags: google-bigquery bigquery-standard-sql

I have data that looks like this:

rno id day  val
0   1   1   7
1   1   2   5
2   1   3   10
3   1   4   10
4   1   5   11
5   1   6   11
6   1   7   14
7   1   8   14
20  2   1   5
21  2   2   7
22  2   3   8
23  2   4   8
24  2   5   9
25  2   6   9
26  2   7   13
27  2   8   13
28  2   9   15
29  2   10  15

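I want to create a new column, fake_flag, and populate it with the value fake_val according to the following two rules.

Rule 1 - For each value (n), check whether the previous two rows (n-2, n-1) are constant or decreasing (e.g. 7,5 or 5,5 is valid, whereas 5,7 is invalid because it is increasing and not constant), and take the maximum of the two as the output. For 7,5 the output is 7. For 5,5 the output is 5.

Rule 2 - Check whether the current value (n) and the next value (n+1) are both at least 3 points (>= 3) greater than the maximum output by Rule 1. For example, if the output of Rule 1 is 5, then we expect to see at least 8 (n), 8 (n+1). It could also be 9,9 or 10,10.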
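To make the two rules concrete, here is a minimal Python sketch that applies them to the sample data. It is only an illustration for cross-checking query results, not part of any BigQuery solution. The names `ROWS`, `flag_rows`, and `min_jump` are hypothetical, and the sketch assumes the rules are evaluated separately for each `id`, in `day` order. It follows Rule 2 as worded (both values at least 3 above the Rule 1 maximum) and does not additionally require `n` and `n+1` to be equal.

```python
from itertools import groupby

# Sample rows from the question: (rno, id, day, val)
ROWS = [
    (0, 1, 1, 7), (1, 1, 2, 5), (2, 1, 3, 10), (3, 1, 4, 10),
    (4, 1, 5, 11), (5, 1, 6, 11), (6, 1, 7, 14), (7, 1, 8, 14),
    (20, 2, 1, 5), (21, 2, 2, 7), (22, 2, 3, 8), (23, 2, 4, 8),
    (24, 2, 5, 9), (25, 2, 6, 9), (26, 2, 7, 13), (27, 2, 8, 13),
    (28, 2, 9, 15), (29, 2, 10, 15),
]


def flag_rows(rows, min_jump=3):
    """Return the rno of every row that satisfies both rules.

    Assumption: rules are applied per id, with rows ordered by day.
    """
    flagged = []
    ordered = sorted(rows, key=lambda r: (r[1], r[2]))
    for _, group in groupby(ordered, key=lambda r: r[1]):
        g = list(group)
        vals = [r[3] for r in g]
        # A row needs two predecessors and one successor to be eligible.
        for n in range(2, len(vals) - 1):
            prev2, prev1 = vals[n - 2], vals[n - 1]
            if prev1 > prev2:
                continue  # Rule 1 fails: the preceding pair is increasing
            threshold = max(prev2, prev1) + min_jump
            # Rule 2: both n and n+1 reach the threshold
            if vals[n] >= threshold and vals[n + 1] >= threshold:
                flagged.append(g[n][0])
    return flagged


print(flag_rows(ROWS))  # [2, 6, 26]
```

The flagged `rno` values (2, 6, and 26) correspond to the rows with `day` 3 and 7 for `id` 1 and `day` 7 for `id` 2.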

I want my output to look like this:

rno id day  val fake_flag
0   1   1   7
1   1   2   5
2   1   3   10  fake_val  # >= 3 above the max of the preceding 2 rows, and `n` and `n+1` are the same
3   1   4   10
4   1   5   11
5   1   6   11
6   1   7   14  fake_val  # >= 3 above the max of the preceding 2 rows, and `n` and `n+1` are the same
7   1   8   14
20  2   1   5
21  2   2   7
22  2   3   8
23  2   4   8
24  2   5   9
25  2   6   9
26  2   7   13  fake_val  # >= 3 above the max of the preceding 2 rows, and `n` and `n+1` are the same
27  2   8   13
28  2   9   15
29  2   10  15

2 answers:

Answer 0: (score: 2)

This should do what you want. I tested it with dummy data, but if I have misunderstood something, let me know and I can revise it.

SELECT *
, CASE WHEN
  -- Rule 1: rows n-2 and n-1 are constant or decreasing
  (LAG(val, 1) OVER w <= LAG(val, 2) OVER w) AND
  (val = LEAD(val, 1) OVER w) AND -- n = n + 1, part of rule 2
  -- Can assume row n-2 is the max because it will either be the same as row n-1 or greater than row n-1 for rule 1 to be satisfied
  (val >= LAG(val, 2) OVER w + 3) -- Only have to check current row val because for first part of rule 2 to be satisfied val for row n must equal val for row n + 1
  THEN 'fake_val' -- I would just have a 1 representing it is true and then 0 if not, but up to you
  ELSE NULL
  END AS fake_flag
FROM Dataset.Table_name
-- Partition by id so rows from different ids are never compared.
-- No ROWS frame is needed: LAG and LEAD address rows by offset.
WINDOW w AS (PARTITION BY id ORDER BY rno)

Answer 1: (score: 1)

The following is for BigQuery Standard SQL:

#standardSQL
SELECT rno, id, day, val, 
  IF(IFNULL(val_prev2 >= val_prev1, FALSE)                  -- rule 1: constant or decreasing
    AND ( 
      (val - GREATEST(val_prev2, val_prev1) >= 3)           -- rule 2 for val(n)
      AND (val_next - GREATEST(val_prev2, val_prev1) >= 3)  -- rule 2 for val(n+1)
    ), 
    'fake_val', ''
  ) AS fake_flag
FROM (
  SELECT *,
    LAG(val) OVER(PARTITION BY id ORDER BY day) val_prev1,
    LAG(val, 2) OVER(PARTITION BY id ORDER BY day) val_prev2,
    LEAD(val) OVER(PARTITION BY id ORDER BY day) val_next
  FROM `project.dataset.table`
)

Applied to the sample data from your question, the result is:

Row rno id  day val fake_flag    
1   0   1   1   7        
2   1   1   2   5        
3   2   1   3   10  fake_val     
4   3   1   4   10       
5   4   1   5   11       
6   5   1   6   11       
7   6   1   7   14  fake_val     
8   7   1   8   14       
9   20  2   1   5        
10  21  2   2   7        
11  22  2   3   8        
12  23  2   4   8        
13  24  2   5   9        
14  25  2   6   9        
15  26  2   7   13  fake_val     
16  27  2   8   13       
17  28  2   9   15       
18  29  2   10  15