BigQuery: combining tables based on the closest timestamp and a matching value

Time: 2016-11-03 23:45:15

Tags: mysql sql google-bigquery

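I have two tables. For each row of table numberTwo, I need to get the hint from table numberOne, taken from the rows that have the same cod value and, among those, from the one whose time1 is closest to time2.

To make it easier to understand, this is what I need to do:

Table numberOne: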

|  id |  cod  |   hint  |           time1         |
---------------------------------------------------
|  1  |  ABC  |    V    | 2016-11-03 18:00:00 UTC |
|  2  |  ABC  |    W    | 2016-11-03 12:00:00 UTC |
|  3  |  CDE  |    X    | 2016-11-03 19:00:00 UTC |
|  4  |  CDE  |    Y    | 2016-11-03 19:30:00 UTC |
|  5  |  EFG  |    Z    | 2016-11-03 18:00:00 UTC |

Table numberTwo:

|  id |  cod  |   value  |         time2           |
----------------------------------------------------
|  1  |  ABC  |   xyz2   | 2016-11-03 18:20:00 UTC |
|  2  |  ABC  |   h323   | 2016-11-03 11:30:00 UTC |
|  3  |  ABC  |   rewq   | 2016-11-03 09:00:00 UTC |
|  4  |  CDE  |   abce   | 2016-11-03 19:10:00 UTC |

So, for row 1 of table numberTwo, I would get all rows in table numberOne with cod ABC:

|  1  |  ABC  |    V    | 2016-11-03 18:00:00 UTC |
|  2  |  ABC  |    W    | 2016-11-03 12:00:00 UTC |

Among these, I would take the one whose timestamp is closest to time2:

|  1  |  ABC  |    V    | 2016-11-03 18:00:00 UTC |

After processing every row, I would have a table like this:

Desired table:

|  id |  cod  |   hint  |   value  |         time2           |
--------------------------------------------------------------
|  1  |  ABC  |    V    |   xyz2   | 2016-11-03 18:20:00 UTC |
|  2  |  ABC  |    W    |   h323   | 2016-11-03 11:30:00 UTC |
|  3  |  ABC  |    W    |   rewq   | 2016-11-03 09:00:00 UTC |
|  4  |  CDE  |    X    |   abce   | 2016-11-03 19:10:00 UTC |

2 answers:

Answer 0: (score: 2)

For BigQuery Standard SQL, try the query below.

To test it quickly, uncomment the commented-out block, which contains the sample data.

WITH 
/*    
TableNumberOne AS (
  SELECT 1 AS id, 'ABC' AS cod, 'V' AS hint, TIMESTAMP '2016-11-03 18:00:00 UTC' AS time1 UNION ALL
  SELECT 2 AS id, 'ABC' AS cod, 'W' AS hint, TIMESTAMP '2016-11-03 12:00:00 UTC' AS time1 UNION ALL
  SELECT 3 AS id, 'CDE' AS cod, 'X' AS hint, TIMESTAMP '2016-11-03 19:00:00 UTC' AS time1 UNION ALL
  SELECT 4 AS id, 'CDE' AS cod, 'Y' AS hint, TIMESTAMP '2016-11-03 19:30:00 UTC' AS time1 UNION ALL
  SELECT 5 AS id, 'EFG' AS cod, 'Z' AS hint, TIMESTAMP '2016-11-03 18:00:00 UTC' AS time1 
),
TableNumberTwo AS (
  SELECT 1 AS id, 'ABC' AS cod, 'xyz2' AS value, TIMESTAMP '2016-11-03 18:20:00 UTC' AS time2 UNION ALL
  SELECT 2 AS id, 'ABC' AS cod, 'h323' AS value, TIMESTAMP '2016-11-03 11:30:00 UTC' AS time2 UNION ALL
  SELECT 3 AS id, 'ABC' AS cod, 'rewq' AS value, TIMESTAMP '2016-11-03 09:00:00 UTC' AS time2 UNION ALL
  SELECT 4 AS id, 'CDE' AS cod, 'abce' AS value, TIMESTAMP '2016-11-03 19:10:00 UTC' AS time2 
),
*/
tempTable AS (
  SELECT 
    t2.id, t2.cod, t2.value, t2.time2, t1.hint, 
    ROW_NUMBER() OVER(PARTITION BY t2.id, t2.cod, t2.value 
                      ORDER BY ABS(TIMESTAMP_DIFF(t2.time2, t1.time1, SECOND))) AS win
  FROM TableNumberTwo AS t2
  JOIN TableNumberOne AS t1
  ON t1.cod = t2.cod
)
SELECT id, cod, hint, value, time2
FROM tempTable
WHERE win = 1

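Answer 1: (score: 0)

  "Is there any other approach? Because if I use a left join (as included in the other question), the billing tier of 68 is basically infinite (requires 4628414464 or higher) and it keeps going up. I am not able to run the query."

I see a few areas to look at: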

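1 - ABS(TIMESTAMP_DIFF(t2.time2, t1.time1, SECOND)): this function runs for every permutation produced by the join. You may want to try converting the respective time field of each table to seconds in a separate sub-select, and then use those sub-selects instead of the original tables, so that you only need to do something like ABS(t2.time2inSeconds - t1.time1inSeconds).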
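
A minimal sketch of suggestion 1, assuming UNIX_SECONDS() is an acceptable way to turn each timestamp into seconds. The sub-select names t1s and t2s are hypothetical, the sample rows are copied from the question, and whether this actually lowers the billing tier is untested:

```sql
WITH
-- Sample rows from the question; point these at the real tables instead.
TableNumberOne AS (
  SELECT 1 AS id, 'ABC' AS cod, 'V' AS hint, TIMESTAMP '2016-11-03 18:00:00 UTC' AS time1 UNION ALL
  SELECT 2 AS id, 'ABC' AS cod, 'W' AS hint, TIMESTAMP '2016-11-03 12:00:00 UTC' AS time1 UNION ALL
  SELECT 3 AS id, 'CDE' AS cod, 'X' AS hint, TIMESTAMP '2016-11-03 19:00:00 UTC' AS time1 UNION ALL
  SELECT 4 AS id, 'CDE' AS cod, 'Y' AS hint, TIMESTAMP '2016-11-03 19:30:00 UTC' AS time1 UNION ALL
  SELECT 5 AS id, 'EFG' AS cod, 'Z' AS hint, TIMESTAMP '2016-11-03 18:00:00 UTC' AS time1
),
TableNumberTwo AS (
  SELECT 1 AS id, 'ABC' AS cod, 'xyz2' AS value, TIMESTAMP '2016-11-03 18:20:00 UTC' AS time2 UNION ALL
  SELECT 2 AS id, 'ABC' AS cod, 'h323' AS value, TIMESTAMP '2016-11-03 11:30:00 UTC' AS time2 UNION ALL
  SELECT 3 AS id, 'ABC' AS cod, 'rewq' AS value, TIMESTAMP '2016-11-03 09:00:00 UTC' AS time2 UNION ALL
  SELECT 4 AS id, 'CDE' AS cod, 'abce' AS value, TIMESTAMP '2016-11-03 19:10:00 UTC' AS time2
),
-- Hypothetical sub-selects: convert each timestamp to epoch seconds once per row,
-- before the join, instead of calling TIMESTAMP_DIFF for every joined pair.
t1s AS (
  SELECT cod, hint, UNIX_SECONDS(time1) AS time1inSeconds
  FROM TableNumberOne
),
t2s AS (
  SELECT id, cod, value, time2, UNIX_SECONDS(time2) AS time2inSeconds
  FROM TableNumberTwo
),
tempTable AS (
  SELECT
    t2.id, t2.cod, t2.value, t2.time2, t1.hint,
    -- Rank the candidate rows of each TableNumberTwo row by absolute distance in seconds.
    ROW_NUMBER() OVER(PARTITION BY t2.id, t2.cod, t2.value
                      ORDER BY ABS(t2.time2inSeconds - t1.time1inSeconds)) AS win
  FROM t2s AS t2
  JOIN t1s AS t1
  ON t1.cod = t2.cod
)
SELECT id, cod, hint, value, time2
FROM tempTable
WHERE win = 1
```

The join still produces every pair of rows that share a cod, so this only removes the per-pair timestamp arithmetic, not the join fan-out itself.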

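2 - The use of ROW_NUMBER() is another potential source of expense. See the query below, where I tried to rewrite the logic completely. It is a very blind attempt, though, because I cannot test it to see whether it really fixes or improves anything. It would be great if you could try it and let me know the result (billing tier).
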
WITH 
/*    
TableNumberOne AS (
  SELECT 1 AS id, 'ABC' AS cod, 'V' AS hint, TIMESTAMP '2016-11-03 18:00:00 UTC' AS time1 UNION ALL
  SELECT 2 AS id, 'ABC' AS cod, 'W' AS hint, TIMESTAMP '2016-11-03 12:00:00 UTC' AS time1 UNION ALL
  SELECT 3 AS id, 'CDE' AS cod, 'X' AS hint, TIMESTAMP '2016-11-03 19:00:00 UTC' AS time1 UNION ALL
  SELECT 4 AS id, 'CDE' AS cod, 'Y' AS hint, TIMESTAMP '2016-11-03 19:30:00 UTC' AS time1 UNION ALL
  SELECT 5 AS id, 'EFG' AS cod, 'Z' AS hint, TIMESTAMP '2016-11-03 18:00:00 UTC' AS time1 
),
TableNumberTwo AS (
  SELECT 1 AS id, 'ABC' AS cod, 'xyz2' AS value, TIMESTAMP '2016-11-03 18:20:00 UTC' AS time2 UNION ALL
  SELECT 2 AS id, 'ABC' AS cod, 'h323' AS value, TIMESTAMP '2016-11-03 11:30:00 UTC' AS time2 UNION ALL
  SELECT 3 AS id, 'ABC' AS cod, 'rewq' AS value, TIMESTAMP '2016-11-03 09:00:00 UTC' AS time2 UNION ALL
  SELECT 4 AS id, 'CDE' AS cod, 'abce' AS value, TIMESTAMP '2016-11-03 19:10:00 UTC' AS time2 
),
*/
tempTable1 AS (
  SELECT 
    t2.id, t2.cod, t2.value, 
    MIN(ABS(TIMESTAMP_DIFF(t2.time2, t1.time1, SECOND))) AS delta 
  FROM TableNumberTwo AS t2
  JOIN TableNumberOne AS t1
  ON t1.cod = t2.cod
  GROUP BY t2.id, t2.cod, t2.value
),
tempTable2 AS (
  SELECT a.id, a.cod, a.value, a.time2, b.delta
  FROM TableNumberTwo AS a 
  JOIN tempTable1 AS b 
  ON a.id = b.id AND a.cod = b.cod AND a.value = b.value
)
SELECT a.id, a.cod, t1.hint, a.value, a.time2
FROM tempTable2 AS a
JOIN TableNumberOne AS t1
ON t1.cod = a.cod AND ABS(TIMESTAMP_DIFF(a.time2, t1.time1, SECOND)) = delta   

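Note: the query above is not yet complete, because it can return several matching rows from TableNumberOne when they are equally close to the respective row in TableNumberTwo. For now, it is only meant to verify whether the cost issue is at least fixed or improved.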
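
One possible way to handle the tie case, sketched under assumptions rather than taken from the answer: aggregate the candidate hints per TableNumberTwo row with ARRAY_AGG(... ORDER BY ... LIMIT 1) and break ties on the lowest t1.id. The tie-break rule is an assumption, the sample rows are copied from the question, and the cost effect is untested:

```sql
WITH
-- Sample rows from the question; point these at the real tables instead.
TableNumberOne AS (
  SELECT 1 AS id, 'ABC' AS cod, 'V' AS hint, TIMESTAMP '2016-11-03 18:00:00 UTC' AS time1 UNION ALL
  SELECT 2 AS id, 'ABC' AS cod, 'W' AS hint, TIMESTAMP '2016-11-03 12:00:00 UTC' AS time1 UNION ALL
  SELECT 3 AS id, 'CDE' AS cod, 'X' AS hint, TIMESTAMP '2016-11-03 19:00:00 UTC' AS time1 UNION ALL
  SELECT 4 AS id, 'CDE' AS cod, 'Y' AS hint, TIMESTAMP '2016-11-03 19:30:00 UTC' AS time1 UNION ALL
  SELECT 5 AS id, 'EFG' AS cod, 'Z' AS hint, TIMESTAMP '2016-11-03 18:00:00 UTC' AS time1
),
TableNumberTwo AS (
  SELECT 1 AS id, 'ABC' AS cod, 'xyz2' AS value, TIMESTAMP '2016-11-03 18:20:00 UTC' AS time2 UNION ALL
  SELECT 2 AS id, 'ABC' AS cod, 'h323' AS value, TIMESTAMP '2016-11-03 11:30:00 UTC' AS time2 UNION ALL
  SELECT 3 AS id, 'ABC' AS cod, 'rewq' AS value, TIMESTAMP '2016-11-03 09:00:00 UTC' AS time2 UNION ALL
  SELECT 4 AS id, 'CDE' AS cod, 'abce' AS value, TIMESTAMP '2016-11-03 19:10:00 UTC' AS time2
)
SELECT
  t2.id,
  t2.cod,
  -- Keep exactly one hint per TableNumberTwo row: the closest in time,
  -- with the lowest t1.id winning when two rows are equally close (assumed rule).
  ARRAY_AGG(t1.hint
            ORDER BY ABS(TIMESTAMP_DIFF(t2.time2, t1.time1, SECOND)), t1.id
            LIMIT 1)[OFFSET(0)] AS hint,
  t2.value,
  t2.time2
FROM TableNumberTwo AS t2
JOIN TableNumberOne AS t1
ON t1.cod = t2.cod
GROUP BY t2.id, t2.cod, t2.value, t2.time2
```

The sample rows contain no ties, so on them this should return the same four rows as the desired table in the question.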

3 - By the way, the cause might also be skew in your data, etc.
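
A rough way to check for skew, offered as a sketch and not as part of the original answer: count the rows per cod in each table and multiply the counts, which gives the number of pairs the join on cod produces for that value. The column names n1, n2 and joined_rows are hypothetical, and the sample rows are copied from the question:

```sql
WITH
-- Sample rows from the question; point these at the real tables instead.
TableNumberOne AS (
  SELECT 1 AS id, 'ABC' AS cod, 'V' AS hint, TIMESTAMP '2016-11-03 18:00:00 UTC' AS time1 UNION ALL
  SELECT 2 AS id, 'ABC' AS cod, 'W' AS hint, TIMESTAMP '2016-11-03 12:00:00 UTC' AS time1 UNION ALL
  SELECT 3 AS id, 'CDE' AS cod, 'X' AS hint, TIMESTAMP '2016-11-03 19:00:00 UTC' AS time1 UNION ALL
  SELECT 4 AS id, 'CDE' AS cod, 'Y' AS hint, TIMESTAMP '2016-11-03 19:30:00 UTC' AS time1 UNION ALL
  SELECT 5 AS id, 'EFG' AS cod, 'Z' AS hint, TIMESTAMP '2016-11-03 18:00:00 UTC' AS time1
),
TableNumberTwo AS (
  SELECT 1 AS id, 'ABC' AS cod, 'xyz2' AS value, TIMESTAMP '2016-11-03 18:20:00 UTC' AS time2 UNION ALL
  SELECT 2 AS id, 'ABC' AS cod, 'h323' AS value, TIMESTAMP '2016-11-03 11:30:00 UTC' AS time2 UNION ALL
  SELECT 3 AS id, 'ABC' AS cod, 'rewq' AS value, TIMESTAMP '2016-11-03 09:00:00 UTC' AS time2 UNION ALL
  SELECT 4 AS id, 'CDE' AS cod, 'abce' AS value, TIMESTAMP '2016-11-03 19:10:00 UTC' AS time2
)
SELECT
  c2.cod,
  c1.n1,                      -- rows with this cod in TableNumberOne
  c2.n2,                      -- rows with this cod in TableNumberTwo
  c1.n1 * c2.n2 AS joined_rows -- pairs the join on cod produces for this cod
FROM (SELECT cod, COUNT(*) AS n2 FROM TableNumberTwo GROUP BY cod) AS c2
JOIN (SELECT cod, COUNT(*) AS n1 FROM TableNumberOne GROUP BY cod) AS c1
ON c1.cod = c2.cod
ORDER BY joined_rows DESC
```

On the sample rows, ABC yields 2 x 3 = 6 joined rows and CDE yields 2 x 1 = 2. On real data, a handful of cod values with very large products would point to skew as a likely contributor to the cost.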